openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Git submodules cause problems #456

Closed. mlandauer closed this issue 9 years ago.

mlandauer commented 10 years ago

https://github.com/spendright-scrapers/companies

Errno::ENOENT - No such file or directory - /var/www/releases/20140920073127/db/scrapers/repos/spendright-scrapers/companies/srs

Backtrace

line 231 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: read
line 231 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: block in all_hash
line 224 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: all_hash
line 216 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: all_config_hash
line 192 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: all_config_hash_with_defaults
line 130 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: tar_config_files
line 84 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: compile
line 93 of [PROJECT_ROOT]/lib/morph/container_compiler.rb: compile_and_run_with_buildpacks
line 115 of [PROJECT_ROOT]/app/models/run.rb: go_with_logging
line 175 of [PROJECT_ROOT]/app/models/run.rb: go!
line 185 of [PROJECT_ROOT]/app/models/run.rb: synch_and_go!
line 5 of [PROJECT_ROOT]/app/workers/run_worker.rb: perform

View full backtrace and more info at honeybadger.io

spendright-scrapers commented 10 years ago

I'm actually still having a problem with submodules now that I'm no longer in experimental buildpack mode.

spendright-scrapers commented 10 years ago

Thought it might be sort of a caching issue, so I made a trivial change to bump the git revision number. No luck; still no submodules.

henare commented 10 years ago

Yeah, this isn't right - we use submodules on PlanningAlerts scrapers.

They don't store stuff in subdirectories, though, so maybe try flattening your submodule repository.

spendright-scrapers commented 10 years ago

Can you please point me to a scraper that uses submodules and works correctly? (There are lots of PlanningAlerts scrapers.)

henare commented 10 years ago

https://github.com/planningalerts-scrapers/uralla

They all run with buildpacks too.

henare commented 10 years ago

Oh duh, that builds a Rubygem and doesn't use submodules. Ignore me!

henare commented 10 years ago

Here we go: https://github.com/planningalerts-scrapers/sutherland

henare commented 10 years ago

I think it might be the symlink in your repo. Why not just put the submodule there instead of the symlink?

spendright-scrapers commented 10 years ago

Mostly because the subdir makes it easier to pack up the submodule as a package. Worth a try though; doing this now.

spendright-scrapers commented 10 years ago

Actually, I think the problem is that I was using the git@... URL for my submodule rather than the https:// one.

Now my scraper appears stalled, but that might have to do with complications related to changing the submodule URL (you don't do git submodule sync). Will try deleting the submodule and re-adding it.
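For anyone hitting the same thing: the submodule URL lives in .gitmodules, so switching from SSH to HTTPS is an edit there plus a sync. The path and name below are illustrative, not the exact layout of this repo:

```ini
# .gitmodules -- illustrative entry; the https:// form lets morph.io
# clone the submodule anonymously, unlike the git@ (SSH) form
[submodule "srs"]
	path = srs
	url = https://github.com/spendright-scrapers/srs.git
```

After changing the URL, running `git submodule sync` locally copies it into .git/config; as noted above, morph.io doesn't run that step itself, so a stale SSH URL can linger on its clone.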

spendright-scrapers commented 10 years ago

Oh, nah, seems to be an operational issue.

Oh well, going to bed (in California). Hopefully this will be working in the morning, and then maybe we can re-enable buildpacks? Thanks for all your help.

I'm @davidmarin, by the way (I created a separate account because I balked at giving morph.io write permission to all my repos).

henare commented 10 years ago

> Oh, nah, seems to be an operational issue.

:(

> Oh well, going to bed (in California). Hopefully this will be working in the morning, and then maybe we can re-enable buildpacks? Thanks for all your help.

Just ping me if you want buildpacks enabled.

> I'm @davidmarin, by the way (I created a separate account because I balked at giving morph.io write permission to all my repos).

Nice to meet you, David :)

spendright-scrapers commented 10 years ago

Okay, @henare, my campaigns and companies scrapers are now working again! I ultimately had to delete and re-create the scrapers; changing the URL of a git submodule isn't something that morph.io could handle (maybe running git submodule sync would help?).

Thanks for all your help! Would love it if you could re-enable buildpacks. :)

henare commented 10 years ago

> Thanks for all your help! Would love it if you could re-enable buildpacks. :)

Done!

spendright-scrapers commented 10 years ago

Sorry, still having problems that I don't have the access to figure out on my own. See #458.

spendright-scrapers commented 10 years ago

Looks like submodules indeed are a problem with buildpacks. I created a test scraper with a trivial script; it ran fine on its own, but when I added submodules (https://github.com/spendright-scrapers/test/tree/60d71b4910f337e0930b767a1313c93485b8dca5), it stalled forever.

spendright-scrapers commented 10 years ago

Also, deleting the submodule (in git) doesn't fix the problem; you have to delete the scraper and start over.

spendright-scrapers commented 10 years ago

Doesn't matter if the submodule is inside a subdirectory (submodules/) or in the top-level directory.

henare commented 10 years ago

Sorry I haven't been able to help, @spendright-scrapers; we're about to launch a new project so we're all flat out.

Let me know if you want me to disable buildpacks for you.

spendright-scrapers commented 10 years ago

That's okay, it looks like you can work around the submodule issue by using a git URL in requirements.txt; you just have to make your submodule work as a package (i.e. add setup.py).

See https://github.com/spendright-scrapers/test/tree/d20a2b6543d02d78d616ac34f6505070cd54b545 for an example.
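In outline, the workaround replaces the submodule with a pinned VCS requirement. The revision placeholder below is illustrative; see the linked repo for the real line:

```
# requirements.txt -- install the shared srs code straight from git
# instead of vendoring it as a submodule; requires a setup.py in srs
git+https://github.com/spendright-scrapers/srs.git@<revision>#egg=srs
```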

spendright-scrapers commented 10 years ago

Also confirmed that I can use runtime.txt to use Python 3.4.1 (though dumptruck isn't compatible with Python 3).
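As an aside, the runtime.txt the Heroku-style Python buildpack expects is a single line at the repo root naming the interpreter (the version shown is the one from this thread):

```
python-3.4.1
```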

spendright-scrapers commented 10 years ago

Actually, @henare, yes, please disable experimental buildpack support. My workaround doesn't seem to work in practice (tried it on my companies scraper and it still hangs) and without console support (#427), I can't even tell what's going wrong.

I'll make a note to check back in a couple months to see how buildpacks are doing. Would be nice to be able to write this in Python 3, but at the moment it's not practical.

Thanks!

henare commented 10 years ago

Done.

spendright-scrapers commented 10 years ago

Thank you!

coyotemarin commented 9 years ago

@henare, any progress on this? SpendRight is using Python 3 internally, and it would be nice to update our scrapers as well.

henare commented 9 years ago

@davidmarin this issue is not something I'm actively working on, sorry.

coyotemarin commented 9 years ago

No worries! Thanks for your honesty.

mlandauer commented 9 years ago

@davidmarin I've been doing some major work on the buildpack support recently including fixing a long-standing permissions issue which was probably at the root of some of the problems you experienced before when you switched to buildpacks.

This is all with the aim of getting buildpacks to the stage where we can switch everyone over to it seamlessly.

I'd be very keen on you trying buildpacks again with the incentive for you being that you can use Python 3, if you're up for it?

Let me know if you'd like me to reenable buildpacks for @spendright-scrapers

mlandauer commented 9 years ago

I'm going to close this issue. Please reopen it, or create a new issue, if you experience this problem again.

spendright-scrapers commented 9 years ago

@mlandauer, sure, let's give it a try! Would love to get my entire codebase on Python 3 eventually (though that will mean porting dumptruck as well).

Also, I've been wondering how you set up scrapers for an organization (e.g. openaustralia). Ideally, I'd like the main repo for these scrapers to be under the spendright organization on GitHub.

mlandauer commented 9 years ago

@spendright-scrapers I've enabled buildpacks on https://morph.io/spendright-scrapers so now all scrapers running under https://morph.io/spendright-scrapers will use buildpacks automatically.

To set up an organization, go to GitHub, create an organization there, and make yourself a public member of it.

The only hassle right now is that for morph.io to pick that up you might need to log out and log back in on morph. Then you should see your membership of that organization on your user page on morph: https://morph.io/spendright-scrapers.

Then, you can create scrapers under your organization in exactly the same way. Anyone that you make a public member of the organization will be able to do the same.

Hope that helps.

spendright-scrapers commented 9 years ago

@mlandauer Thanks!

Unfortunately, it looks like my scrapers still totally don't work with buildpacks (the symptom is that they can't import code from the srs submodule, I think same as last time). Can you please switch me back off buildpacks until the submodule issue is resolved?

The console does work a lot better with buildpacks now; nice!

mlandauer commented 9 years ago

@spendright-scrapers I'll switch off buildpacks for you right away. Could you possibly write some minimal test code that shows the problem you're seeing so I can reproduce it at my end?

mlandauer commented 9 years ago

@spendright-scrapers buildpack stuff is switched off for you now.

mlandauer commented 9 years ago

@spendright-scrapers Sorry, don't worry about the minimal test code for the moment. I'll start by forking your scraper spendright-scrapers/companies and checking that I can reproduce the problem.

mlandauer commented 9 years ago

@spendright-scrapers I can reproduce the problem locally so that's most of the battle over. Now to fix it!

mlandauer commented 9 years ago

Part of the issue here is that only certain files from your repo are injected into the scraper container during the "compile" phase, when morph parses requirements.txt and runtime.txt to install the libraries and the right version of Python.

So at this stage it doesn't have access to the submodule, just the requirements.txt file in the root of the repo.

This is what I tried.

First, installing srs directly from git by putting this in requirements.txt:

git+http://github.com/spendright-scrapers/srs.git@e3b09f6#egg=srs

This avoids the need for submodules.

But this ends up not installing the libraries listed in the requirements.txt file inside srs. My understanding of pip is a bit limited, so I'm not sure why that's happening.
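A hedged guess at why: pip only processes a requirements.txt that you pass it explicitly; for a package installed from a VCS URL it reads install_requires from setup.py instead, so dependencies listed only in srs's own requirements.txt are invisible to it. A minimal sketch of what srs's setup.py would need, with version numbers copied from the thread and everything else illustrative:

```python
# Hypothetical setup.py sketch for the srs package (an assumption,
# not the real file): pip resolves install_requires automatically
# when srs is installed from a git URL.
from setuptools import setup, find_packages

setup(
    name='srs',
    version='0.1.0',
    packages=find_packages(),  # also finds nested packages like srs.vendor
    install_requires=[
        'Unidecode==0.04.9',
        'beautifulsoup4==4.1.3',
        'requests==1.0.4',
    ],
)
```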

So then just to see what would happen I added the contents of the srs requirements file to the main requirements file to give:

git+http://github.com/spendright-scrapers/srs.git@e3b09f6#egg=srs
Unidecode==0.04.9
beautifulsoup4==4.1.3
dumptruck==0.1.6
html5lib==0.90  # used by beautifulsoup4
requests==1.0.4  # used by srs.vendor.reppy

This now gives the following error when you run the scraper:

Traceback (most recent call last):
  File "scraper.py", line 26, in <module>
    from srs.db import use_decimal_type_in_sqlite
  File "/app/.heroku/python/lib/python2.7/site-packages/srs/db.py", line 12, in <module>
    from .scrape import download
  File "/app/.heroku/python/lib/python2.7/site-packages/srs/scrape.py", line 16, in <module>
    from .vendor.reppy.cache import RobotsCache
ImportError: No module named vendor.reppy.cache

@spendright-scrapers @davidmarin What shall I try next?
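One possibility worth checking (an assumption on my part, not confirmed in the thread): if srs's setup.py lists its packages by hand, e.g. packages=['srs'], then nested packages such as srs.vendor.reppy are simply not installed, which yields exactly this ImportError. setuptools' find_packages() walks the tree and picks up every directory containing an __init__.py:

```python
# Demonstrate that find_packages() discovers nested packages, using a
# throwaway tree shaped like srs/vendor/reppy.
import os
import tempfile

from setuptools import find_packages

root = tempfile.mkdtemp()
for pkg in ('srs', 'srs/vendor', 'srs/vendor/reppy'):
    os.makedirs(os.path.join(root, pkg), exist_ok=True)
    # an __init__.py marks each directory as a package
    open(os.path.join(root, pkg, '__init__.py'), 'w').close()

print(sorted(find_packages(root)))
# → ['srs', 'srs.vendor', 'srs.vendor.reppy']
```

If the installed site-packages copy of srs lacks the vendor/ subtree, that would confirm this diagnosis.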

mlandauer commented 9 years ago

I've discovered another issue: symbolic links are not getting injected into the container when it injects the source of the repo. This is tracked in #562.

spendright-scrapers commented 9 years ago

Awesome, thanks, go for it!

I have my own poor-man's version of morph.io in production, so it's not the end of the world if my scrapers are broken on morph.io. I just don't have the spare cycles to poke at it and try to make it work right now.

mlandauer commented 9 years ago

I'm reopening this until we know it's definitely fixed.

mlandauer commented 9 years ago

Now that you've merged the PR, I've switched buildpacks back on for you @spendright-scrapers @davidmarin. Hopefully the scraper now works for you with buildpacks. If you do have another issue, don't hesitate to open another ticket here. Thanks!