istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

CentOS 7 Compatibility. Merged with dev branch. #67

Closed knirbhay closed 8 years ago

knirbhay commented 8 years ago

The Travis build is running and all tests are passing for both OSes (CentOS 7 + Ubuntu Trusty) inside Docker.

I am getting one error, due to which the build is failing in my fork:

You have to provide either repo_token in .coveralls.yml, or launch via Travis or CircleCI

It looks like https://coveralls.io/ requires all coveralls calls to come from the istresearch/scrapy-cluster repo itself rather than a fork.

Let me know if anything is required to do the merge.

madisonb commented 8 years ago

Interesting that the coveralls push is failing. It may be because you moved the coveralls command to run inside the Docker container instead of on the normal Travis machine. Coveralls has a built-in integration with Travis that the Docker container most likely does not.

We should probably copy the coverage report from the Docker container to the host file system, then run the coveralls command from there.
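
Something like this, reusing the container id file from the build scripts (the in-container report path is illustrative):

# copy the report out of the container, then submit from the Travis VM,
# where the coveralls/Travis integration is available
sudo docker cp "$(cat ${container_id})":/root/scrapy-cluster/.coverage .
pip install coveralls
coveralls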

madisonb commented 8 years ago

We also need a way to fail the build if the tests don't pass. As far as I know, the way the docker exec command works is that it will plow through all of the bash commands without regard to the exit code. Because sudo docker exec... will always exit 0, we won't ever actually fail the build.

There are three points of failure:

  1. Ansible doesn't successfully deploy
  2. Offline tests fail
  3. Online tests fail

So we can either add a bunch of bash logic to check return codes and exit accordingly, or we can keep the two Docker jobs (the CentOS and Ubuntu checks) and add a third build job that runs everything in the native Travis VM.

Our Travis build would then have three parts:

1) Ubuntu Docker container (will never fail)
2) CentOS Docker container (will never fail)
3) Travis VM running the normal scripts, so Ansible or test failures will trigger the build to fail.
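
In .travis.yml terms that could just be an env matrix, something like (variable name is illustrative):

env:
  - TEST_SUITE=docker-ubuntu
  - TEST_SUITE=docker-centos
  - TEST_SUITE=native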

Thoughts?

The Travis script would have an environment flag feeding a big if-else block that runs either the Docker setup or the older Ansible/test setup, like:

if [ "$flag" = "true" ]; then
    sudo docker exec ...   # all docker commands like in your PR
else
    ./run_offline_tests.sh # all normal commands like before, so the exit code will trigger a build failure
fi
knirbhay commented 8 years ago

Another thought: we can grep the test result and have Docker send an exit code based on the result. For Ansible, the following command will do the trick.

sudo docker exec --tty "$(cat ${container_id})" env TERM=xterm /bin/bash -c \
"ansible-playbook -i ~/scrapy-cluster/ansible/travis.inventory ~/scrapy-cluster/ansible/scrapy-cluster.yml --connection=local --become \
    | grep -q 'changed=0.*failed=0' \
    && (echo 'Ansible deployment: pass' && exit 0) \
    || (echo 'Ansible deployment: fail' && exit 1)"

Let me verify these points of failure by forcefully failing them one by one inside Docker.

  1. Ansible doesn't successfully work
  2. Offline tests fail
  3. Online tests fail

What do you think?

knirbhay commented 8 years ago

> Interesting that the coveralls push is failing. It may be because you moved the coveralls command to run inside the Docker container instead of on the normal Travis machine. Coveralls has a built-in integration with Travis that the Docker container most likely does not.
>
> We should probably copy the coverage report from the Docker container to the host file system, then run the coveralls command from there.

We probably do not even need to copy the coverage reports, because I have mounted the source code folder inside Docker. Whatever coverage report gets generated inside the container will also be available outside it; we just have to run the coveralls command from outside the container.

I will verify it and let you know.

madisonb commented 8 years ago

OK, so the coveralls update works, which is why it used to be:

after_success:
- pip install coveralls; coveralls

but I can always change it back once we get everything else working.

The unit tests will fail with the text FAILED (errors=1) if you put a generic raise Exception() inside any of the tests, so we can grep for that too. So we would need to do it twice, once after the offline tests and once after the online tests?
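
A rough sketch of that check, assuming we tee each test run to a log file (file name is illustrative):

./run_offline_tests.sh 2>&1 | tee offline.log
if grep -q 'FAILED' offline.log; then exit 1; fi
# repeat with a second log file after ./run_online_tests.sh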

madisonb commented 8 years ago

Actually, I think I like this syntax better: it propagates the error codes and exits, and we don't have to use grep. Both Ansible and the test shell scripts return non-zero exit codes on errors, so we can just do:

ansible-playbook.... || (exit $?); .... ; ./run_offline_tests.sh || (exit $?); ./run_online_tests.sh || (exit $?);

See here, towards the bottom.

knirbhay commented 8 years ago

Awesome. Will do these changes on my local machine and try to complete it asap.

madisonb commented 8 years ago

So this purposefully failed, then, correct? The Travis build appears to have failed with the proper exit code 1, since you put some exceptions in the unit tests.

knirbhay commented 8 years ago

Yep, that's correct. But I am still verifying it with a few more use cases; once verification is complete, I will let you know.

knirbhay commented 8 years ago

I am not 100% sure, but it looks like Docker has addressed returning the exit code in this issue: Add support for docker exec to return cmd exitStatus. After verification, I can confirm our Travis setup works fine without forcefully returning the exit code.
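
A quick way to confirm that, reusing the container id file from above:

sudo docker exec "$(cat ${container_id})" /bin/bash -c "exit 3"
echo $?  # prints 3 if docker exec propagates the inner exit status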

knirbhay commented 8 years ago

Everything looks OK except the following.

Even though I am running the coveralls command outside the container, it looks like coveralls is pointing to the wrong directory. Any insight?

Installing collected packages: docopt, coveralls
Successfully installed coveralls-1.1 docopt-0.6.2
Submitting coverage to coveralls.io...
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/__init__.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/log_factory.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/method_timer.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/redis_queue.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/redis_throttled_queue.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/settings_wrapper.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/stats_collector.py
No source for /root/scrapy-cluster/crawler/crawling/__init__.py
No source for /root/scrapy-cluster/crawler/crawling/contextfactory.py
No source for /root/scrapy-cluster/crawler/crawling/custom_cookies.py
No source for /root/scrapy-cluster/crawler/crawling/distributed_scheduler.py
No source for /root/scrapy-cluster/crawler/crawling/items.py
No source for /root/scrapy-cluster/crawler/crawling/log_retry_middleware.py
No source for /root/scrapy-cluster/crawler/crawling/pipelines.py
No source for /root/scrapy-cluster/crawler/crawling/redis_dupefilter.py
No source for /root/scrapy-cluster/crawler/crawling/redis_retry_middleware.py
No source for /root/scrapy-cluster/crawler/crawling/spiders/__init__.py
No source for /root/scrapy-cluster/crawler/crawling/spiders/link_spider.py
No source for /root/scrapy-cluster/crawler/crawling/spiders/lxmlhtml.py
No source for /root/scrapy-cluster/crawler/crawling/spiders/redis_spider.py
No source for /root/scrapy-cluster/crawler/crawling/spiders/wandering_spider.py
No source for /root/scrapy-cluster/kafka-monitor/kafka_monitor.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/__init__.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/action_handler.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/base_handler.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/scraper_handler.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/stats_handler.py
No source for /root/scrapy-cluster/kafka-monitor/plugins/zookeeper_handler.py
No source for /root/scrapy-cluster/redis-monitor/plugins/__init__.py
No source for /root/scrapy-cluster/redis-monitor/plugins/base_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/expire_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/info_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/kafka_base_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/stats_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/stop_monitor.py
No source for /root/scrapy-cluster/redis-monitor/plugins/zookeeper_monitor.py
No source for /root/scrapy-cluster/redis-monitor/redis_monitor.py
No source for /root/scrapy-cluster/utils/scutils/__init__.py
Coverage submitted!

madisonb commented 8 years ago

That is much cleaner; the exit 1s were because you chained everything together, but now that you've split it up, it works nicely.

I can accept running coveralls outside of after_success. There are a number of Travis tickets about executing it only once when the build matrix succeeds (versus every single time). According to this link, the latest build run works with sub-builds, so adding the coverage back into the main script section, run within docker exec, is working.

Nice work so far! Thanks again for this; we are very close!

madisonb commented 8 years ago

I have begun merging your branch into the main repo; see here. I also added a commit to test the coverage report inside the script.

madisonb commented 8 years ago

Ah, so I think the issue is that the coverage report includes direct paths to the source files it executed on. Because the Docker volume mount path is not the same on each side, it can't find the original source files (the test scripts were run inside the container at /root/scrapy-cluster, while the original code lives at pwd), so we either need to find/replace the paths or mount things at the same path. Here is an example of the .coverage output when run locally:

...
"/Users/madisonb/Documents/Memex/scrapy-cluster/kafka-monitor/kafka_monitor.py": [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
...

As you can see, it creates a report with the full paths specified, and outside the container we don't have the same full path.
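
If we went the find/replace route, something like this on the host might work, assuming the .coverage file stores the paths as plain text (true for the coverage.py versions of that era):

# rewrite the container paths recorded in the report to the host checkout path
sed -i "s#/root/scrapy-cluster#${PWD}#g" .coverage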

madisonb commented 8 years ago

For when you see this: I got the CentOS build working on my branch, but the Ubuntu one is not working yet. I also got coverage working for the build that passes. https://travis-ci.org/istresearch/scrapy-cluster/builds/143163692

I really just need /tmp/ mounted in the Docker container, but that doesn't seem to be happening on the Ubuntu matrix build.

knirbhay commented 8 years ago

This is strange, because CentOS worked well. Let me check today whether I can mount both source locations in the correct places.

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-2.2%) to 62.461% when pulling 778675a14f61a290d0ad40cee98adb232d3ba681 on knirbhay:dev into d1f9999dec5db3450608169c6e1b7c9b8898ee87 on istresearch:dev.

coveralls commented 8 years ago

Coverage Status

Coverage decreased (-2.2%) to 62.461% when pulling 23bc767eaa74076e45d1e092d372b08f2cdaa10e on knirbhay:dev into d1f9999dec5db3450608169c6e1b7c9b8898ee87 on istresearch:dev.

knirbhay commented 8 years ago

This trick worked. I replaced all the paths with ${PWD} so that the filesystem is mounted in the container at the same location as on the host.
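
For reference, the mount now looks something like this (image name and startup command are illustrative):

# mount the repo at the identical absolute path inside the container so the
# paths recorded in .coverage resolve on the host too
sudo docker run -d -v "${PWD}":"${PWD}" -w "${PWD}" centos:7 /bin/bash -c "sleep infinity" > "${container_id}"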

But there are a few files related to the virtualenv. I believe these need not go into the coverage report. Is there a way to ignore these files, or should I create the virtualenv inside the source folder?

No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/__init__.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/log_factory.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/method_timer.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/redis_queue.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/redis_throttled_queue.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/settings_wrapper.py
No source for /root/sc/lib/python2.7/site-packages/scutils-1.2.0.dev4-py2.7.egg/scutils/stats_collector.py
Coverage submitted!
Couldn't find a repository matching this job.
'url'

madisonb commented 8 years ago

The coverage decreased because it apparently couldn't find those files. The coverage report from your latest commit has 31 files listed, whereas the old reports have 38 (thus the average is slightly different, since those files are no longer counted). We actually do want those files.

See this vs. yours. This is because the sc virtualenv directory isn't mounted the same way the repo is. I say the best solution is to create the virtualenv not at ~/sc but inside the mounted repo, like ~/scrapy-cluster/sc. I may have time to do this today, but I will drop a note in here if I get to it.
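
A minimal sketch of that change (paths are illustrative):

# create the virtualenv inside the mounted repo so its site-packages paths
# are identical on the host and in the container
virtualenv "${PWD}/sc"
source "${PWD}/sc/bin/activate"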

madisonb commented 8 years ago

With the commit I added, dc472a46796a86ed19bd5f67c44abfbb2ffeae56, we are back to our expected coverage. I think this satisfies everything, and if you are happy, I am! Awesome!

knirbhay commented 8 years ago

Looks good to me.