rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

scheduler bug? #14

Closed b1shan closed 11 years ago

b1shan commented 11 years ago

The scheduler doesn't seem to respect "allowed_domains" or settings like "DOWNLOAD_DELAY". Everything works fine when the scrapy-redis scheduler is disabled in settings. I tried looking at the code but can't find what's going wrong.
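
For context, the kind of configuration being toggled here looks roughly like the sketch below. The setting names are the standard Scrapy / scrapy-redis ones; the values are assumptions for illustration, not taken from the reporter's project.

```python
# settings.py -- illustrative sketch; values are assumptions, not from the report
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # scrapy-redis scheduler
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # Redis-based duplicate filter
SCHEDULER_PERSIST = False    # clear the Redis queue when the spider closes
DOWNLOAD_DELAY = 0.5         # plain Scrapy setting, enforced by the downloader, not the scheduler
```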

rmax commented 11 years ago

That's odd, as the offsite spider middleware takes care of the allowed domains and the core downloader takes care of the delay.

Which version of scrapy are you using?
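
For reference, the offsite filtering mentioned above is driven by the spider's allowed_domains attribute. A minimal, modern-style sketch (the spider name, domain, and parsing logic are assumptions for illustration, not the example project's code):

```python
import scrapy

class DmozSpider(scrapy.Spider):
    # Requests to domains outside allowed_domains are dropped by Scrapy's
    # built-in OffsiteMiddleware, independently of which scheduler is used.
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = ["http://www.dmoz.org/"]

    def parse(self, response):
        # Follow every link; off-domain links are filtered out before download.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```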

b1shan commented 11 years ago

Sorry for not mentioning the version. It's Scrapy 0.18.3.

Yes, it's odd that the offsite spider middleware isn't kicking in as expected, and neither are the core downloader settings. I can't find anything explicitly wrong in your code either. Appreciate you looking into it. Thanks.

b1shan commented 11 years ago

Any update, Rolando?

rmax commented 11 years ago

Sorry, I'm a bit busy these days. However, I did a test with the example spiders, modifying the allowed_domains attribute, and it works correctly: the download delay is applied and the domains are filtered.

Be aware that the start URLs are not filtered by Scrapy; that is, if you push a URL from a non-allowed domain onto the start_urls queue, it is going to get crawled by the spider.
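
Pushing a start URL typically looks something like the sketch below (using redis-py). The key name ("dmoz:start_urls") and the URL are placeholders; the actual key depends on the spider's name / redis_key.

```python
# Illustrative sketch: push a seed URL onto the spider's start_urls queue.
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.lpush("dmoz:start_urls", "http://www.dmoz.org/")
```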

Can you provide a minimal project code that reproduces your problem?

b1shan commented 11 years ago

I also tried with the example dmoz spider from git, but no luck. As you can see below, the delay doesn't kick in:

$ scrapy crawl dmoz -s LOG_LEVEL=INFO -s DOWNLOAD_DELAY=0.5
2013-10-29 11:45:26+0530 [scrapy] INFO: Scrapy 0.18.3 started (bot: scrapybot)
2013-10-29 11:45:26+0530 [dmoz] INFO: Spider opened
2013-10-29 11:45:26+0530 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-10-29 11:46:26+0530 [dmoz] INFO: Crawled 95 pages (at 95 pages/min), scraped 190 items (at 190 items/min)
2013-10-29 11:47:26+0530 [dmoz] INFO: Crawled 194 pages (at 99 pages/min), scraped 1982 items (at 1792 items/min)
2013-10-29 11:48:26+0530 [dmoz] INFO: Crawled 295 pages (at 101 pages/min), scraped 3456 items (at 1474 items/min)

rmax commented 11 years ago

The delay looks right. If you set a 0.5 s delay and each request has a latency of about 0.1 s, then the crawl rate is roughly 60 / 0.6 ≈ 100 pages/min.

What did you expect having set the delay to 0.5? Try a value of 10 or greater to see how it waits between requests.
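
A quick back-of-the-envelope check of that arithmetic (the ~0.1 s latency is an assumed figure from the comment above, not a measured value):

```python
# One request roughly every (delay + latency) seconds per download slot,
# so pages/min ~= 60 / (delay + latency).
def expected_pages_per_minute(download_delay, latency=0.1):
    return 60.0 / (download_delay + latency)

print(expected_pages_per_minute(0.5))   # ~100 pages/min, matching the log above
print(expected_pages_per_minute(10.0))  # ~6 pages/min, where the delay is obvious
```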

b1shan commented 11 years ago

You are right, this run doesn't look bad; the earlier ones were much worse.

I think my problem might have to do with SCHEDULER_PERSIST, which was enabled by default in the settings file during my first few runs.

For start_urls on the Redis queue, I pushed just one URL, from the correct domain. Do you mean that allowed_domains doesn't take effect when pushing URLs that way?
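
For what it's worth, if SCHEDULER_PERSIST was left on, earlier runs leave their request queue and dupefilter behind in Redis, and those leftover requests keep getting scheduled on the next run. A cleanup sketch is below; the key names follow the usual "<spider>:requests" / "<spider>:dupefilter" defaults, and "dmoz" is an assumed spider name.

```python
# Illustrative cleanup sketch: delete leftover scrapy-redis keys so the next
# run starts from a clean queue and dupefilter.
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.delete("dmoz:requests", "dmoz:dupefilter")
```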

b1shan commented 11 years ago

All seems good now. I don't push start_urls to Redis, and I have scheduler persistence turned off. Thanks for your help and for sharing this code.