istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Scrape delay #250

Closed: tluyben closed this issue 3 years ago

tluyben commented 3 years ago

Is there a delay between feeding to Kafka and scraping? I set up Docker from the master branch here, ran the unit tests (all succeed) in every container, and sent off a feed. This works, but it takes minutes before crawled_firehose produces a result for a tiny site that takes milliseconds to fetch with wget on the same server. I tried many other sites, but the result is the same: minutes of delay. There are no errors in any of the logs and the server is a monster with 0 load. Any idea what could cause that?
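(For reference, one way to measure that gap is to timestamp the feed and then watch crawled_firehose directly. A minimal sketch using kafka-python, assuming the default demo.crawled_firehose topic name and a broker on localhost:9092; adjust both to match your setup.)

```python
import json
import time

from kafka import KafkaConsumer  # pip install kafka-python

# Consume the crawled firehose topic and report how long after "start"
# the first crawl result arrives.
consumer = KafkaConsumer(
    "demo.crawled_firehose",                 # assumed default topic name
    bootstrap_servers="localhost:9092",      # assumed broker address
    auto_offset_reset="latest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

start = time.time()  # note the time right after sending the feed request
for message in consumer:
    elapsed = time.time() - start
    print("crawl result for %s after %.1fs" % (message.value.get("url"), elapsed))
    break
```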

madisonb commented 3 years ago

Double-check what your QUEUE_HITS and QUEUE_WINDOW settings are (docs). By default the crawler cluster is relatively slow compared to a typical curl request, but the settings are easily changed.
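For example, a crawler localsettings.py override along these lines (a sketch, assuming the documented behavior of QUEUE_HITS requests allowed per QUEUE_WINDOW seconds for each domain) would let the spiders hit a single domain much faster:

```python
# Sketch of a crawler localsettings.py override; values are illustrative.
# QUEUE_HITS requests are allowed per QUEUE_WINDOW seconds, per domain queue.
QUEUE_HITS = 50     # allow more hits per window than the stock default
QUEUE_WINDOW = 60   # window length in seconds
```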

You should also be able to see how quickly your crawl becomes eligible to be scraped by checking Redis. If a domain queue key like <spiderid>:<domain>:queue shows up almost instantly, you know the Kafka monitor is working correctly (and quickly). From there, your spiders check the queue every so often and then initiate the website request.
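A quick way to inspect that from the Redis side (a sketch using redis-py, assuming Redis on localhost and that the domain queues are stored as sorted sets under keys matching the <spiderid>:<domain>:queue pattern described above):

```python
import redis

# Connect to the cluster's Redis instance (assumed to be local here).
r = redis.Redis(host="localhost", port=6379)

# List every domain queue key and how many requests are waiting in it.
for key in r.scan_iter(match="*:*:queue"):
    print(key.decode(), "->", r.zcard(key), "queued requests")
```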

Lastly, you can also check out these docs, as they help explain the distributed throttling process. The only other thing I can think of is that the Kafka buffer is a little slow, but a 60+ second delay seems out of the ordinary.

Otherwise I would need more info about how to reproduce the issue.

madisonb commented 3 years ago

Closing due to lack of activity and no reproducible instructions.