istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Jobs disappearing: is there a way to monitor crawls? #251

Closed benjaminelkrieff closed 3 years ago

benjaminelkrieff commented 3 years ago

Hello there.

I have been using scrapy-cluster for my project and so far the results are good. The issues only appeared once I started using it at scale. I have a spider that authenticates to a website, solves a recaptcha and then scrapes the site. Out of 1000 requests, for example, I see only 993 that complete the job. When I run the same load with a simpler spider, with no login and nothing complex, I have no problem.

I haven't yet been able to figure out where the problem is, and I suspect Scrapy is the guilty party. Probably some requests are being dropped because of too many retries or for other reasons (by the way, I tried increasing the number of retry times and the same problem happens).
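One way to confirm that theory is to instrument the spider itself. Below is a minimal sketch using plain Scrapy signals and the stats collector to surface dropped requests, callback errors and retry exhaustion; the spider name and URL are placeholders, and a scrapy-cluster spider would inherit from the cluster's own spider base class rather than scrapy.Spider, but the hooks are the same idea.

```python
import scrapy
from scrapy import signals


class AuditedSpider(scrapy.Spider):
    name = "audited_example"                  # placeholder name
    start_urls = ["https://example.com"]      # placeholder target

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Hook the signals that tell us where requests silently go missing.
        crawler.signals.connect(spider.on_request_dropped, signal=signals.request_dropped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # Real scraping logic goes here; this sketch only audits outcomes.
        yield {"url": response.url, "status": response.status}

    def on_request_dropped(self, request, spider):
        # Fired when the scheduler rejects a request (e.g. filtered duplicate).
        self.logger.warning("Request dropped by scheduler: %s", request.url)

    def on_spider_error(self, failure, response, spider):
        # Fired when a callback raises an exception.
        self.logger.error("Callback error on %s: %s", response.url, failure.getErrorMessage())

    def on_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        # retry/max_reached counts requests abandoned after RETRY_TIMES attempts.
        self.logger.info(
            "close reason=%s retry/max_reached=%s downloader/exception_count=%s",
            reason,
            stats.get("retry/max_reached", 0),
            stats.get("downloader/exception_count", 0),
        )
```

Comparing retry/max_reached and the dropped-request log lines against the number of missing jobs should show whether the 7 lost requests die inside Scrapy or never reach the spider at all.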

With all of these issues, I came to the conclusion that I need a platform to monitor the crawls.

Thank you for your time

madisonb commented 3 years ago

@benjaminelkrieff I would suggest the following options to help you understand your jobs

With a combination of all three of those, you should be able to see whether your Scrapy Cluster is working properly (a sketch of querying the Stats API follows at the end of this comment). To answer your bullet points directly:

Given that I highly suspect this is a direct spider issue, rather than a Scrapy Cluster coordination issue, I am going to suggest we close this ticket, as I do not provide support for custom spider implementations inside of GitHub, per the issue guidelines.
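For reference, the cluster's built-in Stats API can be queried over Kafka without touching the spiders. Here is a minimal sketch using kafka-python, assuming the default demo.incoming and demo.outbound_firehose topic names and a broker on localhost; the appid is a placeholder and the response is matched back by the request uuid.

```python
import json
import uuid

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"  # assumption: Kafka reachable locally

# Stats request; appid is arbitrary, uuid ties the response back to this call.
request = {
    "uuid": uuid.uuid4().hex,
    "appid": "monitor-check",
    "stats": "all",  # narrower views are also possible, e.g. "crawler" or "queue"
}

# Listen on the firehose; old messages are skipped by the uuid filter below.
consumer = KafkaConsumer(
    "demo.outbound_firehose",
    bootstrap_servers=BROKERS,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=15000,
)

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("demo.incoming", request)
producer.flush()

for message in consumer:
    if message.value.get("uuid") == request["uuid"]:
        print(json.dumps(message.value, indent=2))
        break
```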

benjaminelkrieff commented 3 years ago

Hi, and thank you once again for the help. Thanks to your answers, everything is much clearer now. You can close the ticket.

PS: I started integrating Scrapyd and ScrapydWeb in the crawler container; is there any reason you wouldn't recommend doing this?

madisonb commented 3 years ago

I don't use Scrapyd, so I can't comment on it - this project is open source, so feel free to modify it to fit your needs to your heart's content.

Other distributed scrapy projects I've come across over the years can be found at https://scrapy-cluster.readthedocs.io/en/latest/topics/advanced/comparison.html

benjaminelkrieff commented 3 years ago

Thank you for the answer.