Closed: benjaminelkrieff closed this issue 3 years ago
@benjaminelkrieff I would suggest the following settings to help you understand your jobs:
- LOG_ENABLED=True
- LOG_LEVEL=DEBUG
- SC_LOG_LEVEL=DEBUG
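For example, in the crawler's Scrapy settings module (a minimal sketch; if you follow scrapy-cluster's override convention this would go in localsettings.py):

```python
# Debug-logging settings for a scrapy-cluster crawler (sketch).
LOG_ENABLED = True      # standard Scrapy setting: turn logging on
LOG_LEVEL = 'DEBUG'     # standard Scrapy setting: emit debug-level detail
SC_LOG_LEVEL = 'DEBUG'  # scrapy-cluster setting: debug detail from the cluster's own log factory
```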
With a combination of all three of those you should be able to see whether your Scrapy Cluster is working properly. To answer your bullet points directly:
Given that I highly suspect this is a direct spider issue rather than a Scrapy Cluster coordination issue, I am going to suggest we close this ticket, as I do not provide support for custom spider implementations inside of GitHub, per the issue guidelines.
Hi, and thank you for the help once again. Thanks to your answers, everything is clearer now. You can close the ticket.
PS: I started integrating Scrapyd and ScrapydWeb into the crawler container; is there any reason you wouldn't recommend doing this?
I don't use Scrapyd, so I can't comment on it. This project is open source, so feel free to modify it to fit your needs to your heart's content.
Other distributed scrapy projects I've come across over the years can be found at https://scrapy-cluster.readthedocs.io/en/latest/topics/advanced/comparison.html
Thank you for the answer.
Hello there.
I have been using scrapy-cluster for my project, and so far the results are good. The problems only started once I began using it at scale. I have a spider that authenticates to a website, solves a reCAPTCHA, and then scrapes the site; a simplified sketch of its flow is below. For example, out of 1000 requests, I see only 993 that complete the job. When I run the same scale load with a simpler spider, with no login and nothing complex, I have no problem.
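For context, the spider's flow is roughly the following. This is a heavily simplified sketch: the URLs, form fields, selectors, and the solve_recaptcha helper are placeholders, not my actual code.

```python
import scrapy


def solve_recaptcha(response):
    # Placeholder: the real spider calls out to a captcha-solving service here.
    return "dummy-recaptcha-token"


class AuthScrapeSpider(scrapy.Spider):
    # Simplified stand-in for my spider: log in, pass the captcha, then scrape.
    name = "auth_scrape"
    start_urls = ["https://example.com/login"]

    def parse(self, response):
        token = solve_recaptcha(response)
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "username": "user",
                "password": "pass",
                "g-recaptcha-response": token,
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        # Follow the links behind the login and scrape each page.
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```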
I haven't yet been able to figure out where the problem is, and I suspect Scrapy itself is the guilty party. It is probably a request being dropped because of too many retries or for some other reason (incidentally, I tried increasing the retry count and the same problem happens).
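For reference, this is roughly the retry tuning I tried, using Scrapy's built-in RetryMiddleware settings; the values are illustrative rather than my exact configuration.

```python
# settings.py -- retry tuning (Scrapy's built-in RetryMiddleware settings).
# The values below are illustrative; raising RETRY_TIMES did not fix the loss.
RETRY_ENABLED = True
RETRY_TIMES = 10  # raised from Scrapy's default of 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```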
With all of these issues, I came to the conclusion that I need a platform to monitor the crawls.
Thank you for your time