istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Jobs disappearing: is there a way to monitor crawls? #251

Closed benjaminelkrieff closed 3 years ago

benjaminelkrieff commented 3 years ago

Hello there.

I have been using scrapy-cluster for my project and so far the results are good. The issues only appeared once I started using it at scale. I have a spider that authenticates to a website, solves a recaptcha and then scrapes the site. Out of 1000 requests, for example, I see only 993 that complete the job. When I run the same load with a simpler spider, with no login and nothing complex, I have no problem.

I haven't yet been able to figure out where the problem is, and I suspect Scrapy is the guilty party. Probably some requests are being dropped because of too many retries or for other reasons (by the way, I tried increasing the number of retry times and the same problem happens).
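One way to confirm that theory is to instrument the spider itself. Below is a minimal sketch using plain Scrapy signals and the stats collector to surface dropped requests, callback errors and retry exhaustion; the spider name and URL are placeholders, and a scrapy-cluster spider would inherit from the cluster's own spider base class rather than scrapy.Spider, but the hooks are the same idea.

```python
import scrapy
from scrapy import signals


class AuditedSpider(scrapy.Spider):
    name = "audited_example"                  # placeholder name
    start_urls = ["https://example.com"]      # placeholder target

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Hook the signals that tell us where requests silently go missing.
        crawler.signals.connect(spider.on_request_dropped, signal=signals.request_dropped)
        crawler.signals.connect(spider.on_spider_error, signal=signals.spider_error)
        crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
        return spider

    def parse(self, response):
        # Real scraping logic goes here; this sketch only audits outcomes.
        yield {"url": response.url, "status": response.status}

    def on_request_dropped(self, request, spider):
        # Fired when the scheduler rejects a request (e.g. filtered duplicate).
        self.logger.warning("Request dropped by scheduler: %s", request.url)

    def on_spider_error(self, failure, response, spider):
        # Fired when a callback raises an exception.
        self.logger.error("Callback error on %s: %s", response.url, failure.getErrorMessage())

    def on_closed(self, spider, reason):
        stats = self.crawler.stats.get_stats()
        # retry/max_reached counts requests abandoned after RETRY_TIMES attempts.
        self.logger.info(
            "close reason=%s retry/max_reached=%s downloader/exception_count=%s",
            reason,
            stats.get("retry/max_reached", 0),
            stats.get("downloader/exception_count", 0),
        )
```

Comparing retry/max_reached and the dropped-request log lines against the number of missing jobs should show whether the 7 lost requests die inside Scrapy or never reach the spider at all.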

With all of these issues, I came to the conclusion that I need a platform to monitor the crawls.

Thank you for your time

madisonb commented 3 years ago

@benjaminelkrieff I would suggest the following options to help you understand your jobs

With a combination of all three of those, you should be able to see whether your Scrapy Cluster is working properly (a sketch of querying the Stats API follows at the end of this comment). To answer your bullet points directly:

Given that I highly suspect this is a direct spider issue, rather than a Scrapy Cluster coordination issue, I am going to suggest we close this ticket, as I do not provide support for custom spider implementations inside of GitHub, per the issue guidelines.
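For reference, the cluster's built-in Stats API can be queried over Kafka without touching the spiders. Here is a minimal sketch using kafka-python, assuming the default demo.incoming and demo.outbound_firehose topic names and a broker on localhost; the appid is a placeholder and the response is matched back by the request uuid.

```python
import json
import uuid

from kafka import KafkaConsumer, KafkaProducer

BROKERS = "localhost:9092"  # assumption: Kafka reachable locally

# Stats request; appid is arbitrary, uuid ties the response back to this call.
request = {
    "uuid": uuid.uuid4().hex,
    "appid": "monitor-check",
    "stats": "all",  # narrower views are also possible, e.g. "crawler" or "queue"
}

# Listen on the firehose; old messages are skipped by the uuid filter below.
consumer = KafkaConsumer(
    "demo.outbound_firehose",
    bootstrap_servers=BROKERS,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=15000,
)

producer = KafkaProducer(
    bootstrap_servers=BROKERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("demo.incoming", request)
producer.flush()

for message in consumer:
    if message.value.get("uuid") == request["uuid"]:
        print(json.dumps(message.value, indent=2))
        break
```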

benjaminelkrieff commented 3 years ago

Hi, and thank you once again for the help. Thanks to your answers, everything is much clearer now. You can close the ticket.

PS: I started integrating Scrapyd and ScrapydWeb in the crawler container; is there any reason you wouldn't recommend doing this?

madisonb commented 3 years ago

I don't use Scrapyd, so I can't comment on it - this project is open source, so feel free to modify it to fit your needs to your heart's content.

Other distributed scrapy projects I've come across over the years can be found at https://scrapy-cluster.readthedocs.io/en/latest/topics/advanced/comparison.html

benjaminelkrieff commented 3 years ago

Thank you for the answer.