my8100 / scrapydweb

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI. DEMO:
https://github.com/my8100/files

Jobs are killed without a clear reason #233

Closed: payala closed this issue 4 months ago

payala commented 4 months ago

Describe the bug: In some situations, which I haven't narrowed down yet, the jobs are being killed. It's not clear why, which makes the problem difficult to fix.

To Reproduce: Start multiple jobs and let them run. This doesn't happen with around 5 jobs, but once there are between 8 and 15 jobs, the kills start to happen.

Expected behavior: The jobs are not killed, and if they are killed, the reason why they were killed is visible.

Logs "project": null, "version_spider_job": null } [2024-06-18 17:49:24,558] DEBUG in ApiView: view_args of http://localhost:62287/1/api/daemonstatus/ { "node": 1, "opt": "daemonstatus", "project": null, "version_spider_job": null } [2024-06-18 17:49:24,560] DEBUG in ApiView: view_args of http://localhost:62287/1/api/daemonstatus/ { "node": 1, "opt": "daemonstatus", "project": null, "version_spider_job": null } [2024-06-18 17:49:24,562] DEBUG in ApiView: view_args of http://localhost:62287/1/api/daemonstatus/ { "node": 1, "opt": "daemonstatus", "project": null, "version_spider_job": null }

Screenshots: (screenshot attached in the original issue)


my8100 commented 4 months ago

Looks like your spider job was killed by the system, for example due to "Out of Memory" (OOM) or other reasons. It's not an issue with scrapydweb; you need to check the Docker logs.
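For reference, a minimal sketch of how one could poll scrapyd's own daemonstatus.json endpoint to see whether the daemon itself becomes unreachable, which would point at the container being killed rather than at scrapydweb. The URL below assumes scrapyd's default port 6800; adjust it to the actual setup.

```python
# Sketch: poll scrapyd's daemonstatus.json to detect when the daemon goes away.
# Assumes scrapyd is reachable at its default address; adjust as needed.
import time
import requests

SCRAPYD_STATUS_URL = "http://localhost:6800/daemonstatus.json"  # assumed default

while True:
    timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        status = requests.get(SCRAPYD_STATUS_URL, timeout=5).json()
        # Typical response: {"status": "ok", "pending": 0, "running": 8, "finished": 3, ...}
        print(timestamp, status)
    except requests.RequestException as exc:
        print(timestamp, f"scrapyd unreachable: {exc}")
    time.sleep(30)
```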

payala commented 4 months ago

Alright, thanks for the feedback. I was able to find the exact time when all the jobs were killed, but I didn't find anything in the scrapydweb logs.

However, it's not clear which Docker logs I should check; I guess it's on the scrapyd side, right?

So, from what you say, I understand that scrapydweb just detected that jobs that had been running were suddenly killed and marks them accordingly. The reason for the kill is external to scrapydweb, probably something scrapyd itself decided to do, if I got it right.

payala commented 4 months ago

OK, indeed, the scrapyd pod was killed by k8s due to too much memory usage.

So now I understand what scrapydweb means here. It's not, as I originally understood, that scrapydweb made the decision to kill the process; rather, it saw the process suddenly disappear and marks it as "Kill", meaning "someone killed it, it suddenly disappeared".

It would still be nice (maybe there is a way and I just don't know it) to be able to clear these lines to acknowledge them. It is very nice that they stand out, so you notice them.

my8100 commented 4 months ago

ScrapydWeb shows "Kill pid" just to tell the user that they may need to quit the spider job by killing the process. It happens when scrapyd is restarted and some job PIDs are no longer seen via the scrapyd API. The underlying cause needs to be found on the host OS manually.

It would be nice if you could share the cmd for troubleshooting.
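As a sketch of the kind of manual check meant here (assuming access to the scrapyd host and a PID taken from the "Kill pid" hint), one could verify whether that process is still alive and looks like a Scrapy crawl before killing it. The psutil usage below is an assumption about how one might do this; it has to run on the same host or container as scrapyd.

```python
# Sketch: inspect a PID reported by scrapydweb's "Kill pid" hint.
# Must run on the same host/container as scrapyd; psutil is an assumption here.
import sys
import psutil

def inspect_pid(pid: int) -> None:
    if not psutil.pid_exists(pid):
        print(f"PID {pid} is gone; the job already died (e.g. the process was OOM-killed).")
        return
    cmdline = " ".join(psutil.Process(pid).cmdline())
    print(f"PID {pid} is still running: {cmdline}")
    if "scrapy" in cmdline.lower():
        print("Looks like an orphaned Scrapy process; it can be terminated manually.")

if __name__ == "__main__":
    inspect_pid(int(sys.argv[1]))
```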

my8100 commented 4 months ago

You may try out the new implementation of scrapyd.jobstorage.SqliteJobStorage in scrapyd if you find that only scrapyd is restarted while the spider processes are still active.

https://github.com/scrapy/scrapyd/pull/418/commits/5d088bdfe948addad4cc9e3bf0c9156b53732665
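For reference, switching to that persistent job storage is a scrapyd setting rather than a scrapydweb one; a sketch of the relevant scrapyd.conf entry, assuming a scrapyd version that ships SqliteJobStorage:

```ini
[scrapyd]
# Keep job records in SQLite so they survive a scrapyd restart,
# instead of the default in-memory job storage.
jobstorage = scrapyd.jobstorage.SqliteJobStorage
```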

payala commented 4 months ago

Not sure which cmd you are referring to, but on my side everything is clear now. The scrapyd pod was indeed being restarted by k8s due to high memory usage. I have now found a good combination of maximum concurrent processes in scrapyd and pod memory that reduces the chances of this happening, so all good on my side now, at least.
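As a sketch of what such a combination could look like on the scrapyd side (the values below are placeholders, to be tuned against the pod's memory limit):

```ini
[scrapyd]
# Cap concurrent Scrapy processes so total memory stays under the pod limit.
# max_proc = 0 (the default) means the limit is derived from max_proc_per_cpu.
max_proc = 8
max_proc_per_cpu = 4
```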