scrapy / scrapyd

A service daemon to run Scrapy spiders
https://scrapyd.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Are Scrapyd log files created with delays? #503

Closed aaronm137 closed 2 months ago

aaronm137 commented 2 months ago

Hello,

I am running about 200 spiders every day at the same time. It's the same spider, just with different parameters, so it runs 200 times. My Scrapyd config allows 8 jobs per CPU core; with 2 cores, I can run a total of 16 spiders at the same time.
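For context, a minimal sketch of how such a setup can be scheduled through Scrapyd's schedule.json endpoint (a real Scrapyd API); the host, the project/spider names, and the task_id argument are illustrative, and the 8-jobs-per-core cap corresponds to max_proc_per_cpu = 8 in scrapyd.conf:

```python
import requests

# Schedule the same spider 200 times with different arguments via Scrapyd's
# schedule.json API. Project/spider names and task_id are illustrative.
SCRAPYD = "http://localhost:6800"

for task_id in range(200):
    resp = requests.post(f"{SCRAPYD}/schedule.json", data={
        "project": "my_proj",
        "spider": "my_proj",
        # Any extra field is passed through to the spider as a -a argument.
        "task_id": str(task_id),
    })
    resp.raise_for_status()
    print(resp.json()["jobid"])  # Scrapyd returns the job id it assigned
```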

Each spider creates a log file on the server. Once a spider finishes, my script takes its log file, uploads it to third-party storage, and deletes the original.
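A rough sketch of that flow, assuming logs_dir = logs in scrapyd.conf and Scrapyd's <logs_dir>/<project>/<spider>/<job>.log layout; listjobs.json is a real Scrapyd endpoint, while upload_to_storage() is a hypothetical stand-in for the storage client:

```python
import os
import requests

SCRAPYD = "http://localhost:6800"
LOGS_DIR = "logs"  # must match logs_dir in scrapyd.conf

def upload_to_storage(path: str) -> None:
    """Hypothetical stand-in for the third-party storage client."""
    raise NotImplementedError

# Finished jobs are listed under "finished", each with "spider" and "id" keys.
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": "my_proj"}).json()
for job in jobs["finished"]:
    log_path = os.path.join(LOGS_DIR, "my_proj", job["spider"], job["id"] + ".log")
    upload_to_storage(log_path)  # opening a missing file raises [Errno 2]
    os.remove(log_path)
```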

What I noticed was that sometimes I received an error when the script tried to copy the log file from the server ([Errno 2] No such file or directory: 'logs/my_proj/my_proj/task_169_2024-06-14T13_55_48.log'). I logged in to the server where Scrapyd writes its logs to check whether the log file was there, and it was not.

For debugging purposes, I disabled the part of my script that deletes the log file. The error was gone, but I found out that Scrapyd hadn't created the log file in the first place (hence, the log file was never copied to the third-party server).

When I dug into this issue again this morning (after 12 hours), I checked the third-party server and the log files had been successfully copied there. This leads me to a question about how Scrapyd creates the log files. Is there something like a queue, so that log files are created on the server with some delay? What is causing this effect?

Thanks

jpmckinney commented 2 months ago

All Scrapyd does is override Scrapy's LOG_FILE setting if logs_dir is set in Scrapyd's configuration file.
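For reference, a sketch of what that amounts to, assuming logs_dir = logs and with the names reconstructed from the error message above (values are illustrative):

```python
import os

# Roughly what Scrapyd does when logs_dir is set in scrapyd.conf: it composes
# a per-job path and hands it to the Scrapy subprocess as the LOG_FILE setting.
# Scrapy, not Scrapyd, then opens and writes the file.
logs_dir = "logs"  # [scrapyd] logs_dir = logs
project, spider, job = "my_proj", "my_proj", "task_169_2024-06-14T13_55_48"

log_file = os.path.join(logs_dir, project, spider, f"{job}.log")
# -> logs/my_proj/my_proj/task_169_2024-06-14T13_55_48.log
# The crawl is launched roughly as: scrapy crawl <spider> -s LOG_FILE=<log_file>
```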

I don't know on what schedule Scrapy itself creates log files, but Scrapyd is not responsible for writing the file.

You might want to start a discussion or open an issue on Scrapy.