Closed: waldner closed this issue 6 years ago
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up loading of the utf8 and stats HTML pages. One round of caching means fetching the logs of all running and finished jobs from all Scrapyd servers and generating the corresponding utf8 and stats HTML files. If a round finishes within 300 seconds, ScrapydWeb waits until the 300 seconds are up before starting the next round; if the last round took longer than 300 seconds, the next round starts immediately. Please check the CACHE_INTERVAL_SECONDS setting: you may increase the interval, or even set DISABLE_CACHE to True, if too many logs have accumulated since your Scrapyd server started up.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 300
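For reference, here is a minimal sketch of how those two options could be tuned in the ScrapydWeb settings file; the file name and the values below are placeholders, not recommendations:

# scrapydweb_settings.py -- sketch only; adjust to your own deployment
# Keep background caching enabled but poll less aggressively,
# e.g. every 30 minutes instead of every 5 minutes.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 1800
# Or disable background caching entirely; the utf8 and stats pages
# would then presumably be generated only when requested.
# DISABLE_CACHE = True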
It tries to locate the Scrapy log in the order below: if no JOBID.log is found, it tries JOBID.log.gz, and so on. You may adjust the order to suit your own case.
SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']
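To make the first-match-wins order concrete, here is a rough sketch; the helper function and URL layout are illustrative assumptions, not ScrapydWeb's actual code:

import requests  # third-party HTTP client, used only for this sketch

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']

def locate_log(scrapyd_url, project, spider, job_id):
    """Return the first log URL that exists, trying each extension in order."""
    for ext in SCRAPYD_LOG_EXTENSIONS:
        url = f"{scrapyd_url}/logs/{project}/{spider}/{job_id}{ext}"
        if requests.head(url).status_code == 200:
            return url  # e.g. JOBID.log found, stop before trying JOBID.log.gz
    return None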
Can you show me some key logs of ScrapydWeb?
By the way, how do you run two Scrapyd instances on one machine? Via Docker?
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up loading of the utf8 and stats HTML pages
I understand that scrapydweb wants to cache files, but it seems to me that once a log file has been fetched from the source and scrapyd has marked the job as finished, it will not change anymore, so there's no need to waste CPU and bandwidth refetching it constantly (what if one has 20 or 100 scrapyd instances, each with multiple projects and spiders?).
Please check the CACHE_INTERVAL_SECONDS setting.
My CACHE_INTERVAL_SECONDS is set to 300 as well. Does that mean scrapydweb will try to refresh its cache every 300 seconds? If so, it's not working correctly, since it fetches the logs much more often than every 300 seconds.
How do you run two Scrapyd instances on one machine? Via Docker?
Yes I'm using docker to run both scrapyd and scrapydweb.
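Roughly speaking, the two Scrapyd containers publish different ports, and both endpoints are listed in scrapydweb's settings, something like this (hosts and ports here are placeholders, not my real setup):

# scrapydweb_settings.py -- placeholder values
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',  # first scrapyd container
    '127.0.0.1:6801',  # second scrapyd container
]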
Can you show me some key logs of ScrapydWeb?
Scrapydweb's logs just show a continuous stream of POSTs to 127.0.0.1 (I suppose to update its caches with the logs fetched from the scrapyd instances):
127.0.0.1 - - [14/Oct/2018 17:35:39] "POST /2/log/stats/project2/spider1/spider1_2018-10-10_16-56-55/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:35:42] "POST /2/log/utf8/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:04] "POST /2/log/stats/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /1/log/utf8/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /2/log/stats/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:06] "POST /1/log/utf8/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:07] "POST /2/log/stats/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:08] "POST /1/log/utf8/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /2/log/stats/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /1/log/utf8/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:16] "POST /1/log/utf8/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /2/log/stats/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /1/log/utf8/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /1/log/utf8/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:22] "POST /1/log/utf8/project1/spider11/spider11_2018-10-11_06-02-01/ HTTP/1.1" 200 -
...
I have updated my first comment and will improve the caching mechanism in the next version. Thank you for your advice!
In the meantime, you may also comment out the following line to disable caching the Scrapy logs of finished jobs:
https://github.com/my8100/scrapydweb/blob/master/scrapydweb/cache.py#L52
update_cache('finished')
Commenting out that line makes things much better. Thanks!
@waldner Do you have any other suggestions for ScrapydWeb, for example about the Overview page or the Run page, or any other needs? Thank you in advance.
Right now I don't have any suggestions; if I happen to stumble across some other issue, I will report it then. Thanks!
@waldner v0.9.6 updates the caching mechanism: a finished job is now cached only once.
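A rough sketch of that idea (illustrative only, not the actual v0.9.6 code and not ScrapydWeb's internal names): remember which finished jobs have already been cached and skip them in later rounds, so only running jobs keep being re-fetched.

# Finished logs can no longer change, so cache each of them at most once.
cached_finished_jobs = set()

def run_caching_round(running_jobs, finished_jobs, fetch_and_cache):
    for job in running_jobs:
        fetch_and_cache(job)              # running logs may still grow
    for job in finished_jobs:
        if job in cached_finished_jobs:
            continue                      # already cached, skip
        fetch_and_cache(job)
        cached_finished_jobs.add(job)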
I have a machine with two scrapyd instances and one scrapydweb instance running; scrapydweb is connected to both scrapyd instances. However, scrapydweb's CPU usage is very high all the time. Investigating a bit, I've seen that scrapydweb is constantly making requests (more than two per second) to both scrapyd instances, requesting logs (each request is also made twice, once asking for the uncompressed log and once for the compressed one).
Now my question is: why does scrapydweb need to be constantly fetching scrapyd logs? Once it has got them, they aren't going to change.