Closed: waldner closed this issue 6 years ago
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up loading of the utf8 and stats HTML pages. One round of caching means fetching the logs of all running and finished jobs from all Scrapyd servers and generating the corresponding utf8 and stats HTML files. If a round finishes within 300 seconds, ScrapydWeb waits until the 300 seconds are up before starting the next round; if the last round took longer than 300 seconds, the next round starts immediately. Please check the CACHE_INTERVAL_SECONDS setting: you may increase the interval, or even set DISABLE_CACHE to True, if too many logs have accumulated since your Scrapyd server started up.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 300
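For reference, here is a minimal sketch of how those two options could be tuned in the ScrapydWeb settings file; the file name and the values below are placeholders, not recommendations:

# scrapydweb_settings.py -- sketch only; adjust to your own deployment
# Keep background caching enabled but poll less aggressively,
# e.g. every 30 minutes instead of every 5 minutes.
DISABLE_CACHE = False
CACHE_INTERVAL_SECONDS = 1800
# Or disable background caching entirely; the utf8 and stats pages
# would then presumably be generated only when requested.
# DISABLE_CACHE = True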
It tries to locate the Scrapy log in the order below: if no JOBID.log is found, it tries JOBID.log.gz, and so on. You may adjust the order to suit your own case.
SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']
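To make the first-match-wins order concrete, here is a rough sketch; the helper function and URL layout are illustrative assumptions, not ScrapydWeb's actual code:

import requests  # third-party HTTP client, used only for this sketch

SCRAPYD_LOG_EXTENSIONS = ['.log', '.log.gz', '.gz', '.txt', '']

def locate_log(scrapyd_url, project, spider, job_id):
    """Return the first log URL that exists, trying each extension in order."""
    for ext in SCRAPYD_LOG_EXTENSIONS:
        url = f"{scrapyd_url}/logs/{project}/{spider}/{job_id}{ext}"
        if requests.head(url).status_code == 200:
            return url  # e.g. JOBID.log found, stop before trying JOBID.log.gz
    return None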
Can you show me some key logs of ScrapydWeb?
By the way, how do you run two Scrapyd instances on one machine? Via Docker?
By default, ScrapydWeb periodically caches the utf8 and stats files in the background to speed up loading of the utf8 and stats HTML pages
I understand that scrapydweb wants to cache files, but it seems to me that once a log file has been fetched from the source and scrapyd has marked the job as finished, it will not change anymore, so there's no need to waste CPU and bandwidth refetching it constantly (what if one has 20 or 100 scrapyd instances, each with multiple projects and spiders?).
Please check the CACHE_INTERVAL_SECONDS setting.
My CACHE_INTERVAL_SECONDS is set to 300 as well. Does that mean scrapydweb will try to refresh its cache every 300 seconds? If so, it's not working correctly, since it fetches the logs much more often than every 300 seconds.
How do you run two Scrapyd instances on one machine? Via Docker?
Yes I'm using docker to run both scrapyd and scrapydweb.
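Roughly speaking, the two Scrapyd containers publish different ports, and both endpoints are listed in scrapydweb's settings, something like this (hosts and ports here are placeholders, not my real setup):

# scrapydweb_settings.py -- placeholder values
SCRAPYD_SERVERS = [
    '127.0.0.1:6800',  # first scrapyd container
    '127.0.0.1:6801',  # second scrapyd container
]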
Can you show me some key logs of ScrapydWeb?
Scrapydweb's logs just show a continuous stream of POSTs to 127.0.0.1 (I suppose to update its caches with the logs fetched from the scrapyd instances):
127.0.0.1 - - [14/Oct/2018 17:35:39] "POST /2/log/stats/project2/spider1/spider1_2018-10-10_16-56-55/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:35:42] "POST /2/log/utf8/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:04] "POST /2/log/stats/project2/spider2/spider2_2018-10-10_17-02-13/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /1/log/utf8/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:05] "POST /2/log/stats/project1/spider3/spider3_2018-10-11_03-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:06] "POST /1/log/utf8/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:07] "POST /2/log/stats/project1/spider4/spider4_2018-10-11_03-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:08] "POST /1/log/utf8/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /2/log/stats/project1/spider1/spider1_2018-10-11_03-42-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:14] "POST /1/log/utf8/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider5/spider5_2018-10-11_04-02-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider6/spider6_2018-10-11_04-22-02/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /1/log/utf8/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:15] "POST /2/log/stats/project1/spider7/spider7_2018-10-11_04-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:16] "POST /1/log/utf8/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /2/log/stats/project1/spider8/spider8_2018-10-11_05-02-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:18] "POST /1/log/utf8/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider9/spider9_2018-10-11_05-42-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /1/log/utf8/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:21] "POST /2/log/stats/project1/spider10/spider10_2018-10-11_06-22-01/ HTTP/1.1" 200 -
127.0.0.1 - - [14/Oct/2018 17:36:22] "POST /1/log/utf8/project1/spider11/spider11_2018-10-11_06-02-01/ HTTP/1.1" 200 -
...
I have updated my first comment and will improve the caching mechanism in the next version. Thank you for your advice!
In the meantime, you may also comment out the following line to disable caching the Scrapy logs of finished jobs:
https://github.com/my8100/scrapydweb/blob/master/scrapydweb/cache.py#L52
update_cache('finished')
Commenting out that line makes things much better. Thanks!
@waldner Do you have any other suggestions for ScrapydWeb, for example about the Overview page or the Run page, or any other needs? Thank you in advance.
Right now I don't have any suggestions; if I happen to stumble across some other issue, I will report it then. Thanks!
@waldner v0.9.6 updates the caching mechanism: a finished job is now cached only once.
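A rough sketch of that idea (illustrative only, not the actual v0.9.6 code and not ScrapydWeb's internal names): remember which finished jobs have already been cached and skip them in later rounds, so only running jobs keep being re-fetched.

# Finished logs can no longer change, so cache each of them at most once.
cached_finished_jobs = set()

def run_caching_round(running_jobs, finished_jobs, fetch_and_cache):
    for job in running_jobs:
        fetch_and_cache(job)              # running logs may still grow
    for job in finished_jobs:
        if job in cached_finished_jobs:
            continue                      # already cached, skip
        fetch_and_cache(job)
        cached_finished_jobs.add(job)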
I have a machine with two scrapyd instances and one scrapydweb instance running; scrapydweb is connected to both scrapyd instances. However, scrapydweb's CPU usage is very high all the time. Investigating a bit, I've seen that scrapydweb is constantly making requests (more than two per second) to both scrapyd instances, requesting logs (each request is also made twice, once asking for the uncompressed log and once for the compressed one).
Now my question is: why does scrapydweb need to be constantly fetching scrapyd logs? Once it has got them, they aren't going to change.