my8100 / scrapydweb

Web app for Scrapyd cluster management, Scrapy log analysis & visualization, Auto packaging, Timer tasks, Monitor & Alert, and Mobile UI.
DEMO: https://github.com/my8100/files

Refresh crawl status after long time leads to memory error. #11

Closed WNiels closed 5 years ago

WNiels commented 6 years ago

I have a crawl running, and after 87000 seconds since the last refresh, the following error occurs when trying to refresh:

Traceback (most recent call last):
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 2292, in wsgi_app
    response = self.full_dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1815, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1718, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/_compat.py", line 35, in reraise
    raise value
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1813, in full_dispatch_request
    rv = self.dispatch_request()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/flask/views.py", line 88, in view
    return self.dispatch_request(*args, **kwargs)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 94, in dispatch_request
    self.request_scrapy_log()
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/directory/log.py", line 142, in request_scrapy_log
    self.status_code, self.text = self.make_request(self.url, api=False, auth=self.AUTH)
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/scrapydweb/myview.py", line 191, in make_request
    front = r.text[:min(100, len(r.text))].replace('\n', '')
  File "/root/anaconda3/envs/scrapyd_env/lib/python3.7/site-packages/requests/models.py", line 861, in text
    content = str(self.content, encoding, errors='replace')
MemoryError

The crawl itself seems to be running fine, though.
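For context, the failing line accesses r.text, which makes requests hold and decode the entire log response (here roughly 800 MB) in memory at once. A minimal sketch of the general workaround, fetching only a bounded prefix with requests' streaming API (the URL and byte limit below are illustrative, not ScrapydWeb's actual code):

import requests

LOG_URL = "http://127.0.0.1:6800/logs/myproject/myspider/job1.log"  # illustrative URL
MAX_BYTES = 100 * 1024  # keep only the first 100 KB instead of the whole file

def fetch_log_prefix(url, limit=MAX_BYTES):
    """Stream the response and stop after `limit` bytes to avoid loading the full log."""
    chunks, received = [], 0
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=8192):
            chunks.append(chunk)
            received += len(chunk)
            if received >= limit:
                break
    return b"".join(chunks)[:limit].decode("utf-8", errors="replace")

# Roughly what the failing line tried to do, but bounded:
print(fetch_log_prefix(LOG_URL)[:100].replace("\n", ""))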

my8100 commented 6 years ago

In my experience, it's due to insufficient memory. Could you tell me the size of the current log file and your free/total RAM?

my8100 commented 6 years ago

Also, if ScrapydWeb and Scrapyd run on the same host, you can set the SCRAPYD_LOGS_DIR option so that the local log file is read directly. This works only when your Scrapyd server is added as '127.0.0.1' in ScrapydWeb's config file. Note that parsing the log file with regular expressions may still cause a MemoryError if memory is insufficient.

https://github.com/my8100/scrapydweb/blob/master/scrapydweb/default_settings.py#L60

# Set to speed up loading scrapy logs.
# e.g., 'C:/Users/username/logs/' or '/home/username/logs/'
# The setting takes effect only when both ScrapydWeb and Scrapyd run on the same machine,
# and the Scrapyd server ip is added as '127.0.0.1'.
# Check out here to find out where the Scrapy logs are stored:
# https://scrapyd.readthedocs.io/en/stable/config.html#logs-dir
SCRAPYD_LOGS_DIR = ''
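
For example, a minimal sketch of the relevant entries in the ScrapydWeb settings file, assuming Scrapyd runs locally on its default port and the path reflects your own Scrapyd logs_dir (both values below are illustrative):

# ScrapydWeb and Scrapyd share this host, so the server must be added as 127.0.0.1
SCRAPYD_SERVERS = ['127.0.0.1:6800']
# Point to Scrapyd's logs_dir on this machine so logs are read from disk, not over HTTP
SCRAPYD_LOGS_DIR = '/home/username/logs/'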

WNiels commented 6 years ago

Thanks for the fast reply. I don't want to interrupt the crawl, but it should finish within a few days. Then I'll test the above and give an update.

my8100 commented 6 years ago

Actually, you only need to reconfigure and restart ScrapydWeb, without interrupting your crawl.

my8100 commented 6 years ago

It's possible that you won't be able to reproduce the problem after your crawl finishes, since there will then be enough memory for ScrapydWeb to parse the log. As a temporary solution, you could also run another ScrapydWeb instance on a different computer with enough memory.
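A sketch of the settings such a temporary instance might use, assuming the existing Scrapyd server is reachable from the second machine at 192.168.0.10:6800 (this address is an assumption, not taken from the thread):

# Settings for a ScrapydWeb instance on a separate machine with more RAM
SCRAPYD_SERVERS = ['192.168.0.10:6800']  # the existing Scrapyd server (illustrative address)
SCRAPYD_LOGS_DIR = ''                    # left empty: the log files are not local on this machine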

WNiels commented 6 years ago

OK, that's the issue: 600 MB of RAM left and an 800 MB log.

my8100 commented 5 years ago

Fixed in v1.1.0: large log files are now cut into chunks and parsed periodically and incrementally with the help of LogParser.
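A minimal sketch of the idea behind that fix (not LogParser's actual implementation): remember the byte offset reached on the previous pass and, on each pass, read and parse only the bytes appended since then, so memory use stays bounded by the chunk size rather than the full log.

import os
import re
import time

LOG_PATH = '/home/username/logs/myproject/myspider/job1.log'  # illustrative path
ITEM_RE = re.compile(r"'item_scraped_count': (\d+)")  # illustrative stat to extract

def parse_incrementally(path, interval=10):
    """Poll the log file and parse only the newly appended bytes on each pass."""
    offset = 0
    while True:
        size = os.path.getsize(path)
        if size > offset:
            with open(path, 'rb') as f:
                f.seek(offset)
                chunk = f.read(size - offset).decode('utf-8', errors='replace')
            offset = size
            for match in ITEM_RE.finditer(chunk):
                print('item_scraped_count so far:', match.group(1))
        time.sleep(interval)

# Usage: parse_incrementally(LOG_PATH) keeps running alongside the crawl,
# holding only the latest chunk in memory instead of the whole 800 MB file.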