scrapinghub / scrapyrt

HTTP API for Scrapy spiders
BSD 3-Clause "New" or "Revised" License

Running out of file descriptors due to per-spider log files #9

Closed: andrewbaxter closed this issue 9 years ago

andrewbaxter commented 9 years ago

Looking at lsof output, there are only 2-3 TCP connections; 99% of the open files are per-spider log files.

When this happens scrapyrt stops responding to requests.

It happens about once a day under fairly low load (only a couple of requests a minute).
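A rough way to confirm the leak from inside the ScrapyRT process is to read the same /proc data that lsof does. This is a Linux-only sketch, and the assumption that per-spider log files end in `.log` is mine:

```python
import os

def count_open_log_fds(pid=None):
    # Count this process's open descriptors that point at .log files,
    # by reading /proc/<pid>/fd (the same source lsof uses on Linux).
    pid = pid or os.getpid()
    fd_dir = "/proc/%d/fd" % pid
    count = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # Descriptor was closed between listdir() and readlink().
            continue
        if target.endswith(".log"):
            count += 1
    return count

print(count_open_log_fds())
```

If this number grows by one per request and never drops, the per-crawl log handler is never being closed.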

dpnova commented 9 years ago

Python doesn't release file handles for log files by default, IIRC. I'm not sure about the specifics of Twisted logging in this regard, but I would imagine it's fairly similar. Perhaps ScrapyRT just needs to explicitly close the file handler at the end of the crawl?
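A minimal sketch of the explicit cleanup being suggested here; the function names and log layout are hypothetical, not ScrapyRT's actual code:

```python
import logging

def open_spider_log(spider_name, log_dir="logs"):
    # Hypothetical helper: one log file per spider run.
    handler = logging.FileHandler("%s/%s.log" % (log_dir, spider_name))
    logging.getLogger(spider_name).addHandler(handler)
    return handler

def close_spider_log(spider_name, handler):
    # Without this step the FileHandler keeps its file descriptor open
    # until the process exits: one leaked descriptor per crawl.
    logging.getLogger(spider_name).removeHandler(handler)
    handler.close()
```

Calling `close_spider_log()` from whatever fires at the end of the crawl (e.g. Scrapy's `spider_closed` signal) would release the descriptor.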

chekunkov commented 9 years ago

@dpnova if you're right, I assume it's rather a Scrapy issue and should be reported there; ScrapyRT doesn't work with spiders' logs directly.

@andrewbaxter have you had a chance to investigate this issue a bit deeper?

chekunkov commented 9 years ago

Another thing to mention: we are running an instance that has served 60k responses over the last 5 days. The load is approximately the same as @andrewbaxter mentioned, maybe even lower since we cache results, yet it has run non-stop for a couple of weeks with no file descriptor issues.

andrewbaxter commented 9 years ago

Sorry, I haven't. I figured it was working in your case, but unfortunately I don't have any insight into what we're doing differently.

It may be a Scrapy issue; I didn't think it was relevant to that project because, AFAIK, Scrapy doesn't run multiple spiders in one long-lived process, so a leaked log handle would probably not be a concern there.

chekunkov commented 9 years ago

@andrewbaxter #13 and #14 resolved this issue. Please update to a recent ScrapyRT version and reopen this ticket if the issue still exists.