open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

OSError: [Errno 24] Too many open files: 'data/mexico_administracion_publica_federal/20210215_211033/page-... #606

Closed by sentry-io[bot] 2 years ago

sentry-io[bot] commented 3 years ago

Sentry Issue: KINGFISHER-COLLECT-M

OSError: [Errno 24] Too many open files: 'data/mexico_administracion_publica_federal/20210215_211033/page-2845-62.json'
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/scrapy/utils/defer.py", line 164, in maybeDeferred_coro
  File "/home/ocdskfs/scrapyd/.ve/lib/python3.6/site-packages/pydispatch/robustapply.py", line 55, in robustApply
  File "/tmp/kingfisher-1613421542-a33lyf0v.egg/kingfisher_scrapy/extensions.py", line 88, in item_scraped
  File "/tmp/kingfisher-1613421542-a33lyf0v.egg/kingfisher_scrapy/extensions.py", line 102, in _write_file

Error caught on signal handler: <bound method KingfisherFilesStore.item_scraped of <kingfisher_scrapy.extensions.KingfisherFilesStore object at 0x7f4e858156d8>>
yolile commented 3 years ago

This happened while processing the mexico_administracion_publica_federal spider (e.g. https://api.datos.gob.mx/v1/contratacionesabiertas). Each page has a results array with 100 record packages, which means that for each page we are generating and writing 100 files, and that is why we get the "too many open files" error, from both the _write method and the _post to Kingfisher Process.
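
For illustration only, a minimal sketch of that pattern; the class, name and item shape below are placeholders, not the project's actual spider or File item:

import json

import scrapy


class ExamplePaginatedSpider(scrapy.Spider):  # hypothetical spider, for illustration
    name = 'example_paginated'
    start_urls = ['https://api.datos.gob.mx/v1/contratacionesabiertas']

    def parse(self, response):
        data = json.loads(response.text)
        # One item per record package: a single page fans out into 100 file writes.
        for number, package in enumerate(data['results']):
            yield {
                'file_name': f'page-{number}.json',  # plain dict, not the project's File item
                'data': package,
            }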

related to #607

jpmckinney commented 3 years ago

I'm not 100% sure that this spider is the cause. If another spider was running at the same time and opened too many files, then any code that tried to open files would error. For example, the error in #607 is caused by OpenSSL attempting to open CA certificates, but the problem isn't with OpenSSL.

Looking at https://collect.kingfisher.open-contracting.org/jobs, the CompressedFileSpider spiders (which don't close files) running on February 15 included:

That said, I ran them one at a time locally and didn't have an issue. However, I think macOS might dynamically change the maximum number of open files, because when I started, ulimit -n was 256, and now it's 2560. Update: Indeed, my hard limit (ulimit -Hn) is unlimited. However, Scrapyd's hard limit on the server (cat /proc/PID/limits, replacing PID) is 4096, although the ocdskfs user's hard limit (ulimit -Hn) is 1048576.

Scrapy has some settings to control concurrent processing like CONCURRENT_ITEMS. If the problem persists after fixing CompressedFileSpider, we can consider those settings.
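
If it comes to that, a minimal sketch of capping item concurrency per spider (the spider below is a placeholder; CONCURRENT_ITEMS defaults to 100 in Scrapy):

import scrapy


class ExampleSpider(scrapy.Spider):  # placeholder spider
    name = 'example'
    custom_settings = {
        # Process at most 10 scraped items in parallel per response,
        # which limits how many output files are in flight at once.
        'CONCURRENT_ITEMS': 10,
    }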

jpmckinney commented 3 years ago

Aha, this might also be an issue:

https://github.com/open-contracting/kingfisher-collect/blob/c059ecd33873e6da67ae19a962ba7006243f3cdc/kingfisher_scrapy/extensions.py#L212

Update: Nevermind - FILES_STORE is set, so that code is never run.

yolile commented 3 years ago

From that list, costa_rica_poder_judicial_releases is the most "problematic" one, as it has only 1 ZIP file with 14,648 JSON files inside, and in CompressedFileSpider we open each of them and yield File items very quickly. So I will add code to close the files in the ReadData middleware.
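
A rough sketch of that idea only, with a hypothetical name and dict-style items; the project's actual ReadData middleware and File items will differ:

class CloseFileDescriptorsMiddleware:  # hypothetical name
    """Spider middleware that reads and closes any file-like item data."""

    def process_spider_output(self, response, result, spider):
        for item in result:
            data = item.get('data') if isinstance(item, dict) else None
            if hasattr(data, 'read'):  # still an open file object, not bytes
                item['data'] = data.read()
                data.close()  # release the file descriptor immediately
            yield item

Such a middleware would be enabled via Scrapy's SPIDER_MIDDLEWARES setting.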

yolile commented 3 years ago

Related to this, while testing with Kingfisher Process locally, I also got this error: https://sentry.io/organizations/open-contracting-partnership/issues/2080886892/?project=1404981&query=is%3Aunresolved. I'm not sure what the cause could be. Maybe we are sending too many items to Kingfisher Process, and we should use CONCURRENT_ITEMS as well?

jpmckinney commented 3 years ago

Aha, because the current Kingfisher Process creates a collection when it receives the first file, there can be a race condition. The new Kingfisher Process doesn't have this problem. The race condition seems to occur very rarely (only 4 times in Sentry), so I don't think we need to address this now, since the new Kingfisher Process is nearly ready.

yolile commented 3 years ago

closed by #613

jpmckinney commented 2 years ago

@yolile This is now occurring occasionally for Ukraine.

Does Ukraine yield results especially fast? (Like OpenOpps?) If so, maybe we just need to increase the ulimit (which I thought I had done in https://github.com/open-contracting/deploy/issues/246 but the limit is still 4096).

Update: I don't see any DefaultLimitNOFILE in /etc/systemd/system.conf, /etc/systemd/user.conf or /etc/security/limits.conf, or any LimitNOFILE in /etc/systemd/system/scrapyd.service.

Aha, found it now in https://github.com/systemd/systemd/blob/main/NEWS (quoted below). We're running systemd 237 (systemd --version), which passes on the kernel's limit of 4096; we'd need systemd 240 to get 512K by default. We can either set LimitNOFILE in the meantime, or just wait for Dogsbody to upgrade to Ubuntu 20.04.

The Linux kernel's current default RLIMIT_NOFILE resource limit for userspace processes is set to 1024 (soft) and 4096 (hard). Previously, systemd passed this on unmodified to all processes it forked off. With this systemd release the hard limit systemd passes on is increased to 512K, overriding the kernel's defaults and substantially increasing the number of simultaneous file descriptors unprivileged userspace processes can allocate ...

yolile commented 2 years ago

Does Ukraine yield results especially fast?

Yes, their API is fast and we are scraping new pages with 100 OCIDs each. I canceled the job for now as it consumed the Sentry quota. I can also make the spider slower by setting concurrent_request = 1.

yolile commented 2 years ago

We can either set LimitNOFILE in the meantime

That was implemented and already deployed on the server.

I can also make the spider slower by setting concurrent_request = 1

I removed this temporary fix from the ukraine spider and started a new crawl on Saturday. The error appeared again around an hour ago.

cat /proc/22214/limits: Hard limit of open files is 4096

I checked this, and the hard limit is still 4096, even though LimitNOFILE in /etc/systemd/system/scrapyd.service is now 512000. I stopped the spider again.
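
As a cross-check of the /proc/PID/limits output, the limit can also be read from inside the Python process itself (standard library only; the printed values are just examples):

import resource

# Returns the (soft, hard) limit on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'open files: soft={soft}, hard={hard}')  # e.g. soft=1024, hard=4096 before the fix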

jpmckinney commented 2 years ago

The limit doesn't seem to take effect:

$ grep NOFILE /etc/systemd/system/scrapyd.service
LimitNOFILE=512000 # https://github.com/open-contracting/kingfisher-collect/issues/606#issuecomment-953896571
$ systemctl daemon-reload
$ service scrapyd stop
$ service scrapyd start
$ for pid in $(pidof python3); do cat /proc/$pid/limits | grep open; done
Max open files            1024                 4096                 files     
Max open files            1024                 4096                 files     

(Using systemctl instead of service makes no difference.)

The service is run by the ocdskfs user, whose limit is high:

$ ulimit -Hn
1048576

No other NOFILE is set in:

grep -ri NOFILE /etc/
grep -ri NOFILE /usr/lib/systemd/

Checking sysctl, those limits are also high:

$ sysctl -a
…
fs.file-max = 13148218
…
$ cat /proc/sys/fs/file-max
13148218

jpmckinney commented 2 years ago

I'll ask Dogsbody. It's not making sense.

ghost commented 2 years ago

I found an error in the scrapyd logs indicating that the systemd service file (/etc/systemd/system/scrapyd.service) was invalid.

Nov 02 14:57:31 ocp04 systemd[1]: /etc/systemd/system/scrapyd.service:22: Failed to parse resource value, ignoring: 512000 # https://github.com/open-contracting/kingfisher-collect...

I fixed this by hand, moving the GitHub reference comment to its own line, which resolved it.
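
Presumably the corrected fragment of /etc/systemd/system/scrapyd.service looks like this; systemd only treats # as a comment at the start of a line, not after a value, which is why the inline comment made the setting unparseable:

# https://github.com/open-contracting/kingfisher-collect/issues/606#issuecomment-953896571
LimitNOFILE=512000

As above, systemctl daemon-reload and a restart of the scrapyd service are needed for the change to take effect.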

root@ocp04 ~ # cat /proc/17384/limits | grep open    # 17384 = scrapyd
Max open files            512000               512000               files

Is there an existing branch / pull request I can make this change in?

jpmckinney commented 2 years ago

@RobHooper Where is that log file, for future reference?

@yolile can make and deploy the change.

ghost commented 2 years ago

It was logged to syslog (/var/log/syslog.1). I saw it in systemctl status scrapyd.

yolile commented 2 years ago

@yolile can make and deploy the change.

Done. I will start the spider again now to test whether the issue still appears.

yolile commented 2 years ago

The spider has been running for 3 days now and the error hasn't occurred again. Most of the open files are connections waiting to be closed by Kingfisher Process. This will also no longer occur with the new version of Kingfisher Process.