This happened while processing the spider mexico_administracion_publica_federal, e.g. https://api.datos.gob.mx/v1/contratacionesabiertas. Each page has a `results` array with 100 record packages, which means that for each page we generate and write 100 files. That is why we got the "too many open files" error, both from the `_write` method and from the `_post` to Kingfisher Process.
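To make the fan-out concrete, here is a minimal, illustrative Scrapy sketch of how one page of that API turns into ~100 items, each of which later opens a file to write and a connection to POST. The class name and item fields are assumptions for illustration, not the project's actual spider code.

```python
import scrapy


class ContratacionesAbiertasSketchSpider(scrapy.Spider):
    """Illustrative sketch only; not the real kingfisher-collect spider."""

    name = 'mexico_administracion_publica_federal_sketch'
    start_urls = ['https://api.datos.gob.mx/v1/contratacionesabiertas']

    def parse(self, response):
        data = response.json()
        # Each page's "results" array holds 100 record packages. Each one is
        # yielded as its own item, so downstream each page leads to ~100 file
        # writes and ~100 POSTs to Kingfisher Process.
        for number, record_package in enumerate(data['results'], start=1):
            yield {
                'file_name': f'page-{number}.json',
                'data': record_package,
            }
```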
related to #607
I'm not 100% sure that this spider is the cause. If another spider was running at the same time and opened too many files, then any code that tried to open files would error. For example, the error in #607 is caused by OpenSSL attempting to open CA certificates, but the problem isn't with OpenSSL.
Looking at https://collect.kingfisher.open-contracting.org/jobs, the `CompressedFileSpider` spiders (which don't close files) running on February 15 included:
That said, I ran them one at a time locally and didn't have an issue. However, I think macOS might dynamically change the max open files, because when I started, `ulimit -n` was 256, and now it's 2560. Update: Indeed, my hard limit (`ulimit -Hn`) is unlimited. However, Scrapyd's hard limit (`cat /proc/PID/limits`, replacing `PID`) is 4096, although the `ocdskfs` user's hard limit (`ulimit -Hn`) is 1048576.
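For reference, a quick way to check the limits a given Python process actually inherited (since, as above, the shell, the Scrapyd daemon and the ocdskfs user can all see different values) is the standard library's resource module; this is a generic snippet, not project code:

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors.
# getrlimit() returns (soft, hard): "too many open files" (EMFILE) is raised
# against the soft limit, which can be raised at most up to the hard limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f'soft limit: {soft}, hard limit: {hard}')
```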
Scrapy has some settings to control concurrent processing, like `CONCURRENT_ITEMS`. If the problem persists after fixing `CompressedFileSpider`, we can consider those settings.
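For example, a spider could throttle both item processing and requests with something like the following; the values are illustrative guesses, not settings the project actually uses:

```python
import scrapy


class ThrottledSketchSpider(scrapy.Spider):
    """Hypothetical example of per-spider throttling settings."""

    name = 'throttled_sketch'

    custom_settings = {
        # Scrapy's default is 100 items processed in parallel per response;
        # lowering it bounds how many files/connections the pipeline holds open.
        'CONCURRENT_ITEMS': 20,
        # Scrapy's default is 16 concurrent requests; lowering it slows the crawl.
        'CONCURRENT_REQUESTS': 4,
    }
```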
Aha, this might also be an issue:
Update: Never mind - `FILES_STORE` is set, so that code is never run.
From that list, costa_rica_poder_judicial_releases is the most "problematic" one, as it has only 1 ZIP file with 14,648 JSON files inside, and in `CompressedFileSpider` we open each of them and yield File items very fast. So I will add code to close the files in the `ReadData` middleware.
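Roughly the idea, as a hedged sketch rather than the exact change that landed in #613 (the class name echoes the `ReadData` middleware mentioned above, and the `data` field holding an open file-like object is an assumption): read the contents, replace them on the item, and close the descriptor immediately.

```python
class ReadDataSketchMiddleware:
    """Sketch of a spider middleware that closes file objects after reading.

    Assumes items are dicts whose 'data' field may be an open file-like
    object, e.g. a member of a ZIP archive opened by CompressedFileSpider.
    """

    def process_spider_output(self, response, result, spider):
        for item in result:
            data = item.get('data') if isinstance(item, dict) else None
            if hasattr(data, 'read'):
                # Read the contents into memory, then release the file
                # descriptor right away instead of waiting for garbage
                # collection to close it.
                item['data'] = data.read()
                data.close()
            yield item
```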
Related to this: while testing this with Kingfisher Process locally, I also got this error: https://sentry.io/organizations/open-contracting-partnership/issues/2080886892/?project=1404981&query=is%3Aunresolved. I'm not sure what the cause could be. Maybe we are sending too many items to Kingfisher Process, and we should use `CONCURRENT_ITEMS` as well?
Aha, because the current Kingfisher Process creates a collection when it receives the first file, there can be a race condition. The new Kingfisher Process doesn't have this problem. The race condition seems to occur very rarely (only 4 times in Sentry), so I don't think we need to address this now, since the new Kingfisher Process is nearly ready.
closed by #613
@yolile This is now occurring occasionally for Ukraine.
- `ps aux | grep scrapy` (22214)
- `cat /proc/22214/limits`: Hard limit of open files is 4096
- `ll /proc/22214/fd`: Currently has <1000 file descriptors

Does Ukraine yield results especially fast? (Like OpenOpps?) If so, maybe we just need to increase the ulimit (which I thought I had done in https://github.com/open-contracting/deploy/issues/246, but the limit is still 4096).
Update: I don't see any DefaultLimitNOFILE in /etc/systemd/system.conf, /etc/systemd/user.conf or /etc/security/limits.conf or LimitNOFILE in /etc/systemd/system/scrapyd.service
Aha, found it now: https://github.com/systemd/systemd/blob/main/NEWS. We're running systemd 237 (per `systemd --version`), which passes on the kernel's limit of 4096. We'd need systemd 240 to go to 512K by default. We can either set LimitNOFILE in the meantime, or just wait for Dogsbody to upgrade to Ubuntu 20.04.

> The Linux kernel's current default RLIMIT_NOFILE resource limit for userspace processes is set to 1024 (soft) and 4096 (hard). Previously, systemd passed this on unmodified to all processes it forked off. With this systemd release the hard limit systemd passes on is increased to 512K, overriding the kernel's defaults and substantially increasing the number of simultaneous file descriptors unprivileged userspace processes can allocate ...
> Does Ukraine yield results especially fast?
Yes, their API is fast and we are scraping new pages with 100 OCIDs each. I canceled the job now, as it consumed the Sentry quota. I can also make the spider slower by setting `concurrent_request = 1`.
> We can either set LimitNOFILE in the meantime
That was implemented and already deployed on the server.
> I can also make the spider slower by setting `concurrent_request = 1`
I removed this temporary fix from the `ukraine` spider and started a new crawl on Saturday. The error appeared again around an hour ago.
> `cat /proc/22214/limits`: Hard limit of open files is 4096
I checked, and the hard limit is still 4096, even though LimitNOFILE in /etc/systemd/system/scrapyd.service is now 512000. I stopped the spider again.
Limit doesn't seem to take effect:
```
$ grep NOFILE /etc/systemd/system/scrapyd.service
LimitNOFILE=512000 # https://github.com/open-contracting/kingfisher-collect/issues/606#issuecomment-953896571
$ systemctl daemon-reload
$ service scrapyd stop
$ service scrapyd start
$ for pid in $(pidof python3); do cat /proc/$pid/limits | grep open; done
Max open files 1024 4096 files
Max open files 1024 4096 files
```
(Using `systemctl` instead of `service` makes no difference.)
The service is run by the ocdskfs user, whose limit is high:
```
$ ulimit -Hn
1048576
```
No other `NOFILE` is set in:
```
grep -ri NOFILE /etc/
grep -ri NOFILE /usr/lib/systemd/
```
Checking sysctl, those limits are also high:
```
$ sysctl -a
…
fs.file-max = 13148218
…
$ cat /proc/sys/fs/file-max
13148218
```
I'll ask Dogsbody. It's not making sense.
I found an error in the scrapyd logs saying that the systemd service file (/etc/systemd/system/scrapyd.service) was invalid:
```
Nov 02 14:57:31 ocp04 systemd[1]: /etc/systemd/system/scrapyd.service:22: Failed to parse resource value, ignoring: 512000 # https://github.com/open-contracting/kingfisher-collect...
```
I fixed this by hand, moving the GitHub reference comment onto its own line, which fixed it (systemd only treats `#` as the start of a comment at the beginning of a line, so the trailing URL was being parsed as part of the value).
```
root@ocp04 ~ # cat /proc/17384/limits | grep open  # 17384 = scrapyd
Max open files 512000 512000 files
```
Is there an existing branch / pull request I can make this change in?
@RobHooper Where is the location of that log file, for future reference?
@yolile can make and deploy the change.
It was logged to syslog: /var/log/syslog.1
I saw it in `systemctl status scrapyd`.
> @yolile can make and deploy the change.
Done. I will start the spider again now to test whether the issue reappears.
The spider has been running for 3 days now and the error hasn't occurred again. Most of the open files are connections waiting to be closed by Kingfisher Process. This will also no longer occur with the new version of Kingfisher Process.
Sentry Issue: KINGFISHER-COLLECT-M