parmentelat opened this issue 6 years ago
an interesting, possibly/probably related, read here:
https://github.com/elastic/elasticsearch/issues/18022
ulimit -n
shows the allowed number of open files; this seems to be on a per-process basis
the odd thing is that from within a container we can see a very decent limit, in the 10**6 range, but on the host it shows only 1024
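for the record, a quick way to compare both sides (picking any running container; this only reads the limit, it changes nothing):

# soft limit for open files, as seen on the host
ulimit -n
# same thing as seen from inside one of the running containers
# (assumes at least one container is up)
docker exec $(docker ps -q | head -1) sh -c 'ulimit -n'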
besides, docker seems to have a --ulimit
option; however, as noted above, this does not seem to be our problem
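for completeness, this is the kind of invocation that flag expects; the soft:hard values below are arbitrary, just to illustrate:

# start a throw-away container with an explicit nofile limit and print what it sees
docker run --rm --ulimit nofile=65536:65536 ubuntu bash -c 'ulimit -n'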
the global kernel setting is through /proc/sys/fs/file-max
right now thurst has 79190755 and thermals has 26372286 (exactly one third)
it looks as if it is proportional to the amount of memory
that does not seem to be the problem here: I'm getting the same numbers from within a container, so again these settings seem fine
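a couple of related knobs, just to keep them in one place (the write is shown commented out since this value is clearly not the bottleneck):

# kernel-wide ceiling on open file handles (same value as /proc/sys/fs/file-max)
sysctl fs.file-max
# allocated handles / allocated-but-unused / maximum: shows how far we are from the ceiling
cat /proc/sys/fs/file-nr
# raising the ceiling would be: sysctl -w fs.file-max=<some larger value>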
all this boils down to: the culprit seems to be the per-process limit in the root context. Could it affect dockerd?
indeed dockerd does complain; as an experiment I did a live change with
root@thurst ~ (master *) # ulimit -n -S
1024
root@thurst ~ (master *) # ulimit -n -H
4096
root@thurst ~ (master *) # ulimit -n 20480
root@thurst ~ (master *) # ulimit -n -H
20480
root@thurst ~ (master *) # ulimit -n -S
20480
which did not seem to fix the problem, at least not without restarting anything; ulimit only affects the current shell and its children, so an already-running dockerd keeps the limits it was started with
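to see what limits the running dockerd actually operates under, rather than what the shell reports, /proc is more reliable:

# the "Max open files" line shows the soft and hard limits dockerd was started with
grep 'Max open files' /proc/$(pgrep dockerd)/limits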
EDIT:
root@thurst ~ (master *) # grep LimitNOFILE /usr/lib/systemd/system/docker.service
LimitNOFILE=1048576
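if that value ever needs to be changed, a drop-in override seems cleaner than editing the packaged unit file; a sketch, assuming the unit is named docker.service (the drop-in filename and value here are mine):

# create a drop-in that overrides only LimitNOFILE, then restart the daemon
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/nofile.conf <<'EOF'
[Service]
LimitNOFILE=1048576
EOF
systemctl daemon-reload
systemctl restart docker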
also maybe worth investigating, as I guess it could be related: the absence of /etc/docker/daemon.json
triggers this
Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203203015+01:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203267653+01:00" level=error msg="open /etc/docker/daemon.json: no such file or directory"
made an attempt at creating an empty /etc/docker/daemon.json
on thermals
this does not seem to create any additional problem when done live; however, after a reboot it all ends up with nbhosting emitting an internal error 500, so that's not good at all
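should we ever need a non-empty daemon.json, my understanding is that the default-ulimits key is the supported way to set the default per-container limit; a sketch, untested here and with arbitrary values:

# write a minimal daemon.json that only sets the default nofile ulimit
cat > /etc/docker/daemon.json <<'EOF'
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Hard": 1048576, "Soft": 1048576 }
  }
}
EOF
systemctl restart docker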
useful commands to monitor the issue
# running containers
root@thurst ~ (master *) # docker ps | grep -v IMAGE | wc -l
41
# number of files open by the docker daemon
root@thurst ~ (master *) # ls /proc/$(pgrep dockerd)/fd | wc -l
242
# or
root@thurst ~ (master *) # lsof -p $(pgrep dockerd) | wc -l
287
# total number of open files in the system
root@thurst ~ (master *) # lsof | wc -l
1148699
# cat /proc/sys/fs/file-max
79190755
don't get it...
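to correlate these numbers with the load over time, a crude periodic logger along these lines could help (interval and log file name are arbitrary; it avoids the expensive lsof call by reading /proc/sys/fs/file-nr instead):

# append a timestamped line with container count, dockerd fd count and
# system-wide allocated file handles, every 5 minutes
while true; do
    echo "$(date -Is)" \
         "containers=$(docker ps -q | wc -l)" \
         "dockerd_fds=$(ls /proc/$(pgrep dockerd)/fd | wc -l)" \
         "system_files=$(awk '{print $1}' /proc/sys/fs/file-nr)" \
         >> /var/log/nbhosting-fd-count.log
    sleep 300
done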
as of Jan 22 2018, I am adding lsof inside the flotpython3 image to help troubleshoot this leakage
also:
# count the open files in each running container
containers=$(docker ps --format '{{.Names}}')
for container in $containers; do
    echo -n "==================== $container "
    docker exec "$container" lsof | wc -l
done
possible breakthrough; when inside a container:
root@f60d53d63043:~# ulimit -n
1048576
BUT
root@f60d53d63043:~# su - uid39096
No directory, logging in with HOME=/
uid39096@f60d53d63043:/$ id
uid=39096(uid39096) gid=39096(uid39096) groups=39096(uid39096)
uid39096@f60d53d63043:/$ ulimit -n
1024
in the commit below, an attempt at getting this right is made in the dockerfile for the mooc in production:
parmentelat/flotpython@f141802cf2cc5fd481831ba48e553e74098e988b
the image we use is based on ubuntu
when run as root, processes in the container have a decent limit of 1024*1024 open files,
but regular user processes see a limit of 1024 open files, which apparently is not enough for, I expect, the jupyter server process
the proposed patch is to create a limits file in /etc/security/limits.d/
to increase this; see the sketch below
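for the record, the kind of content such a file would carry; the filename and values below are illustrative (the actual change is in the commit referenced above), and it only takes effect for sessions that go through pam_limits, like the su shown above:

# raise the per-user open-files limit for all users in the image
cat > /etc/security/limits.d/90-nofile.conf <<'EOF'
*    soft    nofile    1048576
*    hard    nofile    1048576
EOF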
users who need a higher limit might be the ones who play a lot with pdflatex when they download the notebooks as pdf. What is truly amazing, though, is that some containers have been observed with as many as several hundred thousand entries in lsof.
Putting this deployment under observation; rolling out the new image is done lazily, hence it takes a good day before it is really effective
unfortunately, I could see the bad symptom again today.
it actually shows up immediately when you look at the CPU load; my conjecture is that when the issue triggers, the CPU load starts a steady increase. See the load chart on nbhosting.inria.fr:
on Feb. 3 2018, between 22:00 and 23:00, the load rises from 0 to 2 and never comes back down.
there were only 2 active containers when I ran into this issue just now (Feb 6, circa 10:20); surprisingly, one had been up for 2 days, which should not happen in normal circumstances.
I restarted the docker daemon, which also caused the containers to shut down; this was, of course, enough to bring the box back to normal: the load is back to 0 and I can't see a 'Too many open files' message in the journal anymore.
Today, and not for the first time, I could spot occurrences of this message in the logs:
Jan 14 16:07:45 thurst dockerd-current[1574]: OSError: [Errno 24] Too many open files
This clearly happens after a few weeks of operation. For the time being this is another reason to reboot the nbhosting box every once in a while; a couple of weeks, or maybe even just one, feels about right.
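to keep an eye on when a reboot becomes due, counting the occurrences since the last boot is a quick check (the unit name may be docker or something like docker-current depending on the packaging):

# how many times the symptom has shown up in the journal since boot
journalctl -b -u docker | grep -c 'Too many open files'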
It is not very clear at this point how to handle it properly; the django app does not seem to be the culprit here, although even that is unclear.