parmentelat / nbhosting

nginx + django + docker architecture to host notebooks embedded from open-edx hosted MOOCs

too many open files #58

Open parmentelat opened 6 years ago

parmentelat commented 6 years ago

Today, and not for the first time, I spotted occurrences of this message in the logs:

Jan 14 16:07:45 thurst dockerd-current[1574]: OSError: [Errno 24] Too many open files

This clearly happens after a few weeks of operation. For the time being this is another reason to reboot the nbhosting box every once in a while - a couple of weeks, or maybe even just one, feels about right.

It is not very clear at this point how to handle it properly; the django app does not seem to be the culprit here, although even that is unclear.
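
For the record, a quick way to check how often the message shows up might be to grep the journal; just a sketch, the time window is arbitrary:

# count recent occurrences of the error in the system journal
journalctl --since "1 week ago" | grep -c "Too many open files"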

parmentelat commented 6 years ago

interesting read, possibly/probably related:
https://github.com/elastic/elasticsearch/issues/18022

parmentelat commented 6 years ago

ulimit -n shows the allowed number of open files; this seems to be on a per-process basis

the odd thing is that from within a container we see a rather generous limit in the 10**6 range, but on the host it shows only 1024

besides, docker seems to have a --ulimit option; however, as per the above, this does not seem to be our problem
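
for reference, the --ulimit flag takes per-container soft:hard values; a quick sanity check could look like this (the ubuntu image is only used as an example):

# spawn a throwaway container with an explicit nofile limit and print what it sees
docker run --rm --ulimit nofile=65536:65536 ubuntu bash -c 'ulimit -n'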

parmentelat commented 6 years ago

the global kernel setting is /proc/sys/fs/file-max
right now thurst has 79190755; thermals has 26372286 (roughly one third)
it looks as if it is proportional to the amount of memory; in any case that does not seem to be the problem here

I'm getting the same numbers from within a container, so again the settings seem fine
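
the companion counter /proc/sys/fs/file-nr may be more telling, since it reports how many handles are actually allocated against that maximum:

# allocated handles, free handles, system-wide maximum
cat /proc/sys/fs/file-nr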

parmentelat commented 6 years ago

all this boils down to: the culprit seems to be the per-process limit in the root context. Could it affect dockerd?
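
one way to answer that could be to read the limits of the running dockerd straight from /proc (a sketch; assumes a single dockerd pid):

# soft and hard 'Max open files' of the running docker daemon
grep "open files" /proc/$(pgrep dockerd)/limits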

parmentelat commented 6 years ago

indeed dockerd does complain; as an experiment I did a live change with

root@thurst ~ (master *) # ulimit -n -S
1024
root@thurst ~ (master *) # ulimit -n -H
4096
root@thurst ~ (master *) # ulimit -n 20480
root@thurst ~ (master *) # ulimit -n -H
20480
root@thurst ~ (master *) # ulimit -n -S
20480

which did not seem to fix the problem, at least not without restarting anything
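
that would be consistent with ulimit only affecting the current shell and its children, not processes that are already running; changing a live process would take something like prlimit from util-linux instead (a sketch, not tried here):

# raise the nofile limit (soft:hard) of the already-running daemon
prlimit --pid $(pgrep dockerd) --nofile=1048576:1048576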


EDIT:

root@thurst ~ (master *) # grep LimitNOFILE /usr/lib/systemd/system/docker.service
LimitNOFILE=1048576
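
so the packaged unit already grants dockerd a high limit; for future reference, if that value ever needed changing, a systemd drop-in would be the usual way rather than editing the packaged unit file (sketch only, value is illustrative):

mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/nofile.conf << 'EOF'
[Service]
LimitNOFILE=2097152
EOF
systemctl daemon-reload && systemctl restart docker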
parmentelat commented 6 years ago

also maybe worth investigating, as it could be related I guess: the absence of /etc/docker/daemon.json triggers this

Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203203015+01:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Jan 19 10:47:30 thurst dockerd-current[1571]: time="2018-01-19T10:47:30.203267653+01:00" level=error msg="open /etc/docker/daemon.json: no such file or directory"
parmentelat commented 6 years ago

made an attempt at creating an empty /etc/docker/daemon.json on thermals
this does not seem to create any additional problem when done live; however, after a reboot it all ends up with nbhosting emitting an internal error 500, so that's not good at all
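
one possible explanation: an empty file is not valid JSON (the minimal valid content would be {}), so dockerd may refuse to start at boot. If the file is to exist at all, it could as well carry the default ulimits; a sketch, assuming the default-ulimits key of daemon.json:

cat > /etc/docker/daemon.json << 'EOF'
{
  "default-ulimits": {
    "nofile": { "Name": "nofile", "Soft": 1048576, "Hard": 1048576 }
  }
}
EOF
systemctl restart docker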

parmentelat commented 6 years ago

useful commands to monitor the issue

# running containers
root@thurst ~ (master *) # docker ps | grep -v IMAGE | wc -l
41
# number of files open by the docker daemon
root@thurst ~ (master *) # ls  /proc/$(pgrep dockerd)/fd | wc -l
242
# or 
root@thurst ~ (master *) # lsof -p $(pgrep dockerd) | wc -l
287
# total number of open files in the system
root@thurst ~ (master *) # lsof | wc -l
1148699
# cat /proc/sys/fs/file-max
79190755

don't get it...
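
one caveat that may explain the huge lsof total: lsof lists cwd, txt and memory-mapped entries, and counts the same file once per process, so its line count vastly overstates the number of kernel file handles. Counting actual fd entries under /proc, and comparing with file-nr, gives a more meaningful figure (sketch):

# actual file descriptors currently open, summed over all processes
find /proc/[0-9]*/fd -type l 2>/dev/null | wc -l
# kernel view: allocated handles, free handles, maximum
cat /proc/sys/fs/file-nr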

parmentelat commented 6 years ago

as of jan 22 2018 I am adding lsof inside the flotpython3 image to help troubleshoot this leakage

also:

parmentelat commented 6 years ago

dockers=$(docker ps --format '{{.Names}}')
for docker in $dockers; do
    echo -n ==================== $docker " "
    docker exec -ti $docker lsof | wc -l
done
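
a variant of the same loop that does not require lsof inside the images, counting /proc fd entries instead (untested sketch):

for docker in $(docker ps --format '{{.Names}}'); do
    echo -n "==================== $docker "
    docker exec "$docker" sh -c 'find /proc/[0-9]*/fd -type l 2>/dev/null | wc -l'
done
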
parmentelat commented 6 years ago

possible breakthrough; when inside a container:

root@f60d53d63043:~# ulimit -n
1048576

BUT

root@f60d53d63043:~# su - uid39096
No directory, logging in with HOME=/
uid39096@f60d53d63043:/$ id
uid=39096(uid39096) gid=39096(uid39096) groups=39096(uid39096)
uid39096@f60d53d63043:/$ ulimit -n
1024
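
before concluding, it might be worth checking the limit as seen by the actual notebook server process rather than by a login shell; something along these lines, assuming pgrep is available in the image and the server process matches 'jupyter':

# from inside the container: the nofile limit of the running notebook server
grep "open files" /proc/$(pgrep -f jupyter | head -1)/limits
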
parmentelat commented 6 years ago

in the commit below, an attempt to get this right is made in the Dockerfile for the MOOC in production

parmentelat/flotpython@f141802cf2cc5fd481831ba48e553e74098e988b

the image we use is based on ubuntu

when run as root, processes in the container have a decent limit of 1024*1024 open files,
but regular-user processes see a limit of 1024 open files, which apparently is not enough for, I expect, the jupyter server process

proposed patch is to create a limit file in /etc/security/limits.d/ to increase this
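
for reference, such a file follows the usual pam_limits syntax; a sketch with illustrative filename and values (the commit above has the actual ones):

cat > /etc/security/limits.d/90-nofile.conf << 'EOF'
*    soft    nofile    1048576
*    hard    nofile    1048576
EOF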

users who need a higher limit are probably the ones who play a lot with pdflatex when they download the notebooks as pdf. What is truly amazing though is that some containers have been observed with as many as several hundred thousand entries in lsof.

Putting this deployment under observation; rolling out the new image is done lazily, so it takes a good day before it is really effective.

parmentelat commented 6 years ago

unfortunately, the bad symptom showed up again today.

it actually shows up immediately when you look at the CPU load; my conjecture is that when the issue triggers, the CPU load goes up and stays up. See the load chart on nbhosting.inria.fr for Feb. 3 2018, between 22:00 and 23:00: the load rises from 0 to 2 and does not come back down.

there were only 2 active containers when I ran into this issue just now, Feb 6 circa 10:20; surprisingly, one had been up for 2 days, which should not happen in normal circumstances.

I restarted the docker daemon - which also caused the containers to shut down; this was - of course - enough to bring the box back to normal: the load is back to 0 and I no longer see a 'Too many open files' message in the journal.