openHPI / poseidon

Scalable task execution orchestrator for CodeOcean
MIT License

Nomad Agent "Too many open files" #675

Open mpass99 opened 1 week ago

mpass99 commented 1 week ago

Related to #612

Investigate the Linux error "Failed to allocate directory watch: Too many open files", which appears when a single agent hosts a high number of runners.

mpass99 commented 1 week ago

The warning is not emitted by Nomad itself, but by systemctl restart commands.
It is not triggered by systemctl status nomad.service or systemctl start nomad.service. The error can also be triggered by restarting just Docker instead of Nomad.

In our staging environment, the error does not occur with 160 Runners (aka 108 Nomad subprocesses), but it does occur with 180 Runners. Maybe the fs.inotify.max_user_instances=128 limit is exceeded.
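
To check the inotify hypothesis, one could count the inotify instances currently open per user and compare them with the limit. The following is only a rough sketch, not something taken from Poseidon or Nomad: the function names are made up for illustration, it assumes a Linux /proc filesystem, it needs root to see the file descriptors of other users' processes, and it attributes each instance to the owner of the /proc/<pid> directory, which is only an approximation of how the kernel accounts for them.

```python
import os
from collections import Counter
from pathlib import Path


def read_inotify_limit(proc_root="/proc"):
    """Return the value of fs.inotify.max_user_instances."""
    path = Path(proc_root, "sys/fs/inotify/max_user_instances")
    return int(path.read_text().strip())


def count_inotify_instances(proc_root="/proc"):
    """Count open inotify file descriptors, grouped by process owner uid."""
    counts = Counter()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes
        try:
            uid = entry.stat().st_uid
            for fd in (entry / "fd").iterdir():
                # inotify instances show up as links to this anonymous inode
                if os.readlink(fd) == "anon_inode:inotify":
                    counts[uid] += 1
        except OSError:
            # process exited meanwhile, or we lack permission to inspect it
            continue
    return counts


if __name__ == "__main__":
    limit = read_inotify_limit()
    for uid, used in sorted(count_inotify_instances().items()):
        print(f"uid={uid} inotify_instances={used} limit={limit}")
```

Running this as root once with 160 Runners and once with 180 Runners should show whether the count for any user gets close to 128 before the systemctl restart fails.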

I would consider this low priority, as it seems to be an error that only occurs when using systemctl. Also, investigating this issue further is a narrow path that comes close to crashing the Nomad agents through multiple restarts.

MrSerth commented 1 week ago

Thanks for digging further. Based on your discoveries I agree about the relatively lower priority of this issue :+1:. Shall we set it to pending?