rackslab / Slurm-web

Open source web dashboard for Slurm HPC clusters
https://slurm-web.com
GNU General Public License v3.0

munged loop causes system crash due to open files #211

Closed NathanielMiddleton closed 1 month ago

NathanielMiddleton commented 4 years ago

I have been trying to get slurm-web to work in a Docker container on CentOS 7.6. After about 16-18 hours of the container being up, the whole server crashes because it can no longer spawn processes. The source appears to be munged being spawned every second, with the following repeating constantly in munged.log:

2020-03-30 21:27:46 +0000 Notice: Running on "c5f156dc059c" (172.17.0.2)
2020-03-30 21:27:46 +0000 Info: PRNG seeded with 1024 bytes from "/dev/urandom"
2020-03-30 21:27:46 +0000 Info: Updating supplementary group mapping every 3600 seconds
2020-03-30 21:27:46 +0000 Info: Enabled supplementary group mtime check of "/etc/group"
2020-03-30 21:27:46 +0000 Info: Removed existing socket "/var/run/munge/munge.socket.2"
2020-03-30 21:27:46 +0000 Notice: Starting munge-0.5.11 daemon (pid 3107)
2020-03-30 21:27:46 +0000 Info: Created 2 work threads
2020-03-30 21:27:46 +0000 Info: Found 1 user with supplementary groups in 0.001 seconds
2020-03-30 21:27:47 +0000 Notice: Running on "c5f156dc059c" (172.17.0.2)
2020-03-30 21:27:47 +0000 Info: PRNG seeded with 1024 bytes from "/dev/urandom"
2020-03-30 21:27:47 +0000 Info: Updating supplementary group mapping every 3600 seconds
2020-03-30 21:27:47 +0000 Info: Enabled supplementary group mtime check of "/etc/group"
2020-03-30 21:27:47 +0000 Info: Removed existing socket "/var/run/munge/munge.socket.2"
2020-03-30 21:27:47 +0000 Notice: Starting munge-0.5.11 daemon (pid 3120)
2020-03-30 21:27:47 +0000 Info: Created 2 work threads
2020-03-30 21:27:47 +0000 Info: Found 1 user with supplementary groups in 0.001 seconds

Any ideas on what is happening here?
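
For anyone who wants to confirm the restart rate from their own munged.log, here is a rough sketch that counts daemon start-ups per second. It assumes the log layout shown above (date, time, zone, level, message); the function name count_starts_per_second is made up for illustration and is not part of munge or Slurm-web.

```shell
#!/bin/bash
# Sketch: summarize munged daemon start-ups per second from a log file.
# Assumes lines look like:
#   2020-03-30 21:27:46 +0000 Notice: Starting munge-0.5.11 daemon (pid 3107)
# count_starts_per_second is a hypothetical helper name.
count_starts_per_second() {
    local log_file="$1"
    # Keep only the start-up notices and count them per "DATE TIME" key,
    # then sort so the output is in chronological order.
    awk '/Notice: Starting munge/ { n[$1 " " $2]++ }
         END { for (k in n) print k, n[k] }' "$log_file" | sort
}
```

Calling count_starts_per_second with the path to munged.log prints one line per second in which a daemon was started. A healthy setup should show a single start-up in total; one line for every second would match the loop described above.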

nothing-fr commented 2 years ago

Same error here... don't understand why...

BlackS52 commented 1 year ago

I made a quick workaround, admittedly a silly one:

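cat /etc/service/munge/run
#!/bin/bash
set -e

# Make sure the runtime directory exists and munge owns its directories.
mkdir -p /var/run/munge
chown munge: /var/{log,lib,run}/munge
# Only start munged when few matching processes exist; the count
# usually includes the grep process itself.
if [[ $(ps aux | grep "/usr/sbin/munged -f" | wc -l) -le 2 ]]; then
    exec /sbin/setuser munge /usr/sbin/munged -f
fi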
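
The process check in this workaround can be made a little less fragile. The sketch below counts only lines whose command is /usr/sbin/munged, so the grep process never inflates the result. The helper name count_munged is hypothetical, and the binary path is taken from the script above. Separately, and as an assumption to verify against `man munged` for your version: `-f` is documented there as `--force` and `-F` as `--foreground`. If so, a supervised run script would probably want `-F`, since a daemon that backgrounds itself lets the run script exit and be restarted.

```shell
#!/bin/bash
# Sketch: count munged daemons in a process listing read from stdin.
# Expects one command line per input line, e.g. from `ps -eo args=`.
# count_munged is a hypothetical helper name; the path comes from the
# run script in this thread.
count_munged() {
    # Match on the first field only, so "grep /usr/sbin/munged" and
    # wrapper scripts are not counted. "n + 0" prints 0 when nothing matched.
    awk '$1 == "/usr/sbin/munged" { n++ } END { print n + 0 }'
}
```

In the run script, the test could then read `if [[ $(ps -eo args= | count_munged) -eq 0 ]]; then ...`, which starts munged only when no daemon is running at all.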

rezib commented 2 months ago

This issue concerns Slurm-web v2, which is no longer maintained. You are strongly encouraged to try the new version, v3.0.0, for which a quick start guide is available online: https://docs.rackslab.io/slurm-web/install/quickstart.html

Note that Slurm-web v3.0.0 is officially supported on CentOS 8 with RPM packages. For older versions, we plan to distribute containers and this effort is tracked in https://github.com/rackslab/Slurm-web/issues/266.

Unless someone is motivated to maintain the old version of Slurm-web or you have a justified reason to keep this issue open, it will be closed in a few weeks.

rezib commented 1 month ago

For the reasons explained in the previous comment, I am now closing this issue.