nebari-dev / nebari-slurm

An opinionated open source deployment of JupyterHub based on a Slurm job scheduler.
BSD 3-Clause "New" or "Revised" License

[BUG] Issues while running the ansible playbook #171

Open viniciusdc opened 2 months ago

viniciusdc commented 2 months ago

Context

The new Redis role added to the cluster seems to be missing some validation checks. The final service also seems to race against another default service with the same name (most likely one started by default when the package is first installed). That leaves the port blocked, which in turn leaves the conda-store service down.

Below is a screenshot of the Ansible task that currently fails. This one has a quick fix: [image: Screenshot from 2024-08-30 17 22 29]

This is the troublesome one: [image]

Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * Redis version=7.4.0, bits=64, commit=00000000, modified=0, pid=11781, just started
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * Configuration loaded
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.677 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.677 * monotonic clock: POSIX clock_gettime
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 * Running mode=standalone, port=6379.
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 # Warning: Could not create server TCP listening socket 127.0.0.1:6379: bind: Address already in use
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 # Failed listening on port 6379 (tcp), aborting.
Sep 06 14:37:08 hpc01-test systemd[1]: redis.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 14:37:08 hpc01-test systemd[1]: redis.service: Failed with result 'exit-code'.

Value and/or benefit

Success at deploying and launching nebari-slurm

Anything else?

The main problem might be that redis-server.service starts automatically on installation. The fix might be as simple as disabling that service before we create our custom redis.service: https://github.com/nebari-dev/nebari-slurm/blob/4ff70836391720b0a48d7e7d3a6ab954c576d541/roles/redis/tasks/redis.yaml#L40-L59
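
A minimal sketch of what that could look like as Ansible tasks placed before the custom unit is created. This is not taken from the role itself: the task names, the 30-second timeout, and the choice to wait on the port are assumptions for illustration. The unit name redis-server.service and the address 127.0.0.1:6379 come from the description and logs above.

```yaml
# Hypothetical tasks, not part of the current role.
# Assumes the distribution package registered a unit named
# redis-server.service; the first task fails if no such unit exists.
- name: Stop and disable the packaged redis-server service
  become: true
  ansible.builtin.systemd:
    name: redis-server.service
    state: stopped
    enabled: false

# Confirms nothing is still listening on the port the custom
# redis.service needs (127.0.0.1:6379 in the logs above).
- name: Wait until port 6379 is free
  ansible.builtin.wait_for:
    host: 127.0.0.1
    port: 6379
    state: stopped
    timeout: 30
```

If the second task times out, some other process is still bound to 6379, which would reproduce the "Address already in use" failure regardless of the first task.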

viniciusdc commented 2 months ago

In theory, this issue can be worked around by manually disabling the conflicting service with systemctl. While testing this, though, I still had issues with conda-store not connecting properly.