Context
The new Redis addition to the cluster seems to be missing some validation checks in the current role. The final service also races against a default service with the same name (created when the package is first installed), which blocks the port and in turn takes the conda-store service down.
Below is a snippet of the current Ansible task failing; this should be a quick fix.
This is the troublesome one:
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * Redis version=7.4.0, bits=64, commit=00000000, modified=0, pid=11781, just started
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:C 06 Sep 2024 14:37:08.677 * Configuration loaded
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.677 * Increased maximum number of open files to 10032 (it was originally set to 1024).
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.677 * monotonic clock: POSIX clock_gettime
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 * Running mode=standalone, port=6379.
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 # Warning: Could not create server TCP listening socket 127.0.0.1:6379: bind: Address already in use
Sep 06 14:37:08 hpc01-test redis-server[11781]: 11781:M 06 Sep 2024 14:37:08.678 # Failed listening on port 6379 (tcp), aborting.
Sep 06 14:37:08 hpc01-test systemd[1]: redis.service: Main process exited, code=exited, status=1/FAILURE
Sep 06 14:37:08 hpc01-test systemd[1]: redis.service: Failed with result 'exit-code'.
This issue can, in theory, be worked around by manually disabling the conflicting service using systemctl, though while testing this I still had issues with conda-store not connecting properly.
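Since the role currently has no check that the port is actually usable, one option is a validation task after the service starts, so a port conflict fails the play with a clear error instead of conda-store silently losing its Redis backend. A sketch only; the host/port values and placement in the role are assumptions:

```yaml
# Sketch: validation step after starting our redis.service.
# If the packaged redis-server unit already holds 6379, our unit dies
# (see the journal output above) and this check surfaces it immediately.
- name: Verify Redis is answering on its port
  ansible.builtin.wait_for:
    host: 127.0.0.1
    port: 6379  # assumed default; should match the role's configured port
    timeout: 30
```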
Value and/or benefit
Successful deployment and launch of nebari-slurm.
Anything else?
The main problem might be the default redis-server.service starting up on installation; the fix might be as simple as disabling that service before we create our custom redis.service here: https://github.com/nebari-dev/nebari-slurm/blob/4ff70836391720b0a48d7e7d3a6ab954c576d541/roles/redis/tasks/redis.yaml#L40-L59
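That could look something like the task below, inserted before the role templates the custom unit. A sketch, not tested against the role: the unit name redis-server.service assumes the Debian/Ubuntu packaging of Redis, and it assumes the redis package has already been installed at that point in the play:

```yaml
# Sketch: run after the redis package is installed but before our
# custom redis.service unit is templated and started.
- name: Stop and disable the packaged redis-server unit
  ansible.builtin.systemd:
    name: redis-server.service  # assumed name of the distro-default unit
    state: stopped
    enabled: false
  become: true

- name: Mask the packaged unit so it cannot race our custom redis.service
  ansible.builtin.systemd:
    name: redis-server.service
    masked: true
  become: true
```

Masking (rather than only disabling) also keeps a package upgrade from re-enabling the default unit and reintroducing the port conflict.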