This mostly works, but it is not production-ready. Key things:
podman no longer allows unqualified image names (by default at least). Fixed by switching to fully-qualified names (and fixing the arcus registry to mirror docker.io).
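An alternative (or complement) to fully-qualifying every name would be to restore a default search registry in `registries.conf`. A minimal sketch; the mirror hostname below is a made-up placeholder, not our actual arcus registry address:

```toml
# /etc/containers/registries.conf (system-wide)
# or ~/.config/containers/registries.conf (per-user)

# Registries searched when an image name has no registry prefix.
# Relying on this re-introduces ambiguity, so fully-qualified
# names are still preferable.
unqualified-search-registries = ["docker.io"]

# Optionally pull docker.io images via a local mirror instead
# (hypothetical hostname).
[[registry]]
prefix = "docker.io"
location = "registry.example.com/docker.io"
```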
couldn't reset the podman database after adding the tmp directory config (see roles/podman/config.yml) - fixed by simply not doing these tasks, though I haven't convinced myself this is OK. The original code was specifying `tmp_dir`, which containers.conf documents as having to be on a tmpfs, but the podman docs don't mention that. The rootless tutorial doesn't mention it either, but states that "$XDG_RUNTIME_DIR defaults on most systems to /run/user/$UID", which doesn't exist (with the containers up) for the podman user (and wouldn't be a tmpfs). `podman info` shows both of the relevant directories mounted on `/`. The podman systemd units complain:
```
time="2023-10-13T13:57:06Z" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"
time="2023-10-13T13:57:06Z" level=warning msg="For using systemd, you may need to login using an user session"
time="2023-10-13T13:57:06Z" level=warning msg="Alternatively, you can enable lingering with: `loginctl enable-linger 1001` (possibly as root)"
time="2023-10-13T13:57:06Z" level=warning msg="Falling back to --cgroup-manager=cgroupfs"
time="2023-10-13T13:57:06Z" level=error msg="unlinkat /run/podman/libpod/tmp: permission denied"
```
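Per the warning, one fix worth trying is enabling lingering for the rootless user, so a systemd user session (and hence a tmpfs at /run/user/$UID) exists without an interactive login. A sketch of an Ansible task; the `podman_user` variable name is an assumption, not necessarily what the role actually defines:

```yaml
# Sketch: enable systemd lingering so /run/user/<uid> exists at boot,
# letting rootless podman keep --cgroup-manager=systemd.
- name: Enable systemd lingering for the podman user
  ansible.builtin.command:
    cmd: "loginctl enable-linger {{ podman_user }}"
    # loginctl records lingering users here, so this makes the task idempotent
    creates: "/var/lib/systemd/linger/{{ podman_user }}"
  become: true
```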
the openhpc role has its own PR: https://github.com/stackhpc/ansible-role-openhpc/pull/164. There is some incomplete stuff there (e.g. that PR won't work on RL8), and it also needs the "generic slurm" PR merging so we can define cgroup.conf properly, which appears to be necessary. Really I'd like to move the plugin defaults to use cgroups too, but that doesn't work in a container (although see the OpenHPC slack for a possible workaround).
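For reference, once the "generic slurm" PR is in, defining cgroup.conf might look something like the sketch below; the specific constraint values are assumptions for illustration, not what the PR will actually set:

```ini
# cgroup.conf - Slurm cgroup support (sketch only)
ConstrainCores=yes       # confine jobs to their allocated cores
ConstrainRAMSpace=yes    # confine jobs to their allocated memory
ConstrainDevices=no
```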
monitoring.yml fails because there's no `prometheus-slurm-exporter` build for RL9. This is our repo, so it'd presumably be an easy fix.