Does this mean that images need a consistent UID/GID for each (service) user? That would prevent a new immutable image from being unable to access files on persistent storage created by an old immutable image with different UID->username mappings.
I guess you can probably rely on the playbook creating users with the same UID/GID if the tasks run in the same order, but if you add some new functionality that creates a user in the middle of the playbook, it's going to claim the next incremental UID and throw off all of the UIDs after that one.
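A quick illustration of that drift (usernames and ids here are made up; the exact values depend on what's already allocated):

```shell
# Hypothetical illustration: without an explicit -u, useradd -r just takes
# the next free system uid, so adding a task mid-playbook shifts every
# uid allocated after it on the next image build.
useradd -r svc_a    # e.g. gets uid 989 on the old image
useradd -r svc_new  # added later mid-playbook: on a rebuild *it* gets 989...
useradd -r svc_b    # ...and svc_b now gets a different uid than before
```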
I think what happened here is exactly what you said - I added new functionality and it's changed the users :-(
I think it means the playbooks need to create images with the right user, i.e. if the prometheus storage already exists, create the prometheus user with uid/gid to match the storage. It should probably error if the user exists but with the wrong uid/gid.
Or maybe I hardcode the uid/gid to use??
Or the roles chmod things as required, but that feels bad.
I don't think you'll be able to chmod things that already exist in persistent storage - you will probably need to pre-allocate UIDs/GIDs for any service users that have files in persistent storage. The rest of the users don't matter, but anything that leaves state behind is going to be a potential problem...
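A minimal shell sketch of that pre-allocation idea, assuming we pick a fixed uid/gid per service user (the 990 values are illustrative, not the appliance's real ids):

```shell
#!/usr/bin/env bash
# Sketch: ensure the prometheus service user exists with a fixed uid/gid,
# failing loudly if it already exists with different ids.
# The id values (990/990) are illustrative only.
set -euo pipefail

PROM_UID=990
PROM_GID=990

if getent group prometheus > /dev/null; then
    [ "$(getent group prometheus | cut -d: -f3)" = "$PROM_GID" ] \
        || { echo "prometheus group exists with wrong gid" >&2; exit 1; }
else
    groupadd -r -g "$PROM_GID" prometheus
fi

if getent passwd prometheus > /dev/null; then
    [ "$(id -u prometheus)" = "$PROM_UID" ] \
        || { echo "prometheus user exists with wrong uid" >&2; exit 1; }
else
    useradd -r -u "$PROM_UID" -g prometheus -s /sbin/nologin prometheus
fi
```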
For info: I think the actual problem was changing from using a pre-built image to a generic-cloud image. It looks like the prometheus user changed from 982 on the former to 989 on the latter.
Current users with openhpc-220526-1354.qcow2 (in PR), on the control node during deployment:

- /var/spool/slurm/ - slurm (for slurmctld): 202
- /var/lib/mysql/ - mysql (for slurmdbd): 27
- /var/lib/prometheus - prometheus: uid 981 / gid 976 (NB: user/group is NOT created in the pre-built image)
- /var/lib/grafana/ - grafana: uid 984 / gid 979
- /var/lib/podman - podman: 1001
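For info, mappings like the above can be captured on a deployed node with something like:

```shell
# Print owning user/group (name and numeric id) for each state directory;
# paths are the ones listed above.
stat -c '%n: %U(%u) %G(%g)' \
    /var/spool/slurm /var/lib/mysql /var/lib/prometheus \
    /var/lib/grafana /var/lib/podman
```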
Grafana at least is safe if we precreate the user:
```shell
[root@ci2588470616-control rocky]# rpm -q --scripts grafana
<snip>
if ! getent group "$GRAFANA_GROUP" > /dev/null 2>&1 ; then
    groupadd -r "$GRAFANA_GROUP"
fi
if ! getent passwd "$GRAFANA_USER" > /dev/null 2>&1 ; then
    useradd -r -g grafana -d /usr/share/grafana -s /sbin/nologin \
    -c "grafana user" grafana
fi
<snip>
```
Note that the above repo install also does this:
```shell
# Set user permissions on /var/log/grafana, /var/lib/grafana
mkdir -p /var/log/grafana /var/lib/grafana
chown -R $GRAFANA_USER:$GRAFANA_GROUP /var/log/grafana /var/lib/grafana
chmod 755 /var/log/grafana /var/lib/grafana
```
so the /var/lib directories will always exist, even if grafana_data_dir is set differently in the role: https://github.com/cloudalchemy/ansible-grafana/blob/master/tasks/configure.yml (which assumes the grafana user/group already exists).
Precreating the service users with defined {u,g}ids in the appliance won't be enough - the slurm_image_builder would also have to do it (with the same info, obviously), as the package installs set up most of the users. That affects the CI/CaaS cases which use those images, so we could just live with needing to keep slurm_image_builder consistent until we get rid of it?
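One hypothetical way to keep the two in sync would be a single sourced id table, e.g. (the values are just the ones observed above, for illustration only):

```shell
# Hypothetical shared id table, sourced by both the appliance and
# slurm_image_builder so service users get identical uids everywhere.
# Values are those observed above and purely illustrative.
declare -A SERVICE_UID=(
    [slurm]=202
    [mysql]=27
    [prometheus]=981
    [grafana]=984
)

# e.g. pre-create prometheus before the package install claims an id:
useradd -r -u "${SERVICE_UID[prometheus]}" prometheus
```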
@m-bull to allow us to move forward on this (and hence finish #173) I suggest:

- we define fixed uids/gids for the service users in the appliance (covering e.g. a slurm_image_builder image + a slurm appliance run)
- we set the same ids in slurm_image_builder, and we just tolerate the fact that this will need rejigging if we add packages there which add additional users.

The slurm_image_builder is hopefully going away anyway once we've got a proper CaaS image build pipeline, so I think this is pragmatic.
Sounds sensible.
After rebuilding a cluster both prometheus and grafana failed to start in monitoring.yml. Investigation showed state files were not all owned by the relevant users, e.g.:

<snip>

Running:

<snip>

allowed monitoring.yml to complete.

The cloudalchemy roles create the users, so they'd need patching to check for UID/GID or something.
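The actual command was snipped above, but presumably the fix was a recursive re-own of the state directories, something along these lines:

```shell
# Guess at the elided fix: re-own persistent state to the recreated
# service users (the actual paths/users from the failure weren't shown).
chown -R prometheus: /var/lib/prometheus
chown -R grafana: /var/lib/grafana
```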