stackhpc / ansible-slurm-appliance

A Slurm-based HPC workload management environment, driven by Ansible.

Users are not always the same uid/gid on state directory after rebuild #183

Closed: sjpb closed this issue 11 months ago

sjpb commented 2 years ago

After rebuilding a cluster, both prometheus and grafana failed to start in monitoring.yml. Investigation showed that the state files were not all owned by the relevant users, e.g.:

[root@dev-control rocky]# ls -l /var/lib/state/prometheus/queries.active 
-rw-r--r--. 1 982 977 20001 May  5 12:26 /var/lib/state/prometheus/queries.active
[root@dev-control rocky]# id prometheus
uid=989(prometheus) gid=985(prometheus) groups=985(prometheus)

Running:

sudo chown -R prometheus /var/lib/state/prometheus
sudo chown -R grafana /var/lib/state/grafana

allowed monitoring.yml to complete.

The cloudalchemy roles create the users, so they would need patching to check or pin the UID/GID.

m-bull commented 2 years ago

Does this mean that images need a consistent UID/GID for each (service) user? That would prevent instances of a new immutable image from being unable to access files on persistent storage created by an old immutable image with different UID->username mappings.

I guess you can probably rely on the playbook creating users with the same UID/GID if the tasks run in the same order, but if you add some new functionality that creates a user in the middle of the playbook, it's going to claim the next incremental UID and throw off all of the UIDs after that one.

sjpb commented 2 years ago

I think what happened here is exactly what you said - I added new functionality and it's changed the users :-(

I think it means the playbooks need to create the users to match any existing state. I.e. if the prometheus storage already exists, create the prometheus user with uid/gid matching that storage. It should probably error if the user already exists but with the wrong uid/gid.

Or maybe I hardcode the uid/gid to use??
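
A rough sketch of the first option, matching an existing state directory (the path is the one from the ls output above; the task layout and names are illustrative, not what the appliance actually does):

    # Sketch only: reuse the uid of an existing prometheus state directory,
    # and error out if a prometheus user already exists with a different uid.
    - name: Stat existing prometheus state directory
      ansible.builtin.stat:
        path: /var/lib/state/prometheus
      register: prom_state

    - name: Load current passwd entries
      ansible.builtin.getent:
        database: passwd

    - name: Fail if prometheus already exists with a mismatched uid
      ansible.builtin.assert:
        that: (getent_passwd['prometheus'][1] | int) == prom_state.stat.uid
        fail_msg: existing prometheus user does not match the state directory owner
      when: prom_state.stat.exists and 'prometheus' in getent_passwd

    - name: Create prometheus user with uid matching the state directory
      ansible.builtin.user:
        name: prometheus
        uid: "{{ prom_state.stat.uid }}"
        system: true
        shell: /sbin/nologin
      when: prom_state.stat.exists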

sjpb commented 2 years ago

Or the roles chown things as required, but that feels bad.

m-bull commented 2 years ago

I don't think you'll be able to chown things that already exist in persistent storage - you will probably need to pre-allocate a UID/GID for any service user that has files in persistent storage. The rest of the users don't matter, but anything that leaves state behind is going to be a potential problem...
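
As a sketch of what that pre-allocation could look like (the variable name and id numbers below are illustrative, not the appliance's actual values):

    # Illustrative sketch: pin uid/gid for service users that own persistent state.
    # group_vars:
    stateful_users:
      - { name: prometheus, uid: 981, gid: 976 }
      - { name: grafana, uid: 984, gid: 979 }

    # tasks, run early in the play, before any package installs:
    - name: Pre-create groups for stateful service users
      ansible.builtin.group:
        name: "{{ item.name }}"
        gid: "{{ item.gid }}"
        system: true
      loop: "{{ stateful_users }}"

    - name: Pre-create stateful service users with fixed uid/gid
      ansible.builtin.user:
        name: "{{ item.name }}"
        uid: "{{ item.uid }}"
        group: "{{ item.name }}"
        system: true
        shell: /sbin/nologin
        create_home: false
      loop: "{{ stateful_users }}"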

sjpb commented 2 years ago

For info: I think the actual problem was changing from using a pre-built image to a generic cloud image. It looks like the prometheus user changed from uid 982 on the former to 989 on the latter.

sjpb commented 2 years ago

Current users with openhpc-220526-1354.qcow2 (in PR), on control node during deployment:

/var/spool/slurm/    slurm (for slurmctld): 202
/var/lib/mysql/      mysql (for slurmdbd): 27
/var/lib/prometheus  prometheus: uid 981 / gid 976 (NB user/group is NOT created in the pre-built image)
/var/lib/grafana/    grafana: uid 984 / gid 979
/var/lib/podman      podman: 1001

sjpb commented 2 years ago

Grafana at least is safe if we precreate the user:

[root@ci2588470616-control rocky]# rpm -q --scripts grafana
<snip>
        if ! getent group "$GRAFANA_GROUP" > /dev/null 2>&1 ; then
            groupadd -r "$GRAFANA_GROUP"
        fi
        if ! getent passwd "$GRAFANA_USER" > /dev/null 2>&1 ; then
            useradd -r -g grafana -d /usr/share/grafana -s /sbin/nologin \
                -c "grafana user" grafana
        fi
<snip>

Note that the above rpm install also does this:

        # Set user permissions on /var/log/grafana, /var/lib/grafana
        mkdir -p /var/log/grafana /var/lib/grafana
        chown -R $GRAFANA_USER:$GRAFANA_GROUP /var/log/grafana /var/lib/grafana
        chmod 755 /var/log/grafana /var/lib/grafana

so the /var/lib directories will always exist, even if grafana_data_dir is set differently in the role: https://github.com/cloudalchemy/ansible-grafana/blob/master/tasks/configure.yml (which assumes the grafana user/group already exists)
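
So a minimal sketch of the ordering that keeps grafana safe (uid/gid values here are illustrative) is to pre-create the group and user and only then install the package, so the scriptlet's getent checks become no-ops:

    # Sketch only: pre-create grafana with pinned ids, then install the rpm;
    # the scriptlet's getent checks then leave the existing user/group alone.
    - name: Pre-create grafana group with a fixed gid
      ansible.builtin.group:
        name: grafana
        gid: 979
        system: true

    - name: Pre-create grafana user with a fixed uid
      ansible.builtin.user:
        name: grafana
        uid: 984
        group: grafana
        home: /usr/share/grafana
        shell: /sbin/nologin
        system: true
        create_home: false

    - name: Install grafana (package scriptlet now skips user/group creation)
      ansible.builtin.dnf:
        name: grafana
        state: present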

sjpb commented 2 years ago

Precreating the service users with defined {u,g}ids in the appliance won't be enough - the slurm_image_builder would also have to do it (with the same info, obviously), as the package installs set up most of the users. That affects the CI/CaaS cases which use those images, so we could just live with the fact that we need to keep slurm_image_builder consistent, until we get rid of it?

sjpb commented 2 years ago

@m-bull to allow us to move forward on this (and hence finish #173) I suggest:

The slurm_image_builder is hopefully going away anyway once we've got a proper CaaS image build pipeline, so I think this is pragmatic.

m-bull commented 2 years ago

Sounds sensible.

sjpb commented 2 years ago

Should be fixed by 9893c35