stackhpc / ansible-slurm-appliance

A Slurm-based HPC workload management environment, driven by Ansible.

Update opensearch to 2.9.0 #299

Closed sjpb closed 1 year ago

sjpb commented 1 year ago

Updates opensearch to v2.9.0. This is required because opensearch 2.4.0 fails* on podman v4.4.1.

Also:

* Container startup fails with

Duplicate cpuset controllers detected.
...
Error: Could not find or load main class 

The actual problem is that /sys/fs/cgroup gets mounted twice inside the container with podman v4.4.1, which opensearch 2.4.0 cannot tolerate.
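One way to confirm the double mount from inside the container is to count the entries for /sys/fs/cgroup in the mount table. The following is a minimal Python sketch and is not part of this PR. The function name `cgroup_mount_count` is hypothetical, and it assumes the usual whitespace-separated /proc/mounts layout (device, mount point, filesystem type, options, ...).

```python
def cgroup_mount_count(mounts_text: str, mount_point: str = "/sys/fs/cgroup") -> int:
    """Count entries in /proc/mounts-style text mounted exactly at mount_point.

    Hypothetical helper for illustration only; not taken from the appliance.
    """
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed field order: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 2 and fields[1] == mount_point:
            count += 1
    return count
```

Passing the contents of /proc/mounts read inside the container to this function should return more than 1 if the path really is mounted twice. Sub-mounts such as /sys/fs/cgroup/cpuset are deliberately not counted, since only an exact match on the mount point is considered.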

sjpb commented 1 year ago

Cancelled CI, need image build first.

Image build running in https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/5810024081/job/15750069870

edit: building image openhpc-230809-1401-2aa07061

sjpb commented 1 year ago

Image build running in https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/5811371127/job/15754445881

Built image openhpc-230809-1602-2250239e

sjpb commented 1 year ago

I've checked that upgrading a cluster from current main (e6645fd8dd8c2875e5c8f3981d05eb316e1c2c6c) to 2937725 works ok, in that:

I also then reimaged the cluster again (at 2937725) to check the case where the slurm_jobid_index flag file does exist, reran site.yml, and checked that the opensearch document IDs did not change and monitoring was not duplicated.

Note that document IDs are not Slurm job IDs (but are stable):

[root@main-control rocky]# curl -ks -u admin:${vault_elasticsearch_admin_password} https://localhost:9200/filebeat-7.12.1-2023.08.10/_search?pretty | grep id
        "_id" : "7add60a6e14c4a7c931b298885049ce202050131faeb42a1cdffdd8cbda18e15",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784",
        "_id" : "705983cec81172db226a753f22a1d2adf3667021c8acaf9e3441c47613652955",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784",
        "_id" : "07ca87294ea583986bf129b4ad84e2ed2539c8e7d1eabe6738bfe90d90dfe01d",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"
        "_id" : "e252989764ecf0ebb95af485cd8741dccaf9fdd74d46020351a3ffe1cb05dafb",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"
        "_id" : "c57f31eee7910a1c04dbbf0e4a2e96dffd46b48dc157d5a8d91bad4287e7a070",
            "ephemeral_id" : "cce7e423-94b9-42b5-b173-e3e248d0cf6a",
            "id" : "112144b4-dab0-4f20-948a-a11526b86784"

See comment in environments/common/files/filebeat/filebeat.yml for why they're not actual job IDs.
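The "document IDs did not change" check above was done by eye from the grep output. As a hedged sketch of how it could be automated, the Python below compares the `_id` values of the hits in two saved `_search` responses, one captured before and one after rerunning site.yml. The function names are hypothetical, and it assumes the standard search response layout (`hits.hits[]._id`). Note that `_search` returns only a limited number of hits by default, so a real check would need a large enough `size` or pagination.

```python
import json


def document_ids(search_response: str) -> set:
    """Return the set of hit `_id` values from a `_search` JSON response body."""
    body = json.loads(search_response)
    # Assumed layout: {"hits": {"hits": [{"_id": "...", ...}, ...]}}
    return {hit["_id"] for hit in body["hits"]["hits"]}


def ids_unchanged(before: str, after: str) -> bool:
    """True if both responses contain exactly the same set of document IDs."""
    return document_ids(before) == document_ids(after)
```

Comparing sets ignores the order in which hits are returned. Because document IDs are unique within an index, an unchanged set also means no IDs were added or removed between the two captures.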