Open reinaldosaraiva opened 4 months ago
That would happen if the Ceph cluster isn't functional.
This most commonly happens if you have fully redone your deployment without also wiping the data from the ansible/data directory.
In that scenario you end up with a freshly deployed cluster that's still expecting the servers from the previous deployment, so it's unable to achieve quorum, causing the Ceph API to fail to come online and resulting in the configuration failure you're getting.
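Roughly speaking, a full cleanup would look something like the sketch below (assuming the generated keys and maps live under ansible/data/ceph as elsewhere in this thread; adjust to whatever your deployment actually created):

# Remove the locally generated Ceph cluster state so a redeploy doesn't
# inherit the old mon map and keyrings. Other services deployed from the
# same tree keep their own state under ansible/data and may need the same.
rm -rf ansible/data/ceph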
Ceph monitor initialization issue: monmap min_mon_release older than installed version

ERROR:
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Created slice Slice /system/ceph-mon.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Reached target System Time Synchronized.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc systemd[1]: Started Ceph cluster monitor daemon.
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc ceph-mon[6467]: 2024-07-11T16:51:06.738+0000 7f0cf2c8cc40 -1 mon.server01@-1(probing) e0 current monmap has recorded min_mon_release 15 (octopus) is more than two releases older than installed 18 (reef); you can only upgrade 2 releases at a time
Jul 11 16:51:06 distrobuilder-5cca1f2a-f8a9-4b77-a1df-8173d38747bc ceph-mon[6467]: you should first upgrade to 16 (pacific) or 17 (quincy)
Can you show the output of monmaptool --print ansible/data/ceph/cluster.FSID.mon.map?
Normally the logic in the playbook is to set the min-mon-release in the mon map to the same release as ceph_release (reef by default).
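For reference, the kind of call the playbook is expected to make looks roughly like the sketch below (illustrative only, not the exact task from incus-deploy; the FSID and output path are placeholders):

# Create a fresh mon map and stamp it with the target release (reef = 18) so
# newly bootstrapped mons don't refuse to start on a stale min_mon_release.
# An older monmaptool build may not understand --set-min-mon-release at all.
FSID=$(uuidgen)
monmaptool --create --clobber --fsid "${FSID}" \
    --set-min-mon-release reef \
    "ansible/data/ceph/cluster.${FSID}.mon.map"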
I have already cleaned the data/ceph/ folder and the others. I also tried both the Quincy and Reef versions. I am lost with this deployment.
Also the output of git rev-parse HEAD would be useful.
root@haruunkal:~/incus-deploy# git rev-parse HEAD
f207054ed42fbcfb9916c4452e8abc60bd14bcbb
Okay, so it shouldn't be due to a lack of support for calling monmaptool with the needed set-min-mon-release, but then it's pretty confusing why it would have set a release of 15 when it should have been passed 18.
The output of monmaptool --print ansible/data/ceph/cluster.FSID.mon.map may help figure it out.
Thank you very much for your support. It seems that there was an issue with my lab workstation that was resolved only when I disabled the IPv6 network. After that, the entire process ran perfectly.
root@haruunkal:~/incus-deploy# monmaptool --print ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
monmaptool: monmap file ansible/data/ceph/cluster.e2850e1f-7aab-472e-b6b1-824e19a75071.mon.map
epoch 0
fsid e2850e1f-7aab-472e-b6b1-824e19a75071
last_changed 2024-07-11T15:15:56.636758-0300
created 2024-07-11T15:15:56.636758-0300
min_mon_release 15 (octopus)
election_strategy: 1
0: v1:10.177.121.10:6789/0 mon.server03
1: v1:10.177.121.13:6789/0 mon.server01
2: v1:10.177.121.242:6789/0 mon.server02
Haha. Another error now:

TASK [Install the Incus package] ***
task path: /root/incus-deploy/ansible/books/incus.yaml:60
Yeah, so the min_mon_release 15 (octopus) is obviously going to be a problem, but I don't get why it would be set to that when we specifically call monmaptool with the argument to set it to 18...
Maybe that older version of monmaptool doesn't know how to handle that properly?
You could add the Ceph repository to your own machine and then update to a newer version of monmaptool; that would certainly fix that issue, it just shouldn't be necessary...
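On an Ubuntu 22.04 controller that could look roughly like this (a sketch assuming the upstream reef repository; exact package names can differ between releases, with monmaptool shipping in ceph-base on recent Debian/Ubuntu packaging):

# Add the upstream Ceph (reef) repository on the machine running Ansible,
# then pull in a current build of the packages providing monmaptool and the ceph CLI.
sudo install -d /etc/apt/keyrings
sudo wget -qO /etc/apt/keyrings/ceph.asc https://download.ceph.com/keys/release.asc
echo "deb [signed-by=/etc/apt/keyrings/ceph.asc] https://download.ceph.com/debian-reef/ $(lsb_release -sc) main" \
    | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update
sudo apt-get install ceph-base ceph-common
dpkg -l | grep -E 'ceph-(base|common)'   # confirm an 18.x (reef) build is now installed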
Having the exact same issue here
Description: When running the Ansible playbook deploy.yaml from the incus-deploy project, an error occurs while attempting to enable the msgr2 messenger in Ceph. The ceph mon enable-msgr2 command fails with a timeout, indicating that it could not connect to the RADOS cluster.
Error Message:

fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}

fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
Steps to Reproduce:
1. Execute the Ansible playbook deploy.yaml in the directory ~/incus-deploy/ansible.
2. Observe the error during the task that enables the msgr2 messenger in Ceph.

Expected Behavior:
The ceph mon enable-msgr2 command should execute without errors, enabling the msgr2 messenger in the Ceph cluster.
Actual Behavior:
The ceph mon enable-msgr2 command fails with a timeout, indicating it could not connect to the RADOS cluster.
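A few quick checks along these lines can show whether the monitors ever formed a quorum before enable-msgr2 was attempted (a diagnostic sketch; the unit name assumes the usual ceph-mon@<short hostname> pattern used by the packages):

# Run on one of the mon hosts (e.g. server01):
systemctl status ceph-mon@$(hostname -s)       # is the monitor daemon running at all?
journalctl -u ceph-mon@$(hostname -s) -n 50    # look for probing/monmap/quorum errors
ss -tlnp | grep -E ':(6789|3300)'              # is it listening on the mon v1/v2 ports?
ceph -s --connect-timeout 10                   # can the admin keyring reach the cluster?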
Additional Details:
The error occurs on multiple servers (server01, server02, server03).
Specific error message: RADOS timed out (error connecting to the cluster).
The playbook was executed as root.

Environment:
Ansible version: 2.17.1
Ubuntu: 22.04
Execute:
root@haruunkal:~/incus-deploy/terraform# cd ../ansible/
root@haruunkal:~/incus-deploy/ansible# ansible-playbook deploy.yaml
PLAY [Ceph - Generate cluster keys and maps] ****
TASK [Gathering Facts] **
[WARNING]: Platform linux on host server03 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server03]
[WARNING]: Platform linux on host server04 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server04]
[WARNING]: Platform linux on host server02 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server02]
[WARNING]: Platform linux on host server05 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server05]
[WARNING]: Platform linux on host server01 is using the discovered Python interpreter at /usr/bin/python3.10, but future installation of another Python interpreter could change the meaning of that path. See https://docs.ansible.com/ansible-core/2.17/reference_appendices/interpreter_discovery.html for more information.
ok: [server01]
TASK [Generate mon keyring] ***** changed: [server03 -> 127.0.0.1] ok: [server04 -> 127.0.0.1] ok: [server01 -> 127.0.0.1] ok: [server05 -> 127.0.0.1] ok: [server02 -> 127.0.0.1]
TASK [Generate client.admin keyring] **** changed: [server03 -> 127.0.0.1] ok: [server04 -> 127.0.0.1] ok: [server01 -> 127.0.0.1] ok: [server05 -> 127.0.0.1] ok: [server02 -> 127.0.0.1]
TASK [Generate bootstrap-osd keyring] *** changed: [server03 -> 127.0.0.1] ok: [server04 -> 127.0.0.1] ok: [server01 -> 127.0.0.1] ok: [server05 -> 127.0.0.1] ok: [server02 -> 127.0.0.1]
TASK [Generate mon map] ***** changed: [server03 -> 127.0.0.1] ok: [server04 -> 127.0.0.1] ok: [server01 -> 127.0.0.1] ok: [server05 -> 127.0.0.1] ok: [server02 -> 127.0.0.1]
RUNNING HANDLER [Add key to client.admin keyring] *** changed: [server03 -> 127.0.0.1]
RUNNING HANDLER [Add key to bootstrap-osd keyring] ** changed: [server03 -> 127.0.0.1]
RUNNING HANDLER [Add nodes to mon map] ** changed: [server03 -> 127.0.0.1] => (item={'name': 'server01', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe2d:4c57'}) changed: [server03 -> 127.0.0.1] => (item={'name': 'server02', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe05:31f6'}) changed: [server03 -> 127.0.0.1] => (item={'name': 'server03', 'ip': 'fd42:60dc:dec6:a73b:216:3eff:fe01:1c21'})
PLAY [Ceph - Add package repository] ****
TASK [Gathering Facts] ** ok: [server04] ok: [server05] ok: [server03] ok: [server01] ok: [server02]
TASK [Create apt keyring path] ** ok: [server03] ok: [server01] ok: [server05] ok: [server04] ok: [server02]
TASK [Add ceph GPG key] ***** changed: [server04] changed: [server03] changed: [server05] changed: [server01] changed: [server02]
TASK [Get DPKG architecture] **** ok: [server04] ok: [server03] ok: [server05] ok: [server01] ok: [server02]
TASK [Add ceph package sources] ***** changed: [server03] changed: [server05] changed: [server04] changed: [server02] changed: [server01]
RUNNING HANDLER [Update apt] **** changed: [server01] changed: [server04] changed: [server05] changed: [server03] changed: [server02]
PLAY [Ceph - Install packages] **
TASK [Gathering Facts] ** ok: [server01] ok: [server04] ok: [server05] ok: [server03] ok: [server02]
TASK [Install ceph-common] ** changed: [server02] changed: [server03] changed: [server05] changed: [server04] changed: [server01]
TASK [Install ceph-mon] ***** skipping: [server04] skipping: [server05] changed: [server03] changed: [server01] changed: [server02]
TASK [Install ceph-mgr] ***** skipping: [server04] skipping: [server05] changed: [server03] changed: [server02] changed: [server01]
TASK [Install ceph-mds] ***** skipping: [server04] skipping: [server05] changed: [server01] changed: [server02] changed: [server03]
TASK [Install ceph-osd] ***** changed: [server01] changed: [server04] changed: [server03] changed: [server02] changed: [server05]
TASK [Install ceph-rbd-mirror] ** skipping: [server01] skipping: [server02] skipping: [server04] skipping: [server05] skipping: [server03]
TASK [Install radosgw] ** skipping: [server01] skipping: [server02] skipping: [server03] changed: [server04] changed: [server05]
PLAY [Ceph - Set up config and keyrings] ****
TASK [Transfer the cluster configuration] *** changed: [server01] changed: [server04] changed: [server03] changed: [server05] changed: [server02]
TASK [Create main storage directory] **** ok: [server04] ok: [server01] ok: [server03] ok: [server05] ok: [server02]
TASK [Create monitor bootstrap path] **** skipping: [server05] skipping: [server04] changed: [server01] changed: [server03] changed: [server02]
TASK [Create OSD bootstrap path] **** changed: [server05] changed: [server04] changed: [server01] changed: [server03] changed: [server02]
TASK [Transfer main admin keyring] ** changed: [server05] changed: [server03] changed: [server01] changed: [server02] changed: [server04]
TASK [Transfer additional client keyrings] ** skipping: [server05] skipping: [server03] skipping: [server04] skipping: [server01] skipping: [server02]
TASK [Transfer bootstrap mon keyring] *** skipping: [server05] skipping: [server04] changed: [server03] changed: [server02] changed: [server01]
TASK [Transfer bootstrap mon map] *** skipping: [server05] skipping: [server04] changed: [server03] changed: [server02] changed: [server01]
TASK [Transfer bootstrap OSD keyring] *** changed: [server05] changed: [server04] changed: [server01] changed: [server03] changed: [server02]
RUNNING HANDLER [Restart Ceph] ** changed: [server05] changed: [server03] changed: [server02] changed: [server04] changed: [server01]
PLAY [Ceph - Deploy mon] ****
TASK [Gathering Facts] ** ok: [server01] ok: [server02] ok: [server05] ok: [server04] ok: [server03]
TASK [Bootstrap Ceph mon] *** skipping: [server04] skipping: [server05] changed: [server02] changed: [server03] changed: [server01]
TASK [Enable and start Ceph mon] **** skipping: [server04] skipping: [server05] changed: [server02] changed: [server03] changed: [server01]
RUNNING HANDLER [Enable msgr2] **
fatal: [server01]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.070563", "end": "2024-07-11 13:43:18.284315", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.213752", "stderr": "2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.279+0000 7ff21f567640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server03]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.109144", "end": "2024-07-11 13:43:18.320621", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.211477", "stderr": "2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.316+0000 7fc48f66d640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
fatal: [server02]: FAILED! => {"changed": true, "cmd": "ceph mon enable-msgr2", "delta": "0:05:00.093801", "end": "2024-07-11 13:43:18.316757", "msg": "non-zero return code", "rc": 1, "start": "2024-07-11 13:38:18.222956", "stderr": "2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300\n[errno 110] RADOS timed out (error connecting to the cluster)", "stderr_lines": ["2024-07-11T13:43:18.314+0000 7f4cb7b4a640 0 monclient(hunting): authenticate timed out after 300", "[errno 110] RADOS timed out (error connecting to the cluster)"], "stdout": "", "stdout_lines": []}
PLAY RECAP **
server01 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server02 : ok=29 changed=18 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server03 : ok=32 changed=25 unreachable=0 failed=1 skipped=3 rescued=0 ignored=0
server04 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0
server05 : ok=22 changed=11 unreachable=0 failed=0 skipped=10 rescued=0 ignored=0