This was the trigger for implementing a log collection mechanism in sit-environment. Despite noticing it earlier, I must admit that I failed to report it and follow up by making the required changes on the Jenkins side. I will now come up with a change to make the statedir from the host available as an artifact for each job.
@xhernandez From a first pass over the logs available with a recent run, there was nothing obvious that could result in an smb service restart failure.
On the other hand, we may also consider using smbcontrol as an alternative method to reload the configuration.
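For reference, smbcontrol can message the running daemons directly. A minimal sketch of what that could look like as an Ansible task (the task itself is my assumption, not an existing sit-environment task):

```yaml
# Hypothetical task, not taken from the sit-environment roles:
# ask all running smbd processes to re-read smb.conf in place,
# avoiding a full service restart that could race with CTDB.
- name: Reload smbd configuration via smbcontrol
  ansible.builtin.command: smbcontrol smbd reload-config
```

Because nothing is stopped or started, this would not compete with a CTDB-initiated restart of the service.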
Checking the logs from CTDB, it seems that it was doing a recovery that ended at roughly the same time that Ansible was restarting smb, so both Ansible and CTDB were trying to restart smb, which could cause some issues. However, I'm not sure why CTDB was in recovery mode for so long. It should be healthy after installation, right?
The recovery process (or even the initial start) will take a couple of seconds to complete. Looking at the order of tasks, the CTDB restart/start is the last one in the list for the ctdb.setup role, and the next smb restart comes in the middle of the set of tasks in the samba.setup role.
Thursday 02 November 2023 02:30:12 +0000 (0:00:00.549) 0:06:58.547 *****
changed: [storage0]
changed: [storage2]
changed: [storage1]
Thursday 02 November 2023 02:30:32 +0000 (0:00:00.621) 0:07:18.247 *****
changed: [storage0]
changed: [storage1]
fatal: [storage2]: FAILED! => {"changed": false, "msg": "Unable to start service smb: Job for smb.service canceled.\n"}
In the above failure case we see an interval of 20 seconds between the CTDB restart and the smb restart.
Wednesday 01 November 2023 02:29:44 +0000 (0:00:00.565) 0:06:58.138 ****
changed: [storage1]
changed: [storage0]
changed: [storage2]
Wednesday 01 November 2023 02:30:13 +0000 (0:00:00.553) 0:07:27.315 ****
changed: [storage1]
changed: [storage2]
changed: [storage0]
In the above success case we see an interval of 29 seconds, which gives CTDB more time to finish its recovery before smb is restarted.
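If the overlap with CTDB recovery is indeed the culprit, one possible mitigation would be to wait for CTDB to report healthy before the smb restart in samba.setup. A minimal sketch under that assumption (task name, retry counts, and placement are mine, not taken from the actual roles):

```yaml
# Hypothetical task, not from the actual sit-environment roles:
# poll until every CTDB node reports healthy; 'ctdb nodestatus all'
# exits non-zero while any node still carries unhealthy flags.
- name: Wait for CTDB to become healthy before restarting smb
  ansible.builtin.command: ctdb nodestatus all
  register: ctdb_health
  retries: 12
  delay: 5
  until: ctdb_health.rc == 0
  changed_when: false
```

With 12 retries at 5-second intervals this allows up to a minute for recovery to finish, which comfortably covers the 20-29 second window seen above.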
WRT CephFS, I have now seen multiple CI runs for PRs which fail with the following.
E.g.:
https://jenkins-samba.apps.ocp.cloud.ci.centos.org/job/samba_cephfs-integration-test-cases/122/console
https://jenkins-samba.apps.ocp.cloud.ci.centos.org/job/samba_cephfs-integration-test-cases/121/console