samba-in-kubernetes / sit-environment

SIT (Samba Integration Testing) Framework
GNU General Public License v3.0

Cephfs: CI runs fail. #53

Closed spuiuk closed 1 year ago

spuiuk commented 1 year ago

With respect to CephFS, I have now seen multiple CI runs for PRs fail with the following:

TASK [samba.setup : Restart samba] *********************************************
Thursday 19 October 2023  17:12:39 +0000 (0:00:00.684)       0:06:33.374 ****** 
fatal: [storage1]: FAILED! => {"changed": false, "msg": "Unable to start service smb: Job for smb.service canceled.\n"}
fatal: [storage0]: FAILED! => {"changed": false, "msg": "Unable to start service smb: Job for smb.service canceled.\n"}
fatal: [storage2]: FAILED! => {"changed": false, "msg": "Unable to start service smb: Job for smb.service canceled.\n"}

E.g.:
https://jenkins-samba.apps.ocp.cloud.ci.centos.org/job/samba_cephfs-integration-test-cases/122/console
https://jenkins-samba.apps.ocp.cloud.ci.centos.org/job/samba_cephfs-integration-test-cases/121/console

anoopcs9 commented 1 year ago

This was the trigger for implementing a log collection mechanism in sit-environment. Despite noticing it earlier, I must admit that I failed to report it and follow up by making the required changes on the Jenkins side. I will now come up with a change to make the statedir from the host available as artifacts for each job.
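
For reference, a rough sketch of one way to do this with Ansible tasks (the statedir path and destination directory below are assumptions for illustration, not the actual sit-environment layout):

# Hypothetical sketch: tar up each host's statedir and pull it back to the
# controller so Jenkins can publish it as a build artifact.
- name: Create a tarball of the statedir
  ansible.builtin.command: tar czf /tmp/statedir-{{ inventory_hostname }}.tar.gz /srv/sit/statedir
  changed_when: true

- name: Fetch the tarball to the controller for archiving
  ansible.builtin.fetch:
    src: /tmp/statedir-{{ inventory_hostname }}.tar.gz
    dest: artifacts/
    flat: true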

anoopcs9 commented 1 year ago

@xhernandez From a first pass over the logs available for a recent run, there was nothing obvious that could explain the smb service restart failure.

On the other hand, we could also consider using smbcontrol as an alternative way to reload the configuration.
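
Something along these lines (an untested sketch; smbcontrol asks the running daemons to re-read smb.conf instead of going through a systemd restart):

# Illustrative alternative to restarting smb.service:
- name: Reload Samba configuration via smbcontrol
  ansible.builtin.command: smbcontrol all reload-config
  changed_when: true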

xhernandez commented 1 year ago

Checking the logs from CTDB, it seems that it was doing a recovery that ended roughly at the same time that ansible was restarting smb, so both ansible and ctdb were trying to restart smb, which could cause some issues. However, I'm not sure why CTDB was in recovery mode for so long. It should be healthy after installation, right?

anoopcs9 commented 1 year ago

Checking the logs from CTDB, it seems that it was doing a recovery that ended roughly at the same time that ansible was restarting smb, so both ansible and ctdb were trying to restart smb, which could cause some issues. However, I'm not sure why CTDB was in recovery mode for so long. It should be healthy after installation, right?

The recovery process (or even the initial start) takes a couple of seconds to complete. Looking at the order of tasks, the CTDB restart/start is the last task in the ctdb.setup role, and the next smb restart comes in the middle of the set of tasks in the samba.setup role.

Thursday 02 November 2023  02:30:12 +0000 (0:00:00.549)       0:06:58.547 ***** 
changed: [storage0]
changed: [storage2]
changed: [storage1]
Thursday 02 November 2023  02:30:32 +0000 (0:00:00.621)       0:07:18.247 ***** 
changed: [storage0]
changed: [storage1]
fatal: [storage2]: FAILED! => {"changed": false, "msg": "Unable to start service smb: Job for smb.service canceled.\n"}

In the failure case above there is an interval of only 20 seconds between the CTDB start and the smb restart.

Wednesday 01 November 2023  02:29:44 +0000 (0:00:00.565)       0:06:58.138 **** 
changed: [storage1]
changed: [storage0]
changed: [storage2]
Wednesday 01 November 2023  02:30:13 +0000 (0:00:00.553)       0:07:27.315 **** 
changed: [storage1]
changed: [storage2]
changed: [storage0]

In the success case above the interval is 29 seconds.
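
If the timing theory holds, one possible mitigation (just a sketch, not something the roles do today) would be to wait until CTDB reports the node healthy before samba.setup touches smb.service:

# Hypothetical guard: ctdb nodestatus exits non-zero while the node is
# unhealthy, so retry until it succeeds before restarting smb.
- name: Wait for CTDB to become healthy
  ansible.builtin.command: ctdb nodestatus
  register: ctdb_health
  until: ctdb_health.rc == 0
  retries: 30
  delay: 2
  changed_when: false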