scylladb / scylla-machine-image

packer: fix provision race-conditions following reboot #459

Closed benipeled closed 1 year ago

benipeled commented 1 year ago

I found that the GCE build periodically fails while uploading files to the Packer instance:

14:10:39      gce: NEEDRESTART-KEXP: 5.15.0-1035-gcp
14:10:39      gce: NEEDRESTART-KSTA: 3
14:10:43  ==> gce: Uploading files/ => /home/ubuntu/
14:11:03  ==> gce: Upload failed: wait: remote command exited without exit status or exit signal
.....
14:11:50  Build 'gce' errored after 2 minutes 30 seconds: wait: remote command exited without exit status or exit signal
14:11:50  
14:11:50  ==> Wait completed after 2 minutes 30 seconds
14:11:50  
14:11:50  ==> Some builds didn't complete successfully and had errors:
14:11:50  --> gce: wait: remote command exited without exit status or exit signal
14:11:50  
14:11:50  ==> Builds finished but no artifacts were created.

According to the Packer docs [0], this may be caused by the reboot (part of the recently added kernel-install step), which creates a race condition between provisioners:

Sometimes, when executing a command like reboot, the shell script will return and Packer will start executing the next one before SSH actually quits and the machine restarts

We should handle this with pause_before. I set it to 10s; if that does not help, I'll increase it to 20s and add ssh_read_write_timeout.

[0] https://developer.hashicorp.com/packer/docs/provisioners/shell#handling-reboots
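
For illustration, a minimal sketch of how these settings could look in an HCL2 Packer template. This is not the actual template from this repo: the source name, the inline reboot command, the use of expect_disconnect, and the 5m timeout value are assumptions. Only the 10s pause and the files/ => /home/ubuntu/ upload come from the description and logs above.

```hcl
# Hypothetical fragment; project, image, zone, and credential settings are omitted.
source "googlecompute" "gce" {
  # Communicator option: how long to wait for a remote command to end.
  # The value here is an illustrative assumption.
  ssh_read_write_timeout = "5m"
}

build {
  sources = ["source.googlecompute.gce"]

  # Stand-in for the kernel-install step that ends in a reboot (assumed command).
  provisioner "shell" {
    inline            = ["sudo reboot"]
    expect_disconnect = true # tolerate the SSH session dropping during the reboot
  }

  # The upload that was failing; wait before starting so the reboot can take effect.
  provisioner "file" {
    source       = "files/"
    destination  = "/home/ubuntu/"
    pause_before = "10s"
  }
}
```

pause_before goes on the provisioner that follows the reboot, not on the one that triggers it, since the goal is to delay the next step until the machine has actually gone down and come back.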

benipeled commented 1 year ago

Verified with 3 builds (137, 136, 135) - works as expected https://jenkins.scylladb.com/job/scylla-master/job/releng-testing/job/next-machine-image/135

[2023-06-15T06:50:58.487Z] ==> gce: Pausing 10s before the next provisioner...
...
[2023-06-15T06:51:06.932Z] ==> gce: Uploading files/ => /home/ubuntu/
[2023-06-15T06:51:07.545Z]     gce: status: done

benipeled commented 1 year ago

Failed on master with an SSH connection failure, so the fix helped but the pause needs to be longer:

https://jenkins.scylladb.com/job/scylla-master/job/next-machine-image/478/consoleFull

15:40:13  Build 'gce' errored after 2 minutes 12 seconds: dial tcp 34.78.148.68:22: connect: connection refused
15:40:13  
15:40:13  ==> Wait completed after 2 minutes 12 seconds
15:40:13  
15:40:13  ==> Some builds didn't complete successfully and had errors:
15:40:13  --> gce: dial tcp 34.78.148.68:22: connect: connection refused

Handled on https://github.com/scylladb/scylla-machine-image/pull/461