Hi Stephen,
Yesterday I wrongly attributed some strange behaviour to restarts. The cause was actually an invalid debugging configuration, which I used to create a Windows Bastion for debugging Friday afternoon's issues with the Windows slaves. If you look at the targets that create the environments, they check for the existence of a debugging variable and, if it is set, use a debug.tfvars file for Terraform. The main intention of this is to enable a Windows Bastion host by setting a count variable to 1.
It turned out this file was out of date and was using old values for variables like AMI IDs and machine sizes. Based on those values, when Terraform ran it resized the Jenkins master (it turns out a running machine can be resized without destroying it) and used a different AMI for the Windows slaves, meaning they were destroyed and recreated. That explains why the Windows machines looked fresh. However, the Windows slaves are now using the wrong AMI and the Jenkins master is the wrong size, so these changes need to be reverted.
Unfortunately this doesn't explain why the SSL certificate needed to be regenerated, since the proxy instance wasn't touched. I've reproduced the behaviour on staging, but I still don't know what causes it.
I've now changed the debugging setup so that it just sets a single variable rather than using a whole set of debugging variables, and the debug.tfvars files have been deleted.
I've reproduced all this behaviour in the staging environment first, so once this is merged I can re-provision and that should get Jenkins back to normal.
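As a rough illustration of the single-variable idea described above, here is a minimal sketch in Python. The names DEBUG and bastion_count are hypothetical stand-ins, not the actual names used in the repo, and the real targets may well be structured differently. The point is only that debug mode passes one explicit -var override to Terraform instead of loading a whole var file that can drift out of date.

```python
import os


def terraform_apply_args(env):
    """Build the argument list for `terraform apply`.

    `env` is any mapping of environment variables. DEBUG and
    bastion_count are hypothetical names used for illustration only.
    """
    args = ["terraform", "apply"]
    if env.get("DEBUG"):
        # Old approach (removed): load every value from a separate file,
        # which silently overrides AMI IDs and machine sizes:
        #   args.append("-var-file=debug.tfvars")
        # New approach: override only the one variable that enables
        # the Windows Bastion host.
        args += ["-var", "bastion_count=1"]
    return args


if __name__ == "__main__":
    # Print the command that would be run for the current environment.
    print(" ".join(terraform_apply_args(os.environ)))
```

With this shape, everything other than the Bastion count always comes from the normal variable definitions, so a debug run cannot resize the master or swap AMIs by accident.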
Cheers,
Chris