@timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:
- Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machines the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and configure the runner to accept only one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I can see, it only increases the time until all CI results are available.
- Concerning the Vagrant cache: I added this because (at the time at least) the GitHub runners were frequently (~30-50% of the time) getting HTTP 429 Too Many Requests responses from Vagrant Cloud. So the cache improved reliability (and made us better citizens towards Vagrant Cloud) at the time. If that problem is no longer acute, it is safe to remove the cache IMO.
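For context, caching Vagrant boxes in GitHub Actions usually looks roughly like the sketch below. The `path` is Vagrant's default box directory; the key name and `hashFiles` pattern are illustrative and not the values this repo actually uses.

```yaml
# Sketch: reuse downloaded Vagrant boxes across runs so the runner does not
# hit Vagrant Cloud (and its HTTP 429 rate limit) on every job.
# Key name and hash pattern are illustrative, not the repo's real ones.
- name: Cache Vagrant boxes
  uses: actions/cache@v3
  with:
    path: ~/.vagrant.d/boxes
    key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
    restore-keys: |
      vagrant-boxes-
```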
Thanks @sleiner, I am coming to the same conclusion. This was more of a debugging experiment to track down build errors. Our error seems to be a VirtualBox error.
This PR will not be merged; I am just working through some of my suspicions, and I have a few more to chase down.
VirtualBox error:

fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/vagrant", "up", "--no-provision"], "msg": "Failed to start the VM(s): See log file '/Users/runner/.cache/molecule/k3s-ansible/default/vagrant.err'", "rc": 1}

stderr:
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'gurumeditation' state. Please verify everything is configured
properly and try again.

If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run `vagrant up` while the
VirtualBox GUI is open.

The primary issue for this error is that the provider you're using
is not properly configured. This is very rarely a Vagrant issue.
From the error message that you posted:
The guest machine entered an invalid state while waiting for it to boot. Valid states are 'starting, running'. The machine is in the 'gurumeditation' state. Please verify everything is configured properly and try again.
That's a trippy error message if I've ever seen one.
If the provider you're using has a GUI that comes with it, it is often helpful to open that and watch the machine, since the GUI often has more helpful error messages than Vagrant can retrieve. For example, if you're using VirtualBox, run `vagrant up` while the VirtualBox GUI is open.
Hmm, accessing the GUI will not be possible, but maybe we can get the actual VirtualBox logs out?
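One hedged way to do that without GUI access would be to upload the log files as workflow artifacts when a job fails. The glob paths below are guesses based on the error message above, not verified paths from this workflow.

```yaml
# Sketch: collect Vagrant/VirtualBox logs on failure so the 'gurumeditation'
# state can be inspected after the run. The paths are assumptions.
- name: Upload VM logs on failure
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: vm-logs
    path: |
      ~/.cache/molecule/**/vagrant.err
      ~/VirtualBox VMs/**/Logs/*.log
```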
@timothystewart6 btw, if you are testing runner types, it might be worthwhile to check out `macos-13`. You even get 33% more cores (4 instead of 3).
@sleiner good eye! I only saw `latest`/`12` and tested that. I will try to get more logs out of VirtualBox later too!
Ah, `vagrant` doesn't seem to be installed on `macos-13`; might need to `brew` install it. I feel like at that point the runner should just be on `ubuntu`, since that makes a self-hosted version possible at some point (unless I buy a Mac mini).
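If `macos-13` were used, the install step could look roughly like this, assuming the Homebrew casks are available on the runner image (versions are not pinned here):

```yaml
# Sketch: install Vagrant and VirtualBox on a macos-13 runner where they
# are not preinstalled. Cask availability on the image is an assumption.
- name: Install Vagrant and VirtualBox
  run: |
    brew install --cask vagrant
    brew install --cask virtualbox
```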
at that point the runner should just be on ubuntu since it makes a self-hosted version possible at some point (unless I buy a mac mini)
In principle, porting this to Ubuntu is trivial (since Molecule, Vagrant and VirtualBox all work on Ubuntu too). In GitHub Actions, however, we are using macOS because at the time these were the only GitHub-hosted runners with support for nested virtualization. In the meantime, GitHub seems to have added KVM support to large runners (i.e., the ones that are not free to use). So as far as I see, among the free GitHub-hosted runners, macOS is the only viable option.
Yeah, I just read through all of those threads. Seems like enabling it on macOS was an oversight. I wonder if that's why `vagrant` isn't installed on 13, maybe they "fixed the bug". I'll keep plowing through this and figure it out!
Hey
First of all, let me say that I am not familiar with Molecule, but I tried to run the test workflow on my fork and it looks like this is facing some resource starvation.
One of the steps ran without an issue: [link]
Are we provisioning several VMs at the same time? If so, could we split those workloads (at least just to test)? I am getting a Disconnect error, which means that the runner service died. I've had this happen when the host ran out of memory or with high CPU load (we fixed those with a higher nice value, but I doubt that GitHub-managed runners allow that easily).
It could also be the cache, since it only worked the first time.
Hope that helps
According to this, "It's getting a triple fault error, which possibly means disk corruption". So, another vote against the cache.
I think I found the answer: we are running out of resources during CI. Reducing the VM resource allocation seemed to fix it on the first attempt. I have a feeling that sometimes we would get lucky and land on a macOS machine that had enough resources to process our CI; most other times it would not.
Similar issue (it's not just us): https://github.com/actions/runner-images/issues/8730
After that change, it worked on the first try.
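For illustration, "reducing allocation" in a Molecule scenario that uses the Vagrant driver usually means lowering the per-VM memory/CPU settings in `molecule.yml`. The box name and numbers below are placeholders, not the values actually used in this PR.

```yaml
# molecule.yml (excerpt) -- illustrative values only, not the PR's real config
platforms:
  - name: control1
    box: generic/ubuntu2204   # placeholder box name
    memory: 2048              # lowered so all test VMs fit on one runner
    cpus: 1
```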
I am most likely going to add some additional changes to this job.
1) Create a pre-step that caches dependencies. Caching them while the jobs run in parallel is a race condition.
2) Update to the latest macOS runner version. This might mean installing `vagrant` and `virtualbox` ourselves. Details of the image here.
3) I might consider running these in serial to limit runner usage. This does not change anything functionally, but is really about being a good neighbor. Also, if one fails, there's no need to run the others. (A rough shape for 1) and 3) is sketched after this list.)
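A rough shape for items 1) and 3) combined, under the assumption that the scenarios are driven by a matrix; the job names, box, and scenario names below are made up for illustration.

```yaml
# Sketch: warm the box cache once in a pre-job, then run Molecule scenarios
# one at a time so parallel jobs never race on saving the same cache entry.
jobs:
  warm-cache:
    runs-on: macos-12
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v3
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
      - name: Pre-download box (placeholder box name)
        run: vagrant box add generic/ubuntu2204 --provider virtualbox || true

  molecule:
    needs: warm-cache
    runs-on: macos-12
    strategy:
      max-parallel: 1          # run scenarios serially; be a good neighbor
      fail-fast: true          # if one scenario fails, skip the rest
      matrix:
        scenario: [default, ipv6, single_node]   # scenario names are assumptions
    steps:
      - uses: actions/checkout@v4
      - name: Run Molecule
        run: molecule test --scenario-name ${{ matrix.scenario }}
```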
After working on this for about 2 days straight, I think I finally understand what's going on:
https://github.com/actions/runner-images/issues/8642
Turns out nested virtualization was working on some `macos-12` images, but they've slowly been replacing them with new images. This explains why sometimes our builds would pass: we got lucky and landed on one of the old images that support it.
Turns out nested virtualization is only supported on their large runners, which is a paid Teams feature. I will look into the cost, and I will also consider running self-hosted runners. CI for this is pretty important, but I'm not sure I can afford GitHub Team as a solo developer.
I signed up for Teams, converted to Ubuntu, and provisioned a large runner, and I am still seeing the same thing. That thread has so many conflicting comments it's hard to know what's right. Someone also mentioned that after the update to the `ubuntu` runners the other day, it now works on `ubuntu-latest` too. Testing that as well. If this all fails, I will self-host a runner with nested virtualization turned on.
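If the `ubuntu-latest` route ends up relying on KVM (e.g. with a libvirt-based provider rather than VirtualBox), the workaround that keeps coming up in those runner-images threads is opening up `/dev/kvm` permissions before starting any VMs. Treat this as an unverified sketch lifted from that discussion, not something validated for this repo.

```yaml
# Sketch: make /dev/kvm usable by the runner user on ubuntu-latest.
# Only relevant if the VMs actually go through KVM; unverified for this repo.
- name: Enable KVM group permissions
  run: |
    echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' \
      | sudo tee /etc/udev/rules.d/99-kvm4all.rules
    sudo udevadm control --reload-rules
    sudo udevadm trigger --name-match=kvm
```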
Good news, my self-hosted runners are working! I am going to clean up this PR before merging, and I will also update the summary.
Updated PR description to reflect everything that I did. Going to merge after tests pass.
Here it is in action!
Just reporting back, we now have 3 self-hosted runners so jobs can run concurrently
Proposed Changes
I don't even know where to begin
- Nested virtualization no longer works reliably on the `macos-12` image
- `vagrant` and `virtualbox` are not preinstalled on the `macos-13` image
- CI now runs on self-hosted runners on `ubuntu`, not macOS

I also used this PR to rework CI steps and caching. While troubleshooting this CI job, I found that it was better to cache things up front in a pre-step rather than letting each step decide; downstream steps only restore the cache. Also, having all 3 Molecule tests download all VMs and then save all VMs looked like a race condition. Caching them up front and forcing the test jobs to only download ensures that there's no unnecessary save later.
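A minimal illustration of the "downstream steps only restore" idea (the action version, cache key, and the `fail-on-cache-miss` choice are assumptions): the pre-step uses the full `actions/cache` action, which saves on job completion, while the test jobs use the restore-only flavor so they can never write back a box they are still downloading.

```yaml
# Sketch: downstream Molecule jobs restore the box cache but never save it,
# avoiding the race where three parallel jobs all try to save the same key.
- name: Restore Vagrant boxes
  uses: actions/cache/restore@v3
  with:
    path: ~/.vagrant.d/boxes
    key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
    fail-on-cache-miss: true   # assumes the pre-step has already populated it
```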
All of this should work. I am going to merge it shortly after the tests pass (again) so that we can get caught up on PRs
Checklist
- `site.yml` playbook
- `reset.yml` playbook