@timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:
- Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machines the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and configure the runner to accept only one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I can see, it only increases the time until all CI results are available.
- Concerning the Vagrant cache: I added this because (at the time at least) the GitHub runners were frequently (~30-50% of the time) getting HTTP 429 Too Many Requests responses from Vagrant Cloud. So the cache improved reliability (and made us better citizens towards Vagrant Cloud) at the time. If that problem is no longer acute, it is safe to remove the cache IMO.
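For context, caching Vagrant boxes in GitHub Actions usually looks roughly like the sketch below. The `path` is Vagrant's default box directory; the key name and `hashFiles` pattern are illustrative and not the values this repo actually uses.

```yaml
# Sketch: reuse downloaded Vagrant boxes across runs so the runner does not
# hit Vagrant Cloud (and its HTTP 429 rate limit) on every job.
# Key name and hash pattern are illustrative, not the repo's real ones.
- name: Cache Vagrant boxes
  uses: actions/cache@v3
  with:
    path: ~/.vagrant.d/boxes
    key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
    restore-keys: |
      vagrant-boxes-
```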
Thanks @sleiner, I am coming to the same conclusion. This was more of a debugging experiment to track down build errors. Our error seems to be a VirtualBox error.
This PR will not be merged; I am just working through some of my suspicions, and I have a few more to chase down.
VirtualBox error:

fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/vagrant", "up", "--no-provision"], "msg": "Failed to start the VM(s): See log file '/Users/runner/.cache/molecule/k3s-ansible/default/vagrant.err'", "rc": 1}

stderr:
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:38 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
### 2023-10-30 21:06:55 ###
The guest machine entered an invalid state while waiting for it
to boot. Valid states are 'starting, running'. The machine is in the
'gurumeditation' state. Please verify everything is configured
properly and try again.

If the provider you're using has a GUI that comes with it,
it is often helpful to open that and watch the machine, since the
GUI often has more helpful error messages than Vagrant can retrieve.
For example, if you're using VirtualBox, run `vagrant up` while the
VirtualBox GUI is open.

The primary issue for this error is that the provider you're using
is not properly configured. This is very rarely a Vagrant issue.
From the error message that you posted:
The guest machine entered an invalid state while waiting for it to boot. Valid states are 'starting, running'. The machine is in the 'gurumeditation' state. Please verify everything is configured properly and try again.
That's a trippy error message if I've ever seen one.
If the provider you're using has a GUI that comes with it, it is often helpful to open that and watch the machine, since the GUI often has more helpful error messages than Vagrant can retrieve. For example, if you're using VirtualBox, run `vagrant up` while the VirtualBox GUI is open.
Hmm, accessing the GUI will not be possible, but maybe we can get the actual VirtualBox logs out?
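One hedged way to do that without GUI access would be to upload the log files as workflow artifacts when a job fails. The glob paths below are guesses based on the error message above, not verified paths from this workflow.

```yaml
# Sketch: collect Vagrant/VirtualBox logs on failure so the 'gurumeditation'
# state can be inspected after the run. The paths are assumptions.
- name: Upload VM logs on failure
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: vm-logs
    path: |
      ~/.cache/molecule/**/vagrant.err
      ~/VirtualBox VMs/**/Logs/*.log
```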
@timothystewart6 btw, if you are testing runner types, it might be worthwhile to check out `macos-13`. You even get 33% more cores (4 instead of 3).
@sleiner good eye! I only saw `latest`/`12` and tested that. I will try to get more logs out of VirtualBox later too!
Ah, `vagrant` doesn't seem to be installed on `macos-13`; might need to `brew` install it. I feel like at that point the runner should just be on `ubuntu`, since that makes a self-hosted version possible at some point (unless I buy a Mac mini).
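If `macos-13` were used, the install step could look roughly like this, assuming the Homebrew casks are available on the runner image (versions are not pinned here):

```yaml
# Sketch: install Vagrant and VirtualBox on a macos-13 runner where they
# are not preinstalled. Cask availability on the image is an assumption.
- name: Install Vagrant and VirtualBox
  run: |
    brew install --cask vagrant
    brew install --cask virtualbox
```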
at that point the runner should just be on ubuntu since it makes a self-hosted version possible at some point (unless I buy a mac mini)
In principle, porting this to Ubuntu is trivial (since Molecule, Vagrant and VirtualBox all work on Ubuntu too). In GitHub Actions, however, we are using macOS because at the time these were the only GitHub-hosted runners with support for nested virtualization. In the meantime, GitHub seems to have added KVM support to large runners (i.e., the ones that are not free to use). So as far as I see, among the free GitHub-hosted runners, macOS is the only viable option.
Yeah, I just read through all of those threads. Seems like enabling it on macOS was an oversight. I wonder if that's why `vagrant` isn't installed on 13, maybe they "fixed the bug". I'll keep plowing through this and figure it out!
Hey
First of all, let me say that I am not familiar with Molecule, but I tried to run the test workflow on my fork and it looks like this is facing some resource starvation.
One of the steps ran without an issue: [link]
Are we provisioning several VMs at the same time? If so, could we split those workloads (at least just to test)? I am getting a Disconnect error, which means that the runner service died. I've had this happen when the host ran out of memory or with high CPU load (we fixed those with a higher nice value, but I doubt that GitHub-managed runners allow that easily).
It could also be the cache, since it only worked the first time.
Hope that helps
According to this, "It's getting a triple fault error, which possibly means disk corruption". So, another vote against the cache.
I think I found the answer: we are running out of resources during CI. Reducing the VM resource allocation seemed to fix it on the first attempt. I have a feeling that sometimes we would get lucky and land on a macOS machine that had enough resources to process our CI; most other times it would not.
Similar issue (it's not just us): https://github.com/actions/runner-images/issues/8730
After that change, it worked on the first try.
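For illustration, "reducing allocation" in a Molecule scenario that uses the Vagrant driver usually means lowering the per-VM memory/CPU settings in `molecule.yml`. The box name and numbers below are placeholders, not the values actually used in this PR.

```yaml
# molecule.yml (excerpt) -- illustrative values only, not the PR's real config
platforms:
  - name: control1
    box: generic/ubuntu2204   # placeholder box name
    memory: 2048              # lowered so all test VMs fit on one runner
    cpus: 1
```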
I am most likely going to add some additional changes to this job.
1) Create a pre-step that caches dependencies. Caching them while the jobs run in parallel is a race condition.
2) Update to the latest macOS runner version. This might mean installing `vagrant` and `virtualbox` ourselves. Details of the image here.
3) I might consider running these in serial to limit runner usage. This does not change anything functionally, but is really about being a good neighbor. Also, if one fails, there's no need to run the others. (A rough shape for 1) and 3) is sketched after this list.)
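A rough shape for items 1) and 3) combined, under the assumption that the scenarios are driven by a matrix; the job names, box, and scenario names below are made up for illustration.

```yaml
# Sketch: warm the box cache once in a pre-job, then run Molecule scenarios
# one at a time so parallel jobs never race on saving the same cache entry.
jobs:
  warm-cache:
    runs-on: macos-12
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v3
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
      - name: Pre-download box (placeholder box name)
        run: vagrant box add generic/ubuntu2204 --provider virtualbox || true

  molecule:
    needs: warm-cache
    runs-on: macos-12
    strategy:
      max-parallel: 1          # run scenarios serially; be a good neighbor
      fail-fast: true          # if one scenario fails, skip the rest
      matrix:
        scenario: [default, ipv6, single_node]   # scenario names are assumptions
    steps:
      - uses: actions/checkout@v4
      - name: Run Molecule
        run: molecule test --scenario-name ${{ matrix.scenario }}
```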
After working on this for about 2 days straight, I think I finally understand what's going on:
https://github.com/actions/runner-images/issues/8642
Turns out nested virtualization was working on some `macos-12` images, but they've slowly been replacing them with new images. This explains why sometimes our builds would pass: we got lucky and landed on one of the old images that support it.
Turns out nested virtualization is only supported on their large runners, which is a paid Teams feature. I will look into the cost, and I will also consider running self-hosted runners. CI for this is pretty important, but I'm not sure I can afford GitHub Team as a solo developer.
I signed up for Teams, converted to Ubuntu, and provisioned a large runner, and I am still seeing the same thing. That thread has so many conflicting comments it's hard to know what's right. Someone also mentioned that after the update to the `ubuntu` runners the other day, it now works on `ubuntu-latest` too. Testing that as well. If this all fails, I will self-host a runner with nested virtualization turned on.
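If the `ubuntu-latest` route ends up relying on KVM (e.g. with a libvirt-based provider rather than VirtualBox), the workaround that keeps coming up in those runner-images threads is opening up `/dev/kvm` permissions before starting any VMs. Treat this as an unverified sketch lifted from that discussion, not something validated for this repo.

```yaml
# Sketch: make /dev/kvm usable by the runner user on ubuntu-latest.
# Only relevant if the VMs actually go through KVM; unverified for this repo.
- name: Enable KVM group permissions
  run: |
    echo 'KERNEL=="kvm", GROUP="kvm", MODE="0666", OPTIONS+="static_node=kvm"' \
      | sudo tee /etc/udev/rules.d/99-kvm4all.rules
    sudo udevadm control --reload-rules
    sudo udevadm trigger --name-match=kvm
```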
Good news, my self-hosted runners are working! I am going to clean up this PR before merging, and I will also update the summary.
Updated PR description to reflect everything that I did. Going to merge after tests pass.
Here it is in action!
Just reporting back, we now have 3 self-hosted runners so jobs can run concurrently
Proposed Changes
I don't even know where to begin
- Nested virtualization no longer works reliably on the `macos-12` image
- `vagrant` and `virtualbox` are not preinstalled on the `macos-13` image
- CI now runs on self-hosted runners on `ubuntu`, not macOS

I also used this PR to rework CI steps and caching. While troubleshooting this CI job, I found that it was better to cache things up front in a pre-step rather than letting each step decide; downstream steps only restore the cache. Also, having all 3 Molecule tests download all VMs and then save all VMs looked like a race condition. Caching them up front and forcing the test jobs to only download ensures that there's no unnecessary save later.
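A minimal illustration of the "downstream steps only restore" idea (the action version, cache key, and the `fail-on-cache-miss` choice are assumptions): the pre-step uses the full `actions/cache` action, which saves on job completion, while the test jobs use the restore-only flavor so they can never write back a box they are still downloading.

```yaml
# Sketch: downstream Molecule jobs restore the box cache but never save it,
# avoiding the race where three parallel jobs all try to save the same key.
- name: Restore Vagrant boxes
  uses: actions/cache/restore@v3
  with:
    path: ~/.vagrant.d/boxes
    key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
    fail-on-cache-miss: true   # assumes the pre-step has already populated it
```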
All of this should work. I am going to merge it shortly after the tests pass (again) so that we can get caught up on PRs
Checklist
- `site.yml` playbook
- `reset.yml` playbook