Verilog to Routing -- Open Source CAD Flow for FPGA Research
https://verilogtorouting.org

[CI] Investigating CI Runners #2652

Open · AlexandreSinger opened this issue 1 month ago

AlexandreSinger commented 1 month ago

The self-hosted runners were down for the last couple of days and have only now come back up. I wanted to look for anomalies in the CI logs to see whether any of the test cases we are running have issues that may have caused it.

The motivation behind this investigation is this message produced by the CI when the self-hosted runners were not working:

The self-hosted runner: gh-actions-runner-vtr_XX lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I went through the logs of the last working nightly test on the master branch ( https://github.com/verilog-to-routing/vtr-verilog-to-routing/actions/runs/9932067866 ) and here are the results for the jobs run on the self-hosted runners (this data was collected from the figures at the bottom of the logs). I also collected their run time since I thought it may be valuable.

| Job Name | Average CPU Usage (%) | Max RAM Usage (GB) | Max /dev/sda2 Usage (GB) | Max eth0 Data Received (Mb/s) | Max eth0 Data Sent (Mb/s) | Test Run Time |
|---|---|---|---|---|---|---|
| "Capacity" | 100 | 125.82 | 492.0 | - | - | - |
| Run-tests (vtr_reg_nightly_test1, 16) | 32.58 | 6.74 | 32.4 | 577.73 | 69.72 | 2h 24m 25s |
| Run-tests (vtr_reg_nightly_test1_odin, 16, -DWITH_ODIN=ON) | 43.56 | 7.57 | 40.89 | 546.61 | 64.28 | 3h 5m 12s |
| Run-tests (vtr_reg_nightly_test2, 16) | 48.09 | 100.03 | 97.97 | 630.83 | 33.65 | 4h 20m 20s |
| Run-tests (vtr_reg_nightly_test2_odin, 16, -DWITH_ODIN=ON) | 54.27 | 98.35 | 97.88 | 789.33 | 64.23 | 3h 39m 22s |
| Run-tests (vtr_reg_nightly_test3, 16) | 64.54 | 16.66 | 33.2 | 551.44 | 69.02 | 2h 0m 3s |
| Run-tests (vtr_reg_nightly_test3_odin, 16, -DWITH_ODIN=ON) | 45.93 | 11.81 | 39.16 | 789.53 | 44.53 | 3h 9m 4s |
| Run-tests (vtr_reg_nightly_test4, 16) | 44.29 | 53.45 | 49.67 | 789.48 | 41.61 | 3h 15m 17s |
| Run-tests (vtr_reg_nightly_test4_odin, 16, -DWITH_ODIN=ON) | 46.42 | 14.11 | 37.86 | 554.0 | 33.6 | 1h 15m 5s |
| Run-tests (vtr_reg_nightly_test5, 16) | 47.6 | 85.94 | 38.1 | 789.78 | 58.08 | 3h 20m 28s |
| Run-tests (vtr_reg_nightly_test6, 16) | 19.02 | 74.72 | 32.35 | 692.52 | 6.35 | 4h 15m 20s |
| Run-tests (vtr_reg_nightly_test7, 16) | 66.39 | 38.6 | 35.68 | 556.99 | 36.52 | 50m 9s |
| Run-tests (vtr_reg_strong, 16, -DVTR_ASSERT_LEVEL=3, libeigen3-dev) | 42.67 | 8.13 | 5.76 | 507.23 | 64.27 | 15m 10s |
| Run-tests (vtr_reg_strong_odin, 16, -DVTR_ASSERT_LEVEL=3 -DWITH_ODIN=ON, libeigen3-dev) | 31.59 | 7.71 | 32.27 | 582.84 | 50.31 | 19m 52s |
| Run-tests (vtr_reg_strong_odin, 16, -skip_qor, -DVTR_ASSERT_LEVEL=3 -DVTR_ENABLE_SANITIZE=ON -DWI... | 63.43 | 20.03 | 32.31 | 756.76 | 56.57 | 1h 4m 28s |
| Run-tests (vtr_reg_system_verilog, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 29.12 | 8.96 | 32.32 | 789.53 | 12.13 | 22m 18s |
| Run-tests (odin_reg_strong, 16, -DWITH_ODIN=ON) | 7.59 | 17.07 | 15.74 | 286.56 | 12.24 | 1h 1m 3s |
| Run-tests (parmys_reg_strong, 16, -DYOSYS_F4PGA_PLUGINS=ON) | 3.66 | 26.3 | 31.58 | 789.63 | 10.23 | 2h 47m 31s |

The biggest thing that catches my eye is that the RAM usage for some of the tests is very close to (what I believe to be) the capacity of the machine (125 GB). This is a consequence of each job using 16 cores to run its tests in parallel. I doubt this is what caused the problem, since we still have some headroom.
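As a sanity check on those dashboard figures, something like the sketch below could be run on a runner to record the peak memory used while a suite executes. It is only a sketch: the 5-second sampling interval, the `MemTotal - MemAvailable` definition of "used", and the example command are my own choices, not anything our CI currently does.

```python
#!/usr/bin/env python3
"""Poll /proc/meminfo and report peak memory use while a command runs.

Sketch only: the 5 s interval and the MemTotal - MemAvailable definition
of 'used' are assumptions, not what the CI dashboard plots do.
"""
import subprocess
import sys
import time


def used_gib():
    """Return currently-used memory in GiB, as reported by /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are in kiB
    return (info["MemTotal"] - info["MemAvailable"]) / (1024 ** 2)


def main():
    # e.g. ./peak_mem.py ./run_reg_test.py vtr_reg_nightly_test2 -j 16
    # (the wrapped command is just an illustration)
    proc = subprocess.Popen(sys.argv[1:])
    peak = 0.0
    while proc.poll() is None:
        peak = max(peak, used_gib())
        time.sleep(5)
    print(f"peak memory while command ran: {peak:.1f} GiB")
    sys.exit(proc.returncode)


if __name__ == "__main__":
    main()
```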

I also noticed that a few tests take much longer than others. Just something to note down.

My biggest concern is that, since some of these jobs are so close to the limit, changes people make in their PRs while developing may cause issues in the CI. For example, if someone accidentally introduced a memory leak and pushed the code without testing locally, it could bring down the CI. That does not appear to be what happened here, since the last run of the CI succeeded without any such issues.
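One possible mitigation (not something we do today) would be to run each suite under a hard memory cap, so a leaky PR fails its own job instead of taking the whole runner offline. A minimal sketch is below; the 100 GiB figure is an assumed budget under the ~125 GB capacity, and the wrapper itself is hypothetical.

```python
#!/usr/bin/env python3
"""Run a command with a hard cap on its virtual address space.

Sketch only: the 100 GiB cap is an assumed budget below the ~125 GB
machine capacity discussed above, not a value taken from the VTR CI.
"""
import resource
import subprocess
import sys

CAP_BYTES = 100 * 1024 ** 3  # assumed per-job budget


def apply_cap():
    # The limit is inherited by the test process and its children, so a
    # leaky test hits an allocation failure instead of driving the runner
    # into swap or the OOM killer.
    resource.setrlimit(resource.RLIMIT_AS, (CAP_BYTES, CAP_BYTES))


if __name__ == "__main__":
    # e.g. ./capped.py ./run_reg_test.py vtr_reg_nightly_test2 -j 16
    sys.exit(subprocess.run(sys.argv[1:], preexec_fn=apply_cap).returncode)
```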

I wanted to raise this investigation as an issue to see what people think.

AlexandreSinger commented 1 month ago

@soheilshahrouz I know you would find this interesting.

vaughnbetz commented 1 month ago

Thanks @AlexandreSinger. Rebalancing is good to shorten the long pole, so PRs complete CI faster. Once VTR 9 is out (hopefully in a few months), we are likely to deprecate or significantly reduce testing on Odin-II, which would cut our usage a reasonable amount. There may be other efficiencies to be had too ...
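To make the rebalancing idea concrete: the underlying problem is a small bin-packing one, and a greedy longest-processing-time (LPT) pass is the usual quick heuristic. The sketch below just runs that heuristic over the suite-level run times from the table above (rounded to minutes); real rebalancing would move individual tests between suites using per-test run times, which are not reproduced here, and the choice of four bins is arbitrary.

```python
"""Greedy longest-processing-time (LPT) packing of test suites onto bins.

Illustration only: run times are taken from the nightly-run table above
(rounded to minutes); the number of bins is an arbitrary choice.
"""
import heapq

SUITES = [
    ("vtr_reg_nightly_test1", 144), ("vtr_reg_nightly_test1_odin", 185),
    ("vtr_reg_nightly_test2", 260), ("vtr_reg_nightly_test2_odin", 219),
    ("vtr_reg_nightly_test3", 120), ("vtr_reg_nightly_test3_odin", 189),
    ("vtr_reg_nightly_test4", 195), ("vtr_reg_nightly_test4_odin", 75),
    ("vtr_reg_nightly_test5", 200), ("vtr_reg_nightly_test6", 255),
    ("vtr_reg_nightly_test7", 50),
]


def lpt(suites, bins):
    """Assign the longest remaining suite to the currently least-loaded bin."""
    heap = [(0, i, []) for i in range(bins)]  # (load, bin id, assigned suites)
    heapq.heapify(heap)
    for name, minutes in sorted(suites, key=lambda s: -s[1]):
        load, i, assigned = heapq.heappop(heap)
        heapq.heappush(heap, (load + minutes, i, assigned + [name]))
    return sorted(heap)


for load, i, assigned in lpt(SUITES, bins=4):
    print(f"bin {i}: {load} min -> {assigned}")
```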

AlexandreSinger commented 1 month ago

The self-hosted runners are down again. I have been looking into the runs that are failing, and I am noticing that some jobs are requesting servers that do not exist (for example, trying to call self-hosted machine 200+, which does not exist as far as I can tell).

I searched around on Google and found the following post: https://github.com/actions/runner/issues/2040#issuecomment-1367348537

[screenshot of the linked comment]

One thing I noticed is that this person changed the runner group of their self-hosted machine and it fixed the problem. The runner group is what gives the machines their numbers. @vaughnbetz, if this persists into tomorrow, you and I can look into it. I cannot do it on my end since I do not have permission on the VTR repo (also, this is probably something we want to do carefully; it may be easy to remove the runners from the group, but I worry that adding them back may require access to the machines themselves).
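Before changing anything, it would probably help to list what GitHub currently thinks is registered and compare it against the runner names the failing jobs are requesting. Below is a minimal sketch using the standard REST endpoint for repository-level self-hosted runners; it needs a token with admin rights (which is why I cannot run it myself), and if our runners are actually registered at the organization level, the corresponding `/orgs/.../actions/runners` endpoint would be the one to query instead (that detail is an assumption here).

```python
"""List the self-hosted runners GitHub currently has registered for a repo.

Requires a token with admin rights on the repo; whether the VTR runners are
registered at the repo or organization level is an assumption here.
"""
import json
import os
import urllib.request

REPO = "verilog-to-routing/vtr-verilog-to-routing"
TOKEN = os.environ["GITHUB_TOKEN"]

req = urllib.request.Request(
    f"https://api.github.com/repos/{REPO}/actions/runners?per_page=100",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
)
with urllib.request.urlopen(req) as resp:
    runners = json.load(resp)["runners"]

# Compare what is registered against the runner names the failing jobs request.
for r in runners:
    labels = ", ".join(label["name"] for label in r["labels"])
    print(f'{r["name"]:30s} status={r["status"]:8s} busy={r["busy"]} labels=[{labels}]')
```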

AlexandreSinger commented 1 month ago

My running theory is that the runner version is so far behind that it is beginning to have compatibility issues with GitHub. For the last month or so, we have not been able to view the logs within the GitHub UI. We mostly ignored this issue since the logs were still accessible through the test settings; however, it may have been an indication that something is wrong with the runners.

I wonder if GitHub no longer expects anyone to be on a runner version as old as the one we may be using, and we are now facing a full deprecation. That would explain why the self-hosted runners did not work for a couple of days, then worked again for a couple of days, and are now not working again. Perhaps GitHub is making changes behind the scenes and only testing them against recent runner versions.

I am still not sure which version of the GitHub runner we are currently using; all I know is that it must be older than v2.308.0 (since that is the most recent version that produced the error we saw when upgrading the actions in a previous issue).
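To stop guessing, one option is to compare whatever is installed on the machines against the latest actions/runner release. A rough sketch is below; the latest-release lookup uses the public GitHub API, while the install path and the "Version:" line in the runner's `_diag` startup logs are assumptions about how the VTR machines are set up.

```python
"""Compare a self-hosted runner's installed version against the latest release.

The latest-release lookup is the public GitHub API; where the installed
version lives on the VTR machines is an assumption (the runner's _diag
startup logs normally record a 'Version:' line).
"""
import json
import re
import urllib.request
from pathlib import Path

# Latest actions/runner release tag (public endpoint, no auth needed).
with urllib.request.urlopen(
    "https://api.github.com/repos/actions/runner/releases/latest"
) as resp:
    latest = json.load(resp)["tag_name"]

# Assumed install location on the runner machine; adjust to wherever the
# runner was actually unpacked.
diag_dir = Path.home() / "actions-runner" / "_diag"
installed = "unknown"
if diag_dir.is_dir():
    for log in sorted(diag_dir.glob("Runner_*.log"), reverse=True):
        match = re.search(r"Version:\s*([\d.]+)", log.read_text(errors="ignore"))
        if match:
            installed = match.group(1)
            break

print(f"installed runner version: {installed}")
print(f"latest released version : {latest}")
```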

vaughnbetz commented 1 month ago

Thanks. @hzeller @mithro @QuantamHD: we're looking to update the self-hosted runners in the Google Cloud to a later image / GitHub Actions runner version. However, we're not sure where the image is stored or how to update it. Help would be much appreciated!