Revise testing infrastructure to decrease spurious failures

cboettig commented 8 months ago

The testing infrastructure fails to successfully test any cuda images due to disk space limitations. This separates out the testing of cuda images from other scripts.

Tests involving rstudio-daily are also failing continuously due to server issues with the downloads.

Both of these create conditions that guarantee test failure, making it impossible for PRs to satisfy all checks. This either prevents PR contributions or means that PR contributions are merged with some failing tests, neither of which is a good solution.

cboettig commented 8 months ago

@eitsupi I think all the tests in https://github.com/rocker-org/rocker-versioned2/actions/runs/7721419467/workflow?pr=759 run on the same runner? public repo runners for linux have 150 GB. it looks like just most small change triggers the full matrix of tests, and at a the moment that full matrix of tests just can't actually run on the runner, as there isn't space? e.g in this initial test PR which shows 6 of the 38 tests failing. They fail for various reasons, though none are related to the change in the PR, they are either network issues or lack of space.

I think your plans in https://github.com/rocker-org/rocker-versioned2/issues/755 to better leverage cache in the redesign build infrastructure will greatly improve this situation. But meanwhile I think it's difficult to make a meaningful PR against the repo that won't hit failed tests for unrelated issues, especially the disk space error.

I will try testing out some options here for a potentially slimmer test matrix here as a work-around...

eitsupi commented 8 months ago

As I commented elsewhere, the capacity issue should be resolved by removing unnecessary software. For example: https://github.com/rocker-org/rocker-versioned2/blob/26c50e561ae4b10386b9f7adaa37a77b52f7f5d6/.github/workflows/reports.yml#L49-L51

I think this is affected by a change made a while ago that increased the rocker/cuda image's capacity by several gigabytes.

I think reducing tests and merging incorrect changes is just as bad an idea as ignoring and merging tests that randomly fail.

cboettig commented 8 months ago

@eitsupi Thanks for your help here. To be clear, we are entirely on the same page about not reducing testing and not merging incorrect changes. Having unrelated tests fail for unrelated reasons does not reduce incorrect changes. The current testing does not cover most cases anyway, and I'd like to actually add more tests to get better coverage, not less. As I said at the top of this, I'm not seeking to remove tests overall, I'm testing the removal of tests here to try and get a handle on disk use so I can add tests. We are on the same objective here.

I cannot currently fix things that are broken and have been broken for a long time and have never been covered by our tests while PRs are throwing errors that are entirely unrelated to those changes and are also not reflective of problems either in the existing stack or the proposed changes.

I don't understand the solution you are proposing -- that we shrink the size of the images themselves or that we free space from other software on the host runner? You suggested removing arrow libraries I think? As I noted in #756, adding support for tensorflow and torch libraries adds 4 - 5 GB each. If you are aware of unnecessary software that could free enough extra space to test ML libraries then a PR would be awesome. The test images should have 150 GB of disk. We should be able to run tests on the 13 GB base cuda image. While the matrix setup is nice, perhaps it would avoid these issues to have tests handled on different runners?

The current test design is not compatible with testing the large images involved in the machine learning stack. I suggest we move that testing to a separate runner. We may want it on a separate runner anyway so we have the option of self-hosted runner setup to run these images on GPU.

eitsupi commented 8 months ago

Sorry I didn't explain better, but my point was that we can free up space by deleting unnecessary stuff on the runner, which is what Apache Arrow's main repository does thoroughly, and I think the script below does just that. https://github.com/apache/arrow/blob/787afa1594586d2d556d21471647f9cd2c55b18f/ci/scripts/util_free_space.sh

If I remember correctly, all testing is done on a separate runner, so the reason why the test for the ML image fails is simply because that image is too huge.

cboettig commented 8 months ago

In the above edits, I have moved cuda images out of the tests/rocker_scripts/matrix.json, since that runner simply does not have enough space to test anything involving the now 13 GB cuda base image, let alone the potentially larger derivative images.

I've moved cuda into a separate workflow. I initially structured this off the same design as the rocker_scripts test, but even with a single test there it doesn't have enough space. I don't really understand why -- maybe buildkit is using additional space? I rewrote the action to concise simple docker build test, which runs just fine. Anyway I think this is the correct direction to go in -- after all, as I mentioned, I'd like to consider actually testing cuda images on GPU machines with self-hosted runners.

@eitsupi It would be great if you'd like to add those image-size-reducing changes from arrow, I'm reluctant to paste them in here as it adds complexity to the build system and I don't really understand what it is doing. (For instance, I don't see how it can really free 20 GB from every ubuntu 22.04 runner, when according to the GtiHub docs runners on private repos don't even have 20 GB SSDs to begin with...)

Anyway, I'd like to move forward on this so we can get testing unstuck for #756 and so we can start providing the main ML frameworks on our ML-tagged images.

rocker-org / rocker-versioned2

Revise testing infrastructure to decrease spurious failures #759