riscv-collab / riscv-gnu-toolchain

GNU toolchain for RISC-V, including GCC

Free up space in GitHub Actions Runners for remaining jobs #1601

Closed jordancarlin closed 3 weeks ago

jordancarlin commented 3 weeks ago

Attempt to resolve issue #1591 by creating more free space on the runner for the jobs that are failing.

jordancarlin commented 3 weeks ago

@cmuellner @TommyMurphyTM1234 Would be good to get this merged so we can finally get a gcc 14 nightly release

TommyMurphyTM1234 commented 3 weeks ago

A few questions...

  1. What's the rationale for deleting the .NET and Android frameworks?
  2. Why are they installed in the first place?
  3. If they are to be removed then why not with apt uninstall rather than rm?
  4. Do we know that removing them frees up much/enough space?
  5. Why not remove them once at "init" time?
  6. Are there perhaps other things that could/should be removed at "init" time or elsewhere?
  7. What controls how much disk space an action container (?) is allocated?
  8. Why not use something like this? https://github.com/marketplace/actions/maximize-build-disk-space

I realise that these commands were already present in other places in the actions but I didn't understand them there either and was always meaning to ask about them.

cmuellner commented 3 weeks ago

Thanks for the PR!

Are you sure that manually deleting unused distro components is sufficient to address the problem? I.e. Have you reproduced the issue and verified that this fixes it?

If so, then we might consider uninstalling packages.

jordancarlin commented 3 weeks ago

@TommyMurphyTM1234 In this case I just went with what had already been done in the other jobs for this workflow, but I dealt with a more complicated version of this for another project so can provide some context.

GitHub Actions runners are only guaranteed to have 14 GB of free disk space. In practice they tend to have something in the 20-25 GB range free. The actual runners are much larger (close to 75 GB), but much of that is used up by the default container configuration.
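Since the free space varies from image to image, a job could check it up front instead of failing halfway through a download. A minimal sketch, assuming a POSIX shell with `df` and `awk`; the helper names `free_gb` and `require_free_gb` are hypothetical and not part of the workflow:

```shell
# Hypothetical helpers: fail early when a path has less free space than needed.

# free_gb [PATH] -> prints the whole GB available on the filesystem holding PATH
free_gb() {
  # -P forces one POSIX-format line per filesystem; -k reports 1 KiB blocks.
  df -Pk "${1:-/}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# require_free_gb MIN_GB [PATH] -> returns non-zero if less than MIN_GB is free
require_free_gb() {
  local min_gb="$1" path="${2:-/}" avail
  avail="$(free_gb "$path")"
  echo "Free space on ${path}: ${avail} GB (need ${min_gb} GB)"
  [ "$avail" -ge "$min_gb" ]
}
```

A step could then call something like `require_free_gb 25 "$GITHUB_WORKSPACE"` before downloading; the threshold of 25 is only an illustration.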

To answer your questions:

What's the rationale for deleting the .NET and Android frameworks?

These are two of the largest preinstalled items (collectively using 9 GB) and neither is needed for these workflows. Presumably when these jobs were first created they were selected as easy targets to recover space.

Why are they installed in the first place?

See https://github.com/actions/runner-images for details on what comes preinstalled on the GitHub Actions runners. They try to preload most of the software people might need to reduce CI time and avoid the need to install various components every time.

If they are to be removed then why not with apt uninstall rather than rm?

I believe the runner does not install Android with apt, so it must be manually deleted.

Do we know that removing them frees up much/enough space?

All of the artifacts that the failing job is trying to download take up ~25 GB in total. Removing these two components gives us ~30 GB of free space.
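To verify numbers like these on a given image, the size of each candidate directory can be measured before anything is deleted. A rough sketch, assuming `du` and `awk` are available; `report_sizes` is a hypothetical helper, and the paths in the usage note are assumptions that should be checked against the runner image documentation:

```shell
# report_sizes PATH... -> prints the size of each existing path, then a total
report_sizes() {
  local p kb total_kb=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "skip (not present): $p"
      continue
    fi
    # du -sk prints "<KiB><TAB><path>"; keep only the number.
    kb="$(du -sk "$p" 2>/dev/null | awk 'NR == 1 { print $1 }')"
    kb="${kb:-0}"
    printf '%8d MiB  %s\n' "$((kb / 1024))" "$p"
    total_kb=$((total_kb + kb))
  done
  echo "total: $((total_kb / 1024)) MiB"
}
```

For example, `report_sizes /usr/share/dotnet /usr/local/lib/android` (assumed locations) would show how much each directory contributes. On a runner, `du` may need `sudo` to read every file.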

Why not remove them once at "init" time?

Each job is started in a new container, so there is no way to make things persist between them. There is no "init" time that applies to everything in the workflow.

Are there perhaps other things that could/should be removed at "init" time or elsewhere?

I created a script that removes almost all of the preinstalled software for another repo (https://github.com/openhwgroup/cvw/blob/main/.github/cli-space-cleanup.sh). With that script the total available free storage increases to 61 GB. If we want to ensure this isn't an issue in the future we could do something like that to remove more software, but it seems like it is probably unnecessary for this.

What controls how much disk space an action container (?) is allocated?

All GitHub Actions runners are created from the container image linked above and guaranteed to have at least 14 GB of free space. Anything beyond that will fluctuate as the images are updated.

Why not use something like this? https://github.com/marketplace/actions/maximize-build-disk-space

We definitely could, but most of those actions do a lot of other strange things (such as recreating the filesystem to merge in another unused disk) that seem much more likely to break as the image is updated. Just removing software shouldn't fail even if that software were to no longer be included.

cmuellner commented 3 weeks ago

The error log does not tell much. I assume we have not reached the disk space limit of a release's build artifacts, but we have reached the limit of the build machine that does the release step (create a release, download all toolchains, upload toolchains to release).

If my assumption is right, then the issue is that we now have 24 toolchains, and our approach of downloading them all at once and pushing them to the release is not working. Possible solutions: either we move the upload part to the toolchain builders, or we process one toolchain at a time (download from build, upload to release, delete).

jordancarlin commented 3 weeks ago

The error log does not tell much. I assume we have not reached the disk space limit of a release's build artifacts, but we have reached the limit of the build machine that does the release step (create a release, download all toolchains, upload toolchains to release).

If my assumption is right, then the issue is that we now have 24 toolchains, and our approach of downloading them all at once and pushing them to the release is not working. Possible solutions: either we move the upload part to the toolchain builders, or we process one toolchain at a time (download from build, upload to release, delete).

Yes. That is what I see as well. The job that downloads all of them runs out of space. The easiest solution would be to create more space on that runner, but changing the workflow to avoid downloading them all could also work. I'm not sure how to upload to the same release from multiple jobs though.
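The one-toolchain-at-a-time option (download from build, upload to release, delete) could be sketched roughly as below. This assumes the GitHub CLI `gh` is installed and authenticated on the runner, and that each toolchain is stored as a separately named artifact of the current run. The function names are hypothetical and the script has not been tried against this repository's workflow:

```shell
# Hypothetical step functions built on the GitHub CLI.
download_artifact() {  # download_artifact NAME DEST_DIR
  gh run download "$GITHUB_RUN_ID" --name "$1" --dir "$2"
}

upload_asset() {       # upload_asset TAG FILE...
  local tag="$1"
  shift
  gh release upload "$tag" "$@" --clobber
}

# publish_one_at_a_time TAG ARTIFACT_NAME...
# Only one artifact occupies disk space at any time.
publish_one_at_a_time() {
  local tag="$1" name dir status
  shift
  for name in "$@"; do
    dir="$(mktemp -d)" || return 1
    status=0
    { download_artifact "$name" "$dir" && upload_asset "$tag" "$dir"/*; } || status=$?
    rm -rf "$dir"   # free the space before the next download
    if [ "$status" -ne 0 ]; then
      echo "failed: $name" >&2
      return "$status"
    fi
  done
}
```

The peak disk usage is then roughly the size of the largest single toolchain rather than the sum of all of them, at the cost of a serial upload.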

cmuellner commented 3 weeks ago

If they are to be removed then why not with apt uninstall rather than rm?

I believe the runner does not install Android with apt, so it must be manually deleted.

For dotnet this could work: apt remove --purge dotnet*.

For things under /usr/local this is different. We can do the manual remove thing, but then we should justify the path with a comment that references https://github.com/actions/runner-images/blob/main/images/ubuntu/Ubuntu2204-Readme.md

E.g.

Removal of preinstalled Android NDK/SDK components in the image. The installation path is documented here: ...

I did not look into the installation scripts of the images that install Android in the images. There might be a better way to remove it.
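One way to tell which removal route applies to a given path is to ask dpkg whether any package owns it. A small sketch, assuming a Debian-based image where `dpkg-query` may or may not be present; `removal_method` is a hypothetical helper:

```shell
# removal_method PATH -> prints "apt:<owning package(s)>" if dpkg tracks PATH,
# otherwise "manual" (meaning the path would have to be deleted with rm).
removal_method() {
  local path="$1" owner
  if command -v dpkg-query >/dev/null 2>&1 \
     && owner="$(dpkg-query -S "$path" 2>/dev/null | head -n 1)" \
     && [ -n "$owner" ]; then
    # dpkg-query prints "<package list>: <path>"; strip the path part.
    echo "apt:${owner%%: *}"
  else
    echo "manual"
  fi
}
```

Anything reported as `manual` is a candidate for the commented `rm -rf` approach described above; anything reported as `apt:` could be removed through the package manager instead.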

TommyMurphyTM1234 commented 3 weeks ago

@TommyMurphyTM1234 In this case I just went with what had already been done in the other jobs for this workflow, but I dealt with a more complicated version of this for another project so can provide some context.

Thanks @jordancarlin for the explanations. I guess that I don't understand enough about the GitHub actions/runners etc. and need to read up on them a bit. 🙂

jordancarlin commented 3 weeks ago

Maybe the best solution is to create a small script that uninstalls several of the tools (can be a simplified version of the one I linked above) and call that script at the beginning of each job. That way it is centralized in one place and can be easily updated if needed.
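A minimal sketch of what such a shared script might contain, assuming it is sourced or called at the start of each job. `cleanup_paths` and the `CLEANUP_SUDO` variable are hypothetical names, and the paths in the usage note are assumptions to verify against the runner image README:

```shell
# cleanup_paths PATH... -> removes each path that exists and skips the rest,
# so the script keeps working if an image update drops one of the items.
cleanup_paths() {
  local p
  for p in "$@"; do
    if [ -e "$p" ]; then
      echo "removing $p"
      # CLEANUP_SUDO is empty by default; set it to "sudo" on a runner.
      ${CLEANUP_SUDO:-} rm -rf -- "$p"
    else
      echo "skipping $p (not present)"
    fi
  done
}
```

A job could call it as `CLEANUP_SUDO=sudo cleanup_paths /usr/share/dotnet /usr/local/lib/android`, with `df -h` before and after to log the effect. Keeping the path list in one place means a change to the image only needs one update.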

cmuellner commented 3 weeks ago

After looking into the CI/CD script again, I don't think we need to discuss or change this PR further. I'll just merge the change, as we already have the exact same code in our repo: https://github.com/riscv-collab/riscv-gnu-toolchain/blob/master/.github/workflows/nightly-release.yaml#L61

That's enough justification to merge this change.

cmuellner commented 3 weeks ago

Thanks again for the PR, @jordancarlin!

jordancarlin commented 3 weeks ago

Great. Hopefully that'll solve our issues.

TommyMurphyTM1234 commented 3 weeks ago

Is it still failing in spite of this change?

jordancarlin commented 3 weeks ago

Looks like it was a transient network issue that time.