Nightly release CI action is broken

TommyMurphyTM1234 commented 3 weeks ago

See here:

https://github.com/riscv-collab/riscv-gnu-toolchain/releases/tag/2024.10.23

https://github.com/riscv-collab/riscv-gnu-toolchain/actions/runs/11470734480

Nightly Release
Error when evaluating 'strategy' for job 'upload-assets'. .github/workflows/nightly-release.yaml (Line: 199, Col: 15): Matrix must define at least one vector

Only source bundles generated, no binary toolchains.

TommyMurphyTM1234 commented 3 weeks ago

This seems to be the culprit but I don't really understand it yet...

https://github.com/riscv-collab/riscv-gnu-toolchain/blob/935b263c8ef7f250819c74aeb7736c87ad87ef2b/.github/workflows/nightly-release.yaml#L198-L199

TommyMurphyTM1234 commented 3 weeks ago

The last successful nightly release was 3rd September 2024:

https://github.com/riscv-collab/riscv-gnu-toolchain/releases
- https://github.com/riscv-collab/riscv-gnu-toolchain/releases/tag/2024.09.03

so I presume that one of the commits since that date caused the problem?

https://github.com/riscv-collab/riscv-gnu-toolchain/commits/master/.github/workflows/nightly-release.yaml

I hope it's not one of mine! :-)

cmuellner commented 3 weeks ago

I created a PR that should address this issue: https://github.com/riscv-collab/riscv-gnu-toolchain/pull/1592

TommyMurphyTM1234 commented 3 weeks ago

I created a PR that should address this issue: #1592

Thanks @cmuellner. 👍

TommyMurphyTM1234 commented 3 weeks ago

Any idea why the nightly build still doesn't seem to be working or, at least, hasn't completed and uploaded a complete set of built artifacts yet?

cmuellner commented 3 weeks ago

I was waiting for a review of my PRs that address issues in the CI/CD scripts (#1582 and #1592). I just merged them without receiving a review.

TommyMurphyTM1234 commented 3 weeks ago

Still something wrong I guess? Only sources in the latest release again.

Edit: oh, out of disk space? Even though it's supposed to clean up after itself as far as I can see?

Does it maybe need to do more to clean up? Do older release artifacts need to be deleted? Are the changes to enable additional Linux musl and uClibc builds exceeding the available resources?

jordancarlin commented 3 weeks ago

It looks like it is the "create release" job that is running out of space. It downloads all of the artifacts from previous steps, which take up 25 GB but the runner only has 21 GB available. Each job is run on a separate runner, so the space needs to be cleaned up in this job too.

cmuellner commented 2 weeks ago

The CI seems to be regularly (I've observed this multiple times since we added the musl builds into the CI/CD) broken because of git/musl issues:

error: RPC failed; HTTP 504 curl 22 The requested URL returned error: 504
fatal: expected 'packfile'
fatal: clone of 'https://git.musl-libc.org/git/musl' into submodule path '/home/runner/work/riscv-gnu-toolchain/riscv-gnu-toolchain/musl' failed
Failed to clone 'musl' a second time, aborting

I'm not sure what the best way to move forward here is.

TommyMurphyTM1234 commented 2 weeks ago

https://git.musl-libc.org/git/musl

That's the wrong URL as far as I can see:

https://musl.libc.org/
- https://git.musl-libc.org/cgit/musl

Edit: ah - sorry - ignore that...

https://git.musl-libc.org/cgit/musl

mickflemm commented 2 weeks ago

Maybe we are hitting an issue with HTTP (e.g. http.postBuffer is not enough to hold the pack file), does doing a shallow clone solve this ? Yocto at some point used this mirror: https://github.com/kraj/musl/tree/master it seems to be up to date.

Maybe @richfelker can help.

richfelker commented 2 weeks ago

Can you provide a minimal test case to reproduce the failure to git-clone? I just cloned successfully.

richfelker commented 2 weeks ago

FWIW if you're re-cloning on every CI job, the polite thing to do is make it a shallow clone. The more polite thing to do would be to cache git clones. But I don't think this is related to the problem.

TommyMurphyTM1234 commented 2 weeks ago

FWIW if you're re-cloning on every CI job, the polite thing to do is make it a shallow clone.

FWIW that's what this recent PR was intended to deal with but it's closed pending further investigations:

https://github.com/riscv-collab/riscv-gnu-toolchain/pull/1603

The more polite thing to do would be to cache git clones. But I don't think this is related to the problem.

Do you know what this would involve for this repo's actions?

richfelker commented 2 weeks ago

Do you know what this would involve for this repo's actions?

No, I don't. I've really avoided getting into CI workflows myself because I deem them a gigantically irresponsible abuse of resources. So I'm not sure what tooling there is to fix this (avoid downloading the same thing thousands of times), but it's something I very much hope someone is working on.

cmuellner commented 2 weeks ago

I've really avoided getting into CI workflows myself because I deem them a gigantically irresponsible abuse of resources.

I'm in the same camp. However, there is a significant demand for pre-built releases of this repo. The automatic builds, which trigger new releases if new changes are merged, broke in August. Since then, people have regularly reached out as they want them back.

A possible solution is to have a mirror repo on Github, which regularly pulls the changes from the official repo. This reduces the load on upstream git servers.

TommyMurphyTM1234 commented 2 weeks ago

A possible solution is to have a mirror repo on Github, which regularly pulls the changes from the official repo. This reduces the load on upstream git servers.

Another possibility might be to wget source tarballs for those components that have clearly defined releases?

TommyMurphyTM1234 commented 2 weeks ago

In case this helps at all (may belong elsewhere?):

Notes:

For each component the master tarball link is listed along with a link to the version currently used by riscv-gnu-toolchain (except where noted otherwise below)
Where different tarball formats are available the smallest/most compressed option was selected
Not sure what specific llvm tarball is relevant
Ditto for dejagnu
Spike/pk repos don't seem to create regular release/snapshot tarballs so this may be an option for these components yet?

mickflemm commented 2 weeks ago

On one hand if we run CI very often it's indeed a waste of resources, on the other hand it's useful for regression testing (and we can't solve that with a mirror repo btw, we can't check for example pull requests that way, and it's super useful), and it's an even worse waste of resources to have the users of this repo (or their CIs) building this repo again and again. That being said there are a few ways to optimize the flow, here are a few suggestions:

1) Mark all submodules for shallow cloning, this is better than wget IMHO since it'll work for all repos (even for those that don't create release tarballs), and will also be easier to update them. It'll make the build process faster for our users too.

2) I don't know if we can preserve the build environment across builds, that would help a lot since at this point we clone everything on every job. One approach would be to use https://github.com/actions/cache for sharing the cloned repos (or maybe have one job that would clone everything and cache its output for the others) but I haven't tested it (I've seen others using artifact upload/download to share data across jobs).

3) Instead of nigthly releases we can do weekly releases, it's a small change on nightly-release.yaml, it doesn't make much sense to have a release every day, even monthly releases would be fine.

4) For improving the size of the generated toolchains, we could deduplicate files using their hashes and switch them to hardlinks (I've tested it and it works fine), tar will preserve those so when user unpacks it there will also be a benefit, not only for the tarball's size. Then we could switch form gz to a more efficient compression, I use xz and to improve its efficiency further I first create the tarball and then compress it with xz -e -T0 (this is better than tar cvJf since it has the opportunity to create a better dictionary).

TommyMurphyTM1234 commented 2 weeks ago

Then we could switch form gz to a more efficient compression, I use xz

Tarball repositories that I've seen (e.g. see above) suggest that LZ compression may be even better than XZ (at least from a compression perspective, not sure if it's slower?).

cmuellner commented 2 weeks ago

Mark all submodules for shallow cloning, this is better than wget IMHO since it'll work for all repos (even for those that don't create release tarballs), and will also be easier to update them. It'll make the build process faster for our users too.

PR exists (#1605).

I don't know if we can preserve the build environment across builds, that would help a lot since at this point we clone everything on every job. One approach would be to use https://github.com/actions/cache for sharing the cloned repos (or maybe have one job that would clone everything and cache its output for the others) but I haven't tested it (I've seen others using artifact upload/download to share data across jobs).

I also thought of this, but I have zero experience with it. It is hard to get up and running if it cannot be tested locally.

Instead of nigthly releases we can do weekly releases, it's a small change on nightly-release.yaml, it doesn't make much sense to have a release every day, even monthly releases would be fine.

We trigger the build every night, but it will not download/build anything if there were no changes in the last 24 hours.

For improving the size of the generated toolchains, we could deduplicate files using their hashes and switch them to hardlinks (I've tested it and it works fine), tar will preserve those so when user unpacks it there will also be a benefit, not only for the tarball's size. Then we could switch form gz to a more efficient compression, I use xz and to improve its efficiency further I first create the tarball and then compress it with xz -e -T0 (this is better than tar cvJf since it has the opportunity to create a better dictionary).

I will look into this. I usually use --threads=0 -6e for toolchain releases, as this gave the best results when I tested it a few years ago.

Thanks!

TShapinsky commented 2 weeks ago

2. I don't know if we can preserve the build environment across builds, that would help a lot since at this point we clone everything on every job. One approach would be to use https://github.com/actions/cache for sharing the cloned repos (or maybe have one job that would clone everything and cache its output for the others) but I haven't tested it (I've seen others using artifact upload/download to share data across jobs).

I'm working on a branch on my fork on this topic. Not happy with it quite yet. https://github.com/TShapinsky/riscv-gnu-toolchain/pull/2

TShapinsky commented 2 weeks ago

Another way toolchain size can be reduced is if stripped versions of the programs are used. A good portion of the dependencies already support a variant of make install-strip. I did some testing and it reduced the final toolchain output size by more than half (2.6GB->489MB)

mickflemm commented 1 week ago

Another way toolchain size can be reduced is if stripped versions of the programs are used. A good portion of the dependencies already support a variant of make install-strip. I did some testing and it reduced the final toolchain output size by more than half (2.6GB->489MB)

I'll play a bit with install-strip-host, IMHO we shouldn't strip target libraries.

mickflemm commented 1 week ago

BTW we install qemu as part of the toolchain for no reason (users would probably use their distro's qemu package), along with roms etc, we also install both 32/64bit qemu regardless the toolchain. Same goes for dejagnu, they both come from make-report (btw we also clone their repos each time, so I'll include them in the cache too). I'll see if I can clean this up a bit. Any idea if we need amdgpu-arch / nvptx-arch tools on llvm ? Is there a way to disable them (I don't think they make much sense in a cross-compile toolchain) ?

TShapinsky commented 1 week ago

BTW we install qemu as part of the toolchain for no reason (users would probably use their distro's qemu package), along with roms etc, we also install both 32/64bit qemu regardless the toolchain. Same goes for dejagnu, they both come from make-report (btw we also clone their repos each time, so I'll include them in the cache too). I'll see if I can clean this up a bit. Any idea if we need amdgpu-arch / nvptx-arch tools on llvm ? Is there a way to disable them (I don't think they make much sense in a cross-compile toolchain) ?

I think as far as the dependencies installed to make the report, the tarball should probably be created before running make report then it can be uploaded on successful completion of the report step.

Additionally, for fun I put together a workflow which incorporated ccache to speed up repeated compilation. Best case scenario (99% cache hit), it brings those 2 hour llvm builds down to ~35 minutes. And the size of a cache needed to support all build configurations is less than 3GB. https://github.com/TShapinsky/riscv-gnu-toolchain/actions/runs/11677459811

TommyMurphyTM1234 commented 1 week ago

+1 for ccache. I use it all the time:

https://github.com/riscv-collab/riscv-gnu-toolchain/issues/1196#issuecomment-1566717860

I didn't realise that it could maybe also be used in the CI GitHub actions.

mickflemm commented 1 week ago

BTW we install qemu as part of the toolchain for no reason (users would probably use their distro's qemu package), along with roms etc, we also install both 32/64bit qemu regardless the toolchain. Same goes for dejagnu, they both come from make-report (btw we also clone their repos each time, so I'll include them in the cache too). I'll see if I can clean this up a bit. Any idea if we need amdgpu-arch / nvptx-arch tools on llvm ? Is there a way to disable them (I don't think they make much sense in a cross-compile toolchain) ?

I think as far as the dependencies installed to make the report, the tarball should probably be created before running make report then it can be uploaded on successful completion of the report step.

That's also my approach, I also think we should install things on /mnt (the ssd partition that has 14GB guaranteed) instead of /opt but I'm still checking it out, I'll update the pull request during the weekend with more stuff.

Additionally, for fun I put together a workflow which incorporated ccache to speed up repeated compilation. Best case scenario (99% cache hit), it brings those 2 hour llvm builds down to ~35 minutes. And the size of a cache needed to support all build configurations is less than 3GB. https://github.com/TShapinsky/riscv-gnu-toolchain/actions/runs/11677459811

Although I like the idea of using ccache across runs to speed up the process, we complicate the workflow and add yet another thing to debug in case things break, we'll also need to either create one cache per host/build environmment (e.g. one for ubuntu-22.04 and another for ubuntu-24.04), or try to combine them, which may complicate things even further. Finally in order for this to survive across runs we'll need to upload the cache(s) as artifacts wasting storage resources. I checked out your approach and you upload a cache for each build configuration, this doesn't make much sense since the ccache is not used for the target binaries (libc/libgcc etc), but the host binaries. Also if we are going to use a persistent cache across runs it would be better to do it for both the submodule cache and ccache, we could even use the submodule commit hashes to invalidate them in case for example we update the compiler.

TShapinsky commented 1 week ago

I've been optimizing the approach a bit since then. The order of operations is

A single compile cache is downloaded to each worker.
ccache is configured to only use the preprocessor mode with the hash_dir option disabled. This allows cache to be shared between different targets as it ignores which folders things are in, it only looks at the hash of the output of the preprocessor.
Compile happens
ccache stats are checked and cache older than the compile start time is evicted.
If the cache hits are less than 95% (this is arbitrary) the cache is uploaded to an artifact. The artifact is only about 300MB and can have a retention time of 1 day. I don't see this as a misuse of resources.
A post processing job checks if any artifacts have been created, if so it combines with the original cache, cleans up to a max of 3GB, and saves the cache.

we could even use the submodule commit hashes to invalidate them in case for example we update the compiler.

My PR #1607 already does this, it just uses the hash of the git submodule command on the empty repo, which includes all of the different commit hashes.

In the case of any hashing, I think it's probably most effective to only have caches be generated by the master branch, as it is the only one whose cache can be accessed from any other branch.

@mickflemm have you seen this action? https://github.com/easimon/maximize-build-space it is a hack like what is currently being done, but can give you something like 60GB of build space at the cost of 2 seconds of run time. It's what I've been using in my tests.

TommyMurphyTM1234 commented 1 week ago

Although I like the idea of using ccache across runs to speed up the process, we complicate the workflow and add yet another thing to debug in case things break, we'll also need to either create one cache per host/build environmment (e.g. one for ubuntu-22.04 and another for ubuntu-24.04), or try to combine them, which may complicate things even further.

In case it matters/helps, there's another open PR that changes the CI to only build on ubuntu-latest rather than two specific LTS versions:

https://github.com/riscv-collab/riscv-gnu-toolchain/pull/1608#issuecomment-2452982418

@mickflemm Use ubuntu-latest 5ac342c It's better to use ubuntu-latest to track current LTS provided by github, instead of having to manually update it. It also makes more sense to stick to the current LTS than trying to support the older one.

mickflemm commented 1 week ago

I've been optimizing the approach a bit since then. The order of operations is

1. A single compile cache is downloaded to each worker.

2. ccache is configured to only use the preprocessor mode with the hash_dir option disabled. This allows cache to be shared between different targets as it ignores which folders things are in, it only looks at the hash of the output of the preprocessor.

3. Compile happens

4. ccache stats are checked and cache older than the compile start time is evicted.

5. If the cache hits are less than 95% (this is arbitrary) the cache is uploaded to an artifact. The artifact is only about 300MB and can have a retention time of 1 day. I don't see this as a misuse of resources.

6. A post processing job checks if any artifacts have been created, if so it combines with the original cache, cleans up to a max of 3GB, and saves the cache.

ACK will check it out, at this point persistent cache is further down on my todo list, I 'm not there yet.

we could even use the submodule commit hashes to invalidate them in case for example we update the compiler.

My PR #1607 already does this, it just uses the hash of the git submodule command on the empty repo, which includes all of the different commit hashes.

In the case of any hashing, I think it's probably most effective to only have caches be generated by the master branch, as it is the only one whose cache can be accessed from any other branch.

I'd prefer to just use the default branch so that the workflow can be triggered manually, it won't make much of a difference for upstream but it'd help when debugging. Also I'm still not sure the persistent cache idea is ideal, most of the commits in this repo are for bumping up submodule versions, so in most cases we'll be invalidating the cache anyway.

@mickflemm have you seen this action? https://github.com/easimon/maximize-build-space it is a hack like what is currently being done, but can give you something like 60GB of build space at the cost of 2 seconds of run time. It's what I've been using in my tests.

I've freed up to 56GB but it doesn't make much sense, we don't need this much space and we don't utilize /mnt at all, as for the speed of the process at this point most of the time is the rms, when I'm done I'll just wrap those in a script and let them run in the background, there is no reason for the CI to wait for this to finish it can continue the process asap. Regarding the action you mention I'd prefer to not add a dependency for something as simple as doing rm (or even apt remove), it'll be yet another thing to keep an eye for during maintenance.

mickflemm commented 1 week ago

Although I like the idea of using ccache across runs to speed up the process, we complicate the workflow and add yet another thing to debug in case things break, we'll also need to either create one cache per host/build environmment (e.g. one for ubuntu-22.04 and another for ubuntu-24.04), or try to combine them, which may complicate things even further.

In case it matters/helps, there's another open PR that changes the CI to only build on ubuntu-latest rather than two specific LTS versions:
* [Update build workflow #1608 (comment)](https://github.com/riscv-collab/riscv-gnu-toolchain/pull/1608#issuecomment-2452982418)
@mickflemm Use ubuntu-latest 5ac342c It's better to use ubuntu-latest to track current LTS provided by github, instead of having to manually update it. It also makes more sense to stick to the current LTS than trying to support the older one.

I reverted this though since github hasn't switched ubuntu-latest to ubuntu-24.04 yet, however I'm still in favor of this approach, it also affects the cleanup part since on each iteration they install different packages.

TommyMurphyTM1234 commented 1 week ago

I reverted this though since github hasn't switched ubuntu-latest to ubuntu-24.04 yet, however I'm still in favor of this approach, it also affects the cleanup part since on each iteration they install different packages.

Sorry - I failed to realise that it was actually another of your own commits! :-D

TShapinsky commented 1 week ago

ACK will check it out, at this point persistent cache is further down on my todo list, I 'm not there yet.

No hurry or worries. When it's ready I'll open a PR over here and you all can decide if it's something you want or not! :)

mickflemm commented 1 week ago

I added most of what we discussed here, along with some further fixes and optimizations in #1608, I also allowed the submodule cache to persist across runs based on the hash of current state of submodules as mentioned above. The build workflow works as expected, I also updated the nightly build workflow but it needs further updates since the create-release and upload-release-asset actions are deprecated / unmaintained, it works for now (with the PR applied) but it needs more work, I'll come back with another PR for it when I get some time. @cmuellner would you mind merging #1608 ?

mickflemm commented 6 days ago

BTW is it ok if we remove one of the two "nightly" parts from the filename, currently it's "nightly-tag-nightly.tar.gz" ?

As for the ccache approach @TShapinsky how about this: We have another job, not part of build / nightly, that we can trigger once a week or manually to generate a combined ccache for each os environment (combined: we just compile gcc, llvm, binutils etc to populate the cache, no install or anything), it should be less than 4G each, so we have two of them (or one if we just use ubuntu-latest as I suggested) and we save them as normal caches (not as artifacts).

Then the build / nightly process just restores the ccache for its os environment if found and uses it, else it's business as usual. Also if we see things breaking (which is a possibility with ccache), we remove the caches from the list of caches (if they are artifacts it would be more complicated). Even with two caches and the submodule cache, we should be less than 10G, so within the size for storing caches, and if one component changes we'll need to invalidate them all together so they are one batch (although it doesn't make sense to combine them since submodule cache is os independent and we'll duplicate the same thing). We can even have a workflow for init/update of all caches (including submodule cache), that handles rotating etc, and let the build workflow be only a consumer of the caches.

The more I think about using ccache on the build workflow, the more I don't like it, if the cache doesn't exist we need someone to populate it (so others depend on it), when populating the ccache the build process would take longer (since it'll be full of cache misses). If all jobs run in parallel (which is the desired scenario, and it happens often when runners are available), this is quite messy and results the workflow to take longer. Given that often the commits on this repo update submodules (hence the caches would be invalidated), we may not win any time at all, to the contrary we may slow things down with ccache (with the submodule it's different because it's the same across all configurations so it can easily be part of the build workflow). Your approach of having one ccache per configuration makes it less messy on one hand (it doesn't break parallelization, it would stall the workflow though in case the ccache is invalidated), on the other we polute the list of artifacts and there is a lot of duplication in there so it's a different kind of mess.

riscv-collab / riscv-gnu-toolchain

Nightly release CI action is broken #1591