microsoft / rushstack

Monorepo for tools developed by the Rush Stack community
https://rushstack.io/

[rush] rush build fails and doesn't recover if a cache entry is invalid #3454

Closed · kaiyoma closed this issue 2 years ago

kaiyoma commented 2 years ago

Summary

If you're using the rush build cache and an invalid entry gets written, it completely derails all future builds until the offending entry is manually removed.

Repro steps

  1. Enable the local build cache
  2. Connect the cache directory to an external volume (e.g. NFS mount)
  3. Fill up the volume (maybe leave a tiny bit of room)
  4. Run rush build. An empty or incomplete entry will be written, and the build will fail
  5. Run rush build again

At this point, all future builds will fail with this error:

13:02:02  "tar" exited with code 2 while attempting to restore cache entry. Rush will attempt to extract from the cache entry with a JavaScript implementation of tar. See "/home/jenkins/workspace/Geiger/geiger-review/packages/cv-tools/.rush/temp/b61388c431880616de380c9fd2040232dcb886b8.log" for logs from the tar process.

The log file referenced above confirms that the cache entry isn't a valid tarball:

Start time: Fri Jun 03 2022 20:02:02 GMT+0000 (Coordinated Universal Time)
Invoking "/bin/tar -x -m -f /home/jenkins/workspace/Geiger/geiger-review/common/temp/build-cache/b61388c431880616de380c9fd2040232dcb886b8"

======= BEGIN PROCESS INPUT ======

======== END PROCESS INPUT =======
======= BEGIN PROCESS OUTPUT =======
[stderr] /bin/tar: This does not look like a tar archive
[stderr] /bin/tar: Exiting with failure status due to previous errors
======== END PROCESS OUTPUT ========

Exited with code "2"

Details

Rush is correct in saying that the cache entry is invalid, but it seems like a pretty bad bug that it gives up on the entire build. I'm currently trying to implement a global network build cache, so this error is causing all CI tasks to fail, because every task encounters the same corrupt cache entry.

If Rush encounters an invalid entry, it should remove the entry and then follow the steps it normally would: run the full build for that package and then store a new entry in the cache. Otherwise, this error scenario requires human intervention, which is annoying, since it could be easily automated with smarter logic.
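
For illustration, the recovery flow I'm describing might look something like this. The helpers here (`restoreFromCache`, `runFullBuild`, `writeCacheEntry`) are hypothetical stand-ins, not Rush's actual internals:

```typescript
import * as fs from 'fs';

// Hypothetical stand-ins for Rush internals; not the real API.
declare function restoreFromCache(entryPath: string): Promise<void>;
declare function runFullBuild(): Promise<void>;
declare function writeCacheEntry(entryPath: string): Promise<void>;

// Sketch of the proposed recovery flow: on a corrupt entry, delete it,
// run the full build, and repopulate the cache instead of failing.
async function buildWithCacheRecovery(entryPath: string): Promise<void> {
  try {
    await restoreFromCache(entryPath); // cache hit: nothing else to do
    return;
  } catch {
    // Entry is corrupt or unreadable: discard it rather than giving up.
    await fs.promises.rm(entryPath, { force: true });
  }
  await runFullBuild();
  await writeCacheEntry(entryPath);
}
```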

Standard questions

Please answer these questions to help us investigate your issue more quickly:

| Question | Answer |
| --- | --- |
| @microsoft/rush globally installed version? | 5.68.2 |
| rushVersion from rush.json? | 5.68.2 |
| Operating system? | Linux |
| Would you consider contributing a PR? | No |
| Node.js version (node -v)? | 16.13.0 |

dmichon-msft commented 2 years ago

Minimally, we may want to remove the fallback to the JavaScript implementation of tar, since the main difference is that the latter accepts invalid tarballs. Removing that fallback would let Rush fall back to doing a normal build when the tarball is invalid.
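
Roughly, that would mean treating any nonzero exit from the system tar as a cache miss. A sketch of the idea; the tar arguments mirror the invocation from the log above, and the surrounding plumbing is hypothetical:

```typescript
import { spawnSync } from 'child_process';

// Sketch: invoke only the system tar, with no JavaScript fallback.
// A nonzero exit (e.g. code 2 in the log above) is reported as a
// cache miss so the caller can fall back to a normal build.
function tryRestoreWithSystemTar(entryPath: string): boolean {
  const result = spawnSync('/bin/tar', ['-x', '-m', '-f', entryPath]);
  return result.status === 0;
}
```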

kaiyoma commented 2 years ago

That seems reasonable to me. Was there a specific reason that this JS fallback was added? Generally speaking, tar is a pretty reliable tool since it's been around for ages.

elliot-nelson commented 2 years ago

Similar to https://github.com/microsoft/rushstack/issues/2827, although in that case the build didn't fail -- it unpacked "enough" of the bad tarball and continued building, but downstream builds would then fail because files were missing.

In both cases, it seems like the fix is detecting and deleting a corrupted cache entry.

kaiyoma commented 2 years ago

We're occasionally seeing a similar issue, where there's a problem reading the cache entry (not related to tar) and Rush throws up its hands and gives up instead of rebuilding:

12:58:21  ==[ schema-diff ]================================================[ 28 of 30 ]==
12:58:21  File does not exist: /home/jenkins/workspace/Geiger/geiger-review/common/temp/build-cache/736004bb74129990cc872d0fe458a70f667d8741.temp
12:58:21  ENOENT: no such file or directory, stat '/home/jenkins/workspace/Geiger/geiger-review/common/temp/build-cache/736004bb74129990cc872d0fe458a70f667d8741.temp'
12:58:21  "schema-diff" failed to build.

Again, it seems like Rush should be smarter here. A rebuild isn't ideal, of course, but it's a heck of a lot better than our entire CI job failing and us having to retrigger it from scratch.

kaiyoma commented 2 years ago

Any movement or updates here? We hit this problem regularly and it seems like a fix wouldn't be that difficult (simply run a full build if the cache entry can't be restored) and would help out a lot.

kaiyoma commented 2 years ago

FYI, we run into this problem multiple times a week. Somehow, an entry in our (shared) build cache gets corrupted, and causes all subsequent builds to fail like this:

14:27:32  ==[ @arista/cv-components ]======================================[ 33 of 33 ]==
14:30:54  File does not exist: /home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee.temp
14:30:54  ENOENT: no such file or directory, rename '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee.temp' -> '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee'
14:30:54  "@arista/cv-components" failed to build.

We have a tool for finding and deleting invalid cache entries, which we actually run daily, but this problem still pops up and derails our builds.
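
Our tool is internal, but a minimal version of the idea looks roughly like this. It assumes the entries are plain uncompressed tarballs (which matches the /bin/tar invocation in the first log) and checks for the POSIX "ustar" magic:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Minimal scrubber sketch: delete cache entries that are too short or
// missing the POSIX tar magic ("ustar" at byte offset 257 of the header).
function scrubCache(cacheDir: string): void {
  for (const name of fs.readdirSync(cacheDir)) {
    const entryPath = path.join(cacheDir, name);
    if (!fs.statSync(entryPath).isFile()) {
      continue;
    }
    const header = Buffer.alloc(262);
    const fd = fs.openSync(entryPath, 'r');
    const bytesRead = fs.readSync(fd, header, 0, 262, 0);
    fs.closeSync(fd);
    if (bytesRead < 262 || header.toString('ascii', 257, 262) !== 'ustar') {
      console.log(`Deleting invalid cache entry: ${entryPath}`);
      fs.unlinkSync(entryPath);
    }
  }
}

scrubCache('common/temp/build-cache');
```

In practice you'd also want to skip entries that are still being written (e.g. recently modified .temp files), or the scrubber itself could race with in-flight builds.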

octogonz commented 2 years ago

> We have a tool for finding and deleting invalid cache entries, which we actually run daily, but this problem still pops up and derails our builds.

@kaiyoma If you aren't able to contribute a PR, could you provide simpler repro steps (that don't require configuring an NFS partition and exhausting its disk space)?

kaiyoma commented 2 years ago

@octogonz We actually see this problem a lot without exhausting disk space. For some reason, Rush seems to write out corrupt entries periodically on its own.

Repro steps are pretty simple:

  1. rush build
  2. Manually corrupt one of the cache entries (see the sketch after these steps)
  3. rush build

The build will fail because the cache entry is invalid, but Rush won't fall back to a full build to repair the cache.
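
For step 2, something like this will do. It assumes the default local cache location under common/temp; any entry will work:

```typescript
import * as fs from 'fs';
import * as path from 'path';

// Simulate corruption (step 2 above) by overwriting one cache entry
// with garbage. Assumes the default local cache location.
const cacheDir = 'common/temp/build-cache';
const entry = fs.readdirSync(cacheDir)[0];
fs.writeFileSync(path.join(cacheDir, entry), 'this is not a tar archive');
```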

kaiyoma commented 1 year ago

I'm not convinced this is completely fixed. We've upgraded to a version of Rush that includes this fix (5.81.0), and though the problem is less frequent, we still see this a lot:

12:11:43  ==[ @arista/cv-components ]======================================[ 32 of 33 ]==
12:14:04  File does not exist: /home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0.temp
12:14:04  ENOENT: no such file or directory, rename '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0.temp' -> '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0'
12:14:04  "@arista/cv-components" failed to build.

Re-running the build always passes, so it's clearly some transient cache issue. My guess is that multiple builds are running simultaneously and trying to add cache entries at the same time. If this particular rename fails, Rush could check whether the desired cache entry already exists and, if so, just move along without failing.
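
In code, the tolerant behavior I'm suggesting might look like this. A hypothetical sketch; `promoteCacheEntry` is not Rush's actual code:

```typescript
import * as fs from 'fs';

// Sketch: promote a finished .temp file to its final cache name, but
// tolerate losing a race with a concurrent build doing the same rename.
async function promoteCacheEntry(tempPath: string, finalPath: string): Promise<void> {
  try {
    await fs.promises.rename(tempPath, finalPath);
  } catch (error) {
    const code = (error as NodeJS.ErrnoException).code;
    if (code === 'ENOENT' && fs.existsSync(finalPath)) {
      // Another build already promoted its copy; the entry is in place.
      return;
    }
    throw error;
  }
}
```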