Minimally we may want to remove the fallback to the JavaScript implementation of tar, since the main difference is that the latter accepts invalid tarballs. This should then allow Rush to fall back to doing a normal build when the tarball is invalid.
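A minimal sketch of what that could look like, assuming Rush shells out to the system tar binary (the function name and shape here are hypothetical, not Rush's actual internals): any extraction failure is reported as a cache miss rather than retried with a permissive JavaScript tar implementation.

```ts
// Hypothetical sketch: extract with the native "tar" binary only, and
// treat any failure as a cache miss instead of retrying with a JS tar
// implementation that tolerates invalid archives.
import { execFileSync } from 'child_process';

function tryExtractFromCache(tarballPath: string, targetDir: string): boolean {
  try {
    execFileSync('tar', ['-xf', tarballPath, '-C', targetDir], { stdio: 'ignore' });
    return true; // cache hit: outputs restored
  } catch {
    return false; // invalid tarball: report a miss so a normal build runs
  }
}
```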
That seems reasonable to me. Was there a specific reason that this JS fallback was added? Generally speaking, tar is a pretty reliable tool since it's been around for ages.
Similar to https://github.com/microsoft/rushstack/issues/2827, although in that case the build didn't fail -- it unpacked "enough" of the bad tarball and continued building, but would then fail downstream builds because files were missing.
In both cases, it seems like the fix is to detect and delete the corrupted cache entry.
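One way to do that detection (a sketch under the assumption that cache entries are plain tarball files on disk; the helper name is made up): list the archive with tar before trusting it, and delete it on failure so the next build repopulates it.

```ts
// Hypothetical sketch: validate a cache entry before restoring it, and
// delete it if it isn't a readable tarball.
import * as fs from 'fs';
import { execFileSync } from 'child_process';

function isCacheEntryValid(entryPath: string): boolean {
  try {
    // "tar -tf" lists the archive without extracting; it fails loudly
    // on a truncated or corrupt tarball.
    execFileSync('tar', ['-tf', entryPath], { stdio: 'ignore' });
    return true;
  } catch {
    fs.rmSync(entryPath, { force: true }); // corrupt entry: delete it
    return false;
  }
}
```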
We're occasionally seeing a similar issue, whereby there's a problem reading the cache entry (not related to tar), and Rush throws up its hands and gives up instead of rebuilding:
12:58:21 ==[ schema-diff ]================================================[ 28 of 30 ]==
12:58:21 File does not exist: /home/jenkins/workspace/Geiger/geiger-review/common/temp/build-cache/736004bb74129990cc872d0fe458a70f667d8741.temp
12:58:21 ENOENT: no such file or directory, stat '/home/jenkins/workspace/Geiger/geiger-review/common/temp/build-cache/736004bb74129990cc872d0fe458a70f667d8741.temp'
12:58:21 "schema-diff" failed to build.
Again, it seems like Rush should be smarter here. A rebuild isn't ideal, of course, but it's a heck of a lot better than our entire CI job failing and us having to retrigger it from scratch.
Any movement or updates here? We hit this problem regularly and it seems like a fix wouldn't be that difficult (simply run a full build if the cache entry can't be restored) and would help out a lot.
FYI, we run into this problem multiple times a week. Somehow, an entry in our (shared) build cache gets corrupted, and causes all subsequent builds to fail like this:
14:27:32 ==[ @arista/cv-components ]======================================[ 33 of 33 ]==
14:30:54 File does not exist: /home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee.temp
14:30:54 ENOENT: no such file or directory, rename '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee.temp' -> '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/42cf86f3b20735d83818fbf5e3b4aedb249ffeee'
14:30:54 "@arista/cv-components" failed to build.
We have a tool for finding and deleting invalid cache entries, which we actually run daily, but this problem still pops up and derails our builds.
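For reference, a cleanup job along those lines might look like the following sketch (this is not our actual tool, and the cache directory path is an assumption based on the logs above): it removes leftover .temp files from interrupted writes and deletes any entry that isn't a readable tarball.

```ts
// Hypothetical sketch of a daily cleanup job for a shared on-disk cache.
import * as fs from 'fs';
import * as path from 'path';
import { execFileSync } from 'child_process';

const cacheDir = '/mnt/shared/build-cache'; // assumption: shared cache location

for (const name of fs.readdirSync(cacheDir)) {
  const entryPath = path.join(cacheDir, name);
  if (name.endsWith('.temp')) {
    fs.rmSync(entryPath, { force: true }); // leftover from an interrupted write
    continue;
  }
  try {
    execFileSync('tar', ['-tf', entryPath], { stdio: 'ignore' });
  } catch {
    console.log(`Deleting corrupt cache entry: ${name}`);
    fs.rmSync(entryPath, { force: true });
  }
}
```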
@kaiyoma If you aren't able to contribute a PR, could you provide simpler repro steps (that don't require configuring an NFS partition and exhausting its disk space)?
@octogonz We actually see this problem a lot without exhausting disk space. For some reason, Rush seems to write out corrupt entries periodically on its own.
Repro steps are pretty simple:
1. rush build
2. rush build
The build will fail because the cache entry is invalid, but a full build won't be done to fix the cache.
I'm not convinced this is completely fixed. We've upgraded to a version of Rush with this fix (5.81.0) and though the problem is better, we still see this a lot:
12:11:43 ==[ @arista/cv-components ]======================================[ 32 of 33 ]==
12:14:04 File does not exist: /home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0.temp
12:14:04 ENOENT: no such file or directory, rename '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0.temp' -> '/home/jenkins/workspace/Geiger/geiger-merge/common/temp/build-cache/2e6ce6b868ae4a277db20f881fc977d34ccd6ad0'
12:14:04 "@arista/cv-components" failed to build.
Re-running the build always passes, so it's clearly some temporary cache issue. My guess is that multiple builds are running simultaneously and trying to add cache entries at the same time. If this particular rename fails, Rush could check if the desired cache entry already exists, and if so, just move along without failing.
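Sketched out, that tolerant rename might look like this (the function is hypothetical, not Rush's actual code):

```ts
// Hypothetical sketch: when publishing a cache entry, tolerate a failed
// rename if another concurrent build already published the same entry.
import * as fs from 'fs';

function finalizeCacheEntry(tempPath: string, finalPath: string): void {
  try {
    fs.renameSync(tempPath, finalPath);
  } catch (error) {
    // Another build may have won the race and already created finalPath
    // (and possibly cleaned up the temp file). Treat that as success.
    if (fs.existsSync(finalPath)) {
      return;
    }
    throw error;
  }
}
```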
Summary
If you're using the rush build cache and an invalid entry gets written, it completely derails all future builds until the offending entry is manually removed.
Repro steps
1. Run rush build. An empty or incomplete entry will be written, and the build will fail.
2. Run rush build again.
At this point, all future builds will fail with this error:
The log file referenced above confirms that the cache entry isn't a valid tarball:
Details
Rush is correct in saying that the cache entry is invalid, but it seems like a pretty bad bug that it gives up on the entire build. I'm currently trying to implement a global network build cache, so this error is causing all CI tasks to fail, because every task encounters the same corrupt cache entry.
If Rush encounters an invalid entry, it should remove the entry and then follow the steps it normally would: run the full build for that package and then store a new entry in the cache. Otherwise, this error scenario requires human intervention, which is annoying, since it could be easily automated with smarter logic.
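In rough TypeScript, the proposed flow would look something like this (all helper names are hypothetical stand-ins, not Rush's real internals):

```ts
import * as fs from 'fs';

// Hypothetical stand-ins for Rush's real cache/build operations:
declare function tryRestoreFromCache(entryPath: string): Promise<boolean>;
declare function runFullBuild(project: string): Promise<void>;
declare function writeCacheEntry(project: string, entryPath: string): Promise<void>;

async function buildWithCache(project: string, entryPath: string): Promise<void> {
  if (await tryRestoreFromCache(entryPath)) {
    return; // valid cache entry restored: skip the build
  }
  // Missing or invalid entry: remove it so it can't poison later builds,
  // run the normal full build, and store a fresh entry in the cache.
  await fs.promises.rm(entryPath, { force: true });
  await runFullBuild(project);
  await writeCacheEntry(project, entryPath);
}
```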
Standard questions
Please answer these questions to help us investigate your issue more quickly:
- @microsoft/rush globally installed version?
- rushVersion from rush.json?
- Node.js version (node -v)?