Closed bradlarsen closed 4 months ago
I was able to reproduce locally!
I ran the following in a loop on a 14-core M3 MacBook Pro, and after 99 attempts, saw a similar error message:
$ i=0; while true; do ((i = i + 1)) && echo "### take2.$i ###" && rm -rf appmaker.np && cargo run -F release -r -q -- scan -d appmaker.np appmaker.git -j20; done
...
### take2.99 ###
2024-04-26T17:05:48.307767Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 0413e897ddb651d58e86dc2363c631b6b7909820 from Git repository at appmaker.git: An object with id 0413e897ddb651d58e86dc2363c631b6b7909820 could not be found
Scanned 549.96 MiB from 7,925 blobs in 0 seconds (2.09 GiB/s); 4/4 new matches
Rule                    Total Findings   Total Matches
──────────────────────────────────────────────────────
Generic Secret                3                3
AWS Secret Access Key         1                1
Run the `report` command next to show finding details.
...
This is difficult to reproduce. I've had to run hundreds or thousands of times in a loop to see that error message again.
Interestingly, the `appmaker.git` clone (which I obtained with `git clone --bare https://github.com/mozilla-appmaker/appmaker appmaker.git`) contains only a single packfile:
% tree appmaker.git
appmaker.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
│       ├── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.idx
│       └── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.pack
├── packed-refs
└── refs
    ├── heads
    └── tags

9 directories, 20 files
The version of `git` on my laptop, where I've been able to reproduce this very rarely:
$ git --version
git version 2.39.3 (Apple Git-146)
@Byron FYI, there appears to be a race condition somewhere in `gix`, either in its reading code or in a caching layer somewhere, which I can intermittently trigger in Nosey Parker.
Specifically, when scanning a Git repository, Nosey Parker first enumerates all object IDs, building a vector of blob IDs. Then in a later stage, that vector of blob IDs is used in a Rayon pipeline to read each blob and scan it, in parallel. At the rate of around one in every few hundred attempts when using multiple worker threads, I see `gix` failing to read one of those blob objects whose ID was previously enumerated.
Last week I tried to isolate where the behavior is coming from by disabling all caching mechanisms in `gix`, but that didn't make the problem go away. Either it's in the core ODB reading code, or (more likely) I failed to effectively disable all of the `gix` caching mechanisms.
This is not a super high priority Nosey Parker issue for me right now, but I wanted to make you aware of it. I'm happy to do further investigation if you have suggestions to better debug this. Thank you!
Thanks so much for letting me know!
Bugs in the ODB portion of the `gix` codebase are among my most feared, as they will be spurious, hard to reproduce, and typically hard to fix (after all, who says they are not just much less likely than before?).
I agree that it's most definitely a race condition where somehow a thread won't notice that a new pack-index file has become available that contains the object, probably because another thread has just loaded it.
My plan here is to write a program to hopefully reproduce the issue, ideally with a repository with a lot of pack files to make its occurrence more likely. From there, it should be possible to fix it or at least improve the situation.
@Byron Thank you! I could look at this more next week. Let me know if I can help.
Thank you! I am slowly getting back into it and noticed something interesting: the code to get the initial set of objects also uses a parallel implementation that calls `odb.header()`, which is when multiple threads trigger loading all the packs (racily) right after the indices have been pre-loaded to support the iteration. From what I can tell, this never hits a race condition, probably because indices are pre-loaded here.
After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero. Now `find_object()` runs into the problem occasionally. What I am saying is that, in theory, re-using the same repository instance that was used for counting could possibly fix the issue.
Chances are that the underlying issue is somewhere in the code that dynamically loads indices, as all of a sudden it fails to find an index with the object, at least in one thread (maybe another thread did successfully load it, or is in the process of doing so, and somehow the first thread doesn't get to wait for that).
Hopefully I will have a repro soon.
After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero.
Yes, you're right: Nosey Parker enumerates repos in one phase, and then in the next phase reopens each repo and reads the blobs from it. This re-opening is done to avoid running out of memory, as otherwise it might need to keep thousands of repos open at once.
I intend to rework this scheme eventually, to start scanning immediately rather than first enumerating everything and then scanning it. But that's a medium-term project for me.
Great news! I could reproduce the issue and can do so consistently. Having fewer packs seems to be the key here, and from what I can tell, this situation puts threads into a waiting loop; my guess is that something there goes wrong so they don't actually pick up newly loaded indices.
❯ hyperfine -m100 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup' --show-output
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
18:04:54 counting done 41.4k objects in 0.01s (4.4M objects/s)
18:04:54 tracing ERROR 🚨 [error]: Object that is known to be present wasn't found | err: NotFound { oid: Sha1(2558c83eb8f313829b7573f70081c2c7ab1581c4) }
18:04:54 re-counting done 41.4k objects in 0.10s (394.8k objects/s)
18:04:54 tracing INFO run [ 116ms | 98.39% / 100.00% ]
18:04:54 tracing INFO ├─ ThreadSafeRepository::discover() [ 1.59ms | 0.09% / 1.37% ]
18:04:54 tracing INFO │  └─ open_from_paths() [ 1.48ms | 1.22% / 1.28% ]
18:04:54 tracing INFO │     └─ gix_odb::Store::at() [ 70.0µs | 0.06% ]
18:04:54 tracing INFO └─ ThreadSafeRepository::open() [ 277µs | 0.06% / 0.24% ]
18:04:54 tracing INFO    └─ open_from_paths() [ 208µs | 0.15% / 0.18% ]
18:04:54 tracing INFO       └─ gix_odb::Store::at() [ 38.2µs | 0.03% ]
Only JSON is implemented - using that instead
Error: At least one object couldn't be looked up even though it must exist
Error: Command terminated with non-zero exit code: 1. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
With the latest modification, I cannot reproduce the issue anymore even after a thousand runs, where previously it would trigger after less than 50 iterations usually.
❯ hyperfine -m1000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     124.9 ms ±   2.2 ms    [User: 157.1 ms, System: 15.8 ms]
  Range (min … max):   116.3 ms … 137.4 ms    1000 runs
The only unfortunate thing here is that I wasn't able to fix it in a way that avoids re-reading the state on disk to pick up newly added packs, which shouldn't be needed here. On the bright side, it's quite a rare issue in the first place, so this inefficiency probably doesn't matter much for most users.
After a little more 'tuning', I think this issue is a thing of the past, as it can't be reproduced anymore even in 10,000 runs.
❯ hyperfine -m10000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     130.4 ms ±   2.5 ms    [User: 165.3 ms, System: 16.6 ms]
  Range (min … max):   113.8 ms … 173.8 ms    10000 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
It's not impossible that more races will be discovered, maybe in long-running servers, but we'll get there when we get there; the trampling-herd problem here is now certainly resolved.
Thank you @Byron for the speedy response! I will give the main branch of gitoxide a try in Nosey Parker.
Gix v0.63 was just released, and it should fix this issue for good :).
Yes, upgrading gix from 0.62 to 0.63 seems to fix this problem for me. Thank you @Byron!
Describe the bug
Sometimes, rarely, the `scan::appmaker::scan_workflow_from_git_url` test case fails.

To Reproduce
🥴

Expected behavior
The test case should always succeed.

Actual behavior
See this CI run log:

Output of `noseyparker --version`

Additional context
I think I've only ever seen this in GitHub Actions CI, though I'm not certain. I suspect there is some race condition in the `gix` library.