Closed bradlarsen closed 4 months ago
I was able to reproduce locally!
I ran the following in a loop on a 14-core M3 MacBook Pro, and after 99 attempts, saw a similar error message:
$ i=0; while true; do ((i = i + 1)) && echo "### take2.$i ###" && rm -rf appmaker.np && cargo run -F release -r -q -- scan -d appmaker.np appmaker.git -j20; done
...
### take2.99 ###
2024-04-26T17:05:48.307767Z ERROR noseyparker_cli::cmd_scan: Failed to read blob 0413e897ddb651d58e86dc2363c631b6b7909820 from Git repository at appmaker.git: An object with id 0413e897ddb651d58e86dc2363c631b6b7909820 could not be found
Scanned 549.96 MiB from 7,925 blobs in 0 seconds (2.09 GiB/s); 4/4 new matches
Rule                    Total Findings   Total Matches
──────────────────────────────────────────────────────
Generic Secret                3                3
AWS Secret Access Key         1                1
Run the `report` command next to show finding details.
...
This is difficult to reproduce. I've had to run hundreds or thousands of times in a loop to see that error message again.
Interestingly, the `appmaker.git` clone (which I obtained with `git clone --bare https://github.com/mozilla-appmaker/appmaker appmaker.git`) contains only a single packfile:
% tree appmaker.git
appmaker.git
├── HEAD
├── config
├── description
├── hooks
│   ├── applypatch-msg.sample
│   ├── commit-msg.sample
│   ├── fsmonitor-watchman.sample
│   ├── post-update.sample
│   ├── pre-applypatch.sample
│   ├── pre-commit.sample
│   ├── pre-merge-commit.sample
│   ├── pre-push.sample
│   ├── pre-rebase.sample
│   ├── pre-receive.sample
│   ├── prepare-commit-msg.sample
│   ├── push-to-checkout.sample
│   └── update.sample
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
│       ├── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.idx
│       └── pack-fd9127ec007dda804dc00ea7a61b1cfa0ba55921.pack
├── packed-refs
└── refs
    ├── heads
    └── tags

9 directories, 20 files
The version of `git` on my laptop, where I've been able to reproduce this very rarely:
$ git --version
git version 2.39.3 (Apple Git-146)
@Byron FYI, there appears to be a race condition somewhere in `gix`, either in its reading code or in a caching layer somewhere, which I can intermittently trigger in Nosey Parker.
Specifically, when scanning a Git repository, Nosey Parker first enumerates all object IDs, building a vector of blob IDs. Then in a later stage, that vector of blob IDs is used in a Rayon pipeline to read each blob and scan it, in parallel. At the rate of around one in every few hundred attempts when using multiple worker threads, I see `gix` failing to read one of those blob objects whose ID was previously enumerated.
Last week I tried to isolate where the behavior is coming from by disabling all caching mechanisms in `gix`, but that didn't make the problem go away. Either it's in the core ODB reading code, or (more likely) I failed to effectively disable all of the `gix` caching mechanisms.
This is not a super high priority Nosey Parker issue for me right now, but I wanted to make you aware of it. I'm happy to do further investigation if you have suggestions to better debug this. Thank you!
Thanks so much for letting me know!
Bugs in the ODB portion of the `gix` codebase are among my most feared, as they will be spurious, hard to reproduce, and typically hard to fix (after all, who says they are not just much less likely than before?).
I agree that it's most definitely a race condition where somehow a thread won't notice that a new pack-index file has become available that contains the object, probably because another thread has just loaded it.
My plan here is to write a program to hopefully reproduce the issue, ideally with a repository with a lot of pack files to make its occurrence more likely. From there, it should be possible to fix it or at least improve the situation.
@Byron Thank you! I could look at this more next week. Let me know if I can help.
Thank you! I am slowly getting back into it and noticed something interesting: the code to get the initial set of objects also uses a parallel implementation that calls `odb.header()`, which is when multiple threads trigger loading all the packs (racily) right after the indices have been pre-loaded to support the iteration. From what I can tell, this never hits a race condition, probably because indices are pre-loaded here.
After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero. Now `find_object()` runs into the problem occasionally. What I am saying is that, in theory, re-using the same repository instance that was used for counting could possibly fix the issue.
Chances are that the underlying issue is somewhere in the code that dynamically loads indices, as all of a sudden it fails to find an index with the object, at least in one thread (maybe another thread did successfully load it, or is in the process of doing so, and somehow the first thread doesn't get to wait for that).
Hopefully I will have a repro soon.
After the initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened so it starts at zero.
Yes, you're right: Nosey Parker enumerates repos in one phase, and then in the next phase reopens each repo and reads the blobs from it. This re-opening is done to avoid running out of memory, as otherwise it might need to keep thousands of repos open at once.
I intend to rework this scheme eventually, to start scanning immediately rather than first enumerating everything and then scanning it. But that's a medium-term project for me.
Great news! I could reproduce the issue and can do so consistently. Having fewer packs seems to be the key here, and from what I can tell, this situation puts threads into a waiting loop; my guess is that something there goes wrong so they don't actually pick up newly loaded indices.
❯ hyperfine -m100 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup' --show-output
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
18:04:54 counting done 41.4k objects in 0.01s (4.4M objects/s)
18:04:54 tracing ERROR 🚨 [error]: Object that is known to be present wasn't found | err: NotFound { oid: Sha1(2558c83eb8f313829b7573f70081c2c7ab1581c4) }
18:04:54 re-counting done 41.4k objects in 0.10s (394.8k objects/s)
18:04:54 tracing INFO run [ 116ms | 98.39% / 100.00% ]
18:04:54 tracing INFO ├─ ThreadSafeRepository::discover() [ 1.59ms | 0.09% / 1.37% ]
18:04:54 tracing INFO │  └─ open_from_paths() [ 1.48ms | 1.22% / 1.28% ]
18:04:54 tracing INFO │     └─ gix_odb::Store::at() [ 70.0µs | 0.06% ]
18:04:54 tracing INFO └─ ThreadSafeRepository::open() [ 277µs | 0.06% / 0.24% ]
18:04:54 tracing INFO    └─ open_from_paths() [ 208µs | 0.15% / 0.18% ]
18:04:54 tracing INFO       └─ gix_odb::Store::at() [ 38.2µs | 0.03% ]
Only JSON is implemented - using that instead
Error: At least one object couldn't be looked up even though it must exist
Error: Command terminated with non-zero exit code: 1. Use the '-i'/'--ignore-failure' option if you want to ignore this. Alternatively, use the '--show-output' option to debug what went wrong.
With the latest modification, I cannot reproduce the issue anymore even after a thousand runs, where previously it would trigger after less than 50 iterations usually.
❯ hyperfine -m1000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     124.9 ms ±   2.2 ms    [User: 157.1 ms, System: 15.8 ms]
  Range (min … max):   116.3 ms … 137.4 ms    1000 runs
The only unfortunate thing here is that I wasn't able to fix it in a way that avoids re-reading the state on disk to pick up newly added packs, which shouldn't be needed here. On the bright side, it's quite a rare issue in the first place, so this inefficiency probably doesn't matter much for most users.
After a little more 'tuning', I think this issue is a thing of the past, as it can't be reproduced anymore even in 10,000 runs.
❯ hyperfine -m10000 './target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup'
Benchmark 1: ./target/release/gix -r boto -t10 --trace odb stats --extra-header-lookup
  Time (mean ± σ):     130.4 ms ±   2.5 ms    [User: 165.3 ms, System: 16.6 ms]
  Range (min … max):   113.8 ms … 173.8 ms    10000 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
It's not impossible that more races will be discovered, maybe in long-running servers, but we'll get there when we get there; the trampling-herd problem here is now certainly resolved.
Thank you @Byron for the speedy response! I will give the main branch of gitoxide a try in Nosey Parker.
Gix v0.63 was just released, and it should fix this issue for good :).
Yes, upgrading gix from 0.62 to 0.63 seems to fix this problem for me. Thank you @Byron!
Describe the bug
Sometimes, rarely, the `scan::appmaker::scan_workflow_from_git_url` test case fails.

To Reproduce
🥴

Expected behavior
The test case should always succeed.

Actual behavior
See this CI run log:

Output of `noseyparker --version`

Additional context
I think I've only ever seen this in GitHub Actions CI, though I'm not certain. I suspect there is some race condition in the `gix` library.