praetorian-inc / noseyparker

Nosey Parker is a command-line program that finds secrets and sensitive information in textual data and Git history.
Apache License 2.0
1.56k stars 77 forks source link

Download and patch Vectorscan sources instead of vendoring them #151

Closed seqre closed 3 months ago

seqre commented 3 months ago

As we discussed internally, I implemented a solution of downloading Vectorscan sources when missing and applying a patch file with our custom changes instead of permanently storing the patched version in the repository. Implements #71.

It works by downloading Vectorscan sources from GitHub archives, decompressing, and unarchiving to disk. Then, it applies our custom .patch file. At the end of each phase, it writes relevant data into a .version file to keep track of completed changes so they don't need to be repeated on each build. I'm aware it's pretty simple validation, but it works pretty well.

seqre commented 3 months ago

The Docker image builds just fine on Debian, but I have no idea why the Alpine version stopped working out of the blue.

  = note: /usr/lib/gcc/x86_64-alpine-linux-musl/12.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lssl: No such file or directory
          /usr/lib/gcc/x86_64-alpine-linux-musl/12.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcrypto: No such file or directory
          /usr/lib/gcc/x86_64-alpine-linux-musl/12.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lz: No such file or directory
          collect2: error: ld returned 1 exit status

My best bet by looking at the error is the assumption that libraries responsible for handling .tar.gz (flate2 and tar) might require some additional system libraries, but their documentation doesn't mention anything. I tried installing a bunch of different dev packages in the container, but it didn't change anything.

bradlarsen commented 3 months ago

I'll take a look at it later. In my experience, that sort of failure is business as usual when dealing with C++ dependencies...

seqre commented 3 months ago

Just as a sanity check, I changed the code to download the ZIP file and extract it using zip_extract to see if there would be any difference (it's actually cleaner solution tbh, it doesn't require renaming extracted directory).

let response = reqwest::blocking::get(format!("https://github.com/VectorCamp/vectorscan/archive/refs/tags/vectorscan/{VERSION}.zip")).expect("Could not download Vectorscan source files");
let bytes = response.bytes().expect("Could not read Vectorscan zip file");
zip_extract::extract(Cursor::new(bytes), &vectorscan_src_dir, true).expect("Could not extract Vectorscan source files");

Unfortunately, it ended up being even worse as it couldn't find more libraries (with me attempting to download those):

36.00   = note: /usr/lib/gcc/x86_64-alpine-linux-musl/13.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lbz2: No such file or directory
36.00           /usr/lib/gcc/x86_64-alpine-linux-musl/13.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lssl: No such file or directory
36.00           /usr/lib/gcc/x86_64-alpine-linux-musl/13.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lcrypto: No such file or directory
36.00           /usr/lib/gcc/x86_64-alpine-linux-musl/13.2.1/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find -lz: No such file or directory
36.00           collect2: error: ld returned 1 exit status
bradlarsen commented 3 months ago

Yes, this behavior is odd. One would expect that extracting a tarball and applying the patches would give the same behavior as the current vendored approach, as the vectorscan source directory trees should be the same in either case.

I will take a closer look at this PR this week.