Add commits range for `scan` in Git repositories

Coruscant11 commented 1 year ago

Hi :wave:

A great option in a secret scanner is the ability to scan a range of commits, for example by adding an option to `scan`.

In my case, we use scanners on very large repositories. Once findings have been reported, future runs have no need to rescan previously scanned commits. Only new commits are relevant. This saves a lot of time in large repositories.

Gitleaks has this feature, and TruffleHog does too.

For example, a `since_commits` option would scan between a specific commit and HEAD. And why not an `until_commits` option as well?

Do you see any blocking issues for this enhancement?

:smile:

bradlarsen commented 1 year ago

Hi @Coruscant11. This is a good use case and a feature that would be nice to have.

One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)

To add this feature to Nosey Parker, we would need to add some alternative Git enumeration mechanism that would walk commit-by-commit and only select blobs reachable from the desired set of commits. The current source for Git repo enumeration is here.
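To make the idea of "blobs reachable from the desired set of commits" concrete, here is a minimal sketch over a toy in-memory commit graph. It is not Nosey Parker's enumerator: the `REPO` structure, `ancestors`, and `blobs_in_range` are hypothetical names invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical toy model: each commit lists its parents and the blob IDs
# referenced by its tree. This is NOT Nosey Parker's data model.
Repo = Dict[str, Dict[str, List[str]]]


def ancestors(repo: Repo, tip: str) -> Set[str]:
    """Return `tip` plus every commit reachable from it via parent links."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        stack.extend(repo[commit_id]["parents"])
    return seen


def blobs_in_range(repo: Repo, since: str, until: str) -> Set[str]:
    """Blobs reachable from `until` that are not reachable from `since`."""
    old_commits = ancestors(repo, since)
    new_commits = ancestors(repo, until) - old_commits
    old_blobs = {b for c in old_commits for b in repo[c]["blobs"]}
    new_blobs = {b for c in new_commits for b in repo[c]["blobs"]}
    return new_blobs - old_blobs


REPO: Repo = {
    "c1": {"parents": [], "blobs": ["readme-v1"]},
    "c2": {"parents": ["c1"], "blobs": ["readme-v1", "config-v1"]},
    "c3": {"parents": ["c2"], "blobs": ["readme-v2", "config-v1"]},
    "c4": {"parents": ["c3"], "blobs": ["readme-v2", "config-v1", "key.pem"]},
}

# Only blobs introduced after c2 are selected for scanning.
print(sorted(blobs_in_range(REPO, "c2", "c4")))
```

This is roughly the set that `git rev-list --objects since..until` reports for a real repository. Note the trade-off: blobs that are not reachable from any selected commit are never visited, which is exactly what the scan-all-blobs approach avoids.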

Another thing to consider is the CLI for this added feature. `noseyparker scan` currently takes a list of paths as inputs; these paths can be files or directories. Would the `--since_commits COMMIT` and `--until_commits COMMIT` options apply to all the specified paths? It might be better to extend the newly added `--git-url URL` input specifier to accept not just an HTTPS URL but also an optional Git revision specifier. So it might look something like `--git-url https://github.com/praetorian-inc/noseyparker.git@dae86e19..HEAD`.
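A sketch of how such a `URL@REVISION` specifier could be split apart, purely to illustrate the proposal. The `GitInput` type and `parse_git_input` function are hypothetical, and the sketch assumes the revision part contains no `/`, which would not hold for branch names such as `feature/x`.

```python
from typing import NamedTuple, Optional


class GitInput(NamedTuple):
    """Hypothetical parsed form of a `URL[@REVISION]` input specifier."""
    url: str
    since: Optional[str]
    until: Optional[str]


def parse_git_input(spec: str) -> GitInput:
    # Only treat '@' as a revision separator when it follows the last '/',
    # so that `https://user@host/repo.git` is left untouched.
    last_slash = spec.rfind("/")
    at = spec.rfind("@")
    if at <= last_slash:
        return GitInput(spec, None, None)

    url, rev = spec[:at], spec[at + 1:]
    if ".." in rev:
        # `A..B` range; an omitted right-hand side defaults to HEAD here.
        since, until = rev.split("..", 1)
        return GitInput(url, since or None, until or "HEAD")
    # A single revision: scan up to that revision with no lower bound.
    return GitInput(url, None, rev or None)


print(parse_git_input(
    "https://github.com/praetorian-inc/noseyparker.git@dae86e19..HEAD"
))
```

Attaching the range to each input, rather than to the whole command, sidesteps the question of whether one pair of commit options should apply to every path on the command line.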

bradlarsen commented 1 year ago

Another related change to this that I'd like to make in Nosey Parker is to keep track of which inputs have already been scanned, and avoid rescanning them if possible.

Currently, `noseyparker scan -d DATASTORE INPUT` will completely enumerate and scan `INPUT` from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+ GB of blobs), it still takes a couple of minutes. However, simply enumerating contents goes pretty quickly, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned and with which set of rules, it could avoid re-scanning things.
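The bookkeeping described above could look something like the following sketch. It is an illustration only, not how the Nosey Parker datastore works: `ScanLedger` and `rules_fingerprint` are hypothetical names, and the ledger lives in memory rather than in a datastore.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def rules_fingerprint(rule_patterns: Iterable[str]) -> str:
    """Stable digest of a rule set, independent of rule order."""
    digest = hashlib.sha256()
    for pattern in sorted(rule_patterns):
        digest.update(pattern.encode("utf-8"))
        digest.update(b"\0")  # separator so that ["ab"] != ["a", "b"]
    return digest.hexdigest()


class ScanLedger:
    """Hypothetical record of (blob ID, rule set) pairs already scanned."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def pending(self, blob_ids: Iterable[str], rules: Iterable[str]) -> List[str]:
        """Blobs that have not yet been scanned with this exact rule set."""
        fingerprint = rules_fingerprint(rules)
        return [b for b in blob_ids if (b, fingerprint) not in self._seen]

    def record(self, blob_ids: Iterable[str], rules: Iterable[str]) -> None:
        """Mark blobs as scanned with this rule set."""
        fingerprint = rules_fingerprint(rules)
        self._seen.update((b, fingerprint) for b in blob_ids)


ledger = ScanLedger()
rules = [r"AKIA[0-9A-Z]{16}", r"-----BEGIN [A-Z ]*PRIVATE KEY-----"]
ledger.record(["blob-a", "blob-b"], rules)

# Second run: enumeration is cheap, and only the new blob needs scanning.
print(ledger.pending(["blob-a", "blob-b", "blob-c"], rules))
```

Keying on the rule set as well as the blob ID matters: when the rules change, every blob becomes pending again, so new rules still get applied to old content.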

I'm going to make a separate issue for this.

Coruscant11 commented 1 year ago

> One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)

That is what I thought. With other scanners that go commit by commit, some repos can take a few hours to scan, while noseyparker took only 15 seconds. The purpose of this issue is to save time, but if the scanner is that fast, implementing it is not necessarily urgent.

Even so, a feature to scan a specific revision would be very nice, I think! And why not specify a datastore, as you said, to avoid duplicate scans? That would help the rare people who work on insanely huge repositories :smile:

For the Git revision scan, here is my personal use case:

1. You scan the whole repository.
2. You fetch the list of all commit hashes.
3. You save the commit list somewhere in order to keep the history.
4. Maybe one week later, you fetch only the newest commit hashes and make noseyparker scan all of the newest Git revisions.
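The bookkeeping in this workflow can be sketched as below. This is an illustration under assumptions: the JSON history file and the function names are hypothetical, and the commit lists would come from Git itself in practice.

```python
import json
from pathlib import Path
from typing import List


def load_history(path: Path) -> List[str]:
    """Commit hashes recorded by earlier runs (empty on the first run)."""
    if not path.exists():
        return []
    return json.loads(path.read_text())


def unscanned(all_commits: List[str], history: List[str]) -> List[str]:
    """Commits not yet recorded, in their original order."""
    done = set(history)
    return [c for c in all_commits if c not in done]


def save_history(path: Path, history: List[str], newly_scanned: List[str]) -> None:
    """Persist the old history plus the commits scanned in this run."""
    path.write_text(json.dumps(history + newly_scanned))


# First run: everything is new. One week later: only the new commits are.
first_run = ["a1", "b2", "c3"]
week_later = ["a1", "b2", "c3", "d4", "e5"]
print(unscanned(first_run, []))
print(unscanned(week_later, first_run))
```

Only commit hashes are stored here, never the findings themselves, which matches the idea of keeping scan history without keeping secrets.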

Datastores are nice, but I also think that in some cases you do not want to rely on them too much, for example in CI/CD, when you do not know where your program will run. That is what I am doing at work: I have a very small API whose only role is to save the commit scan history, not the secrets.

But this scanner seems so fast that this becomes a much smaller problem.

I had a question: in some repositories, the scanner finds the same number of distinct matches on every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.

Either way, noseyparker's scan method seems awesome. It is very fast and, as you said, it discovers many more things. :smile:

bradlarsen commented 1 year ago

> I had a question: in some repositories, the scanner finds the same number of distinct matches on every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.

I think you're talking about the summary table? For example, from scanning Nosey Parker's repo itself, you get something like this:

 Rule                                                      Distinct Matches   Total Matches
────────────────────────────────────────────────────────────────────────────────────────────
 PEM-Encoded Private Key                                                 76             276
 bcrypt Hash                                                             32             226
 Generic API Key                                                         25             131
 md5crypt Hash                                                           23             953
 Generic Secret                                                          23             245
 AWS API Key                                                             17              95
 Microsoft Teams Webhook                                                 12              12
 Credentials in PsExec                                                   11              12
 Azure App Configuration Connection String                               10              36

The numbers here for each rule indicate how many times that rule matched across all the scanned inputs.

Distinct Matches is the number of distinct groups extracted from the rule's regex (e.g., `951bc382db9abad29c68634761dd6e19` from the input `API_KEY = "951bc382db9abad29c68634761dd6e19"` for Generic API Key). This number is more representative of the number of unique things found from scanning.

Total Matches, in contrast, is simply the total number of times that rule matched across all the scanned inputs, without any concern for the content of regex groups. If some secret appears in 10 different files, those will all be included in Total Matches, even though they are all the same.

Distinct Matches will never be greater than Total Matches.
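The difference between the two columns can be reproduced with a small sketch. The regex below is a simplified stand-in invented for illustration, not Nosey Parker's actual Generic API Key rule, and `count_matches` is a hypothetical helper.

```python
import re
from typing import Iterable, Pattern, Tuple

# Simplified stand-in rule with one capture group (an assumption, not the
# real Nosey Parker rule).
API_KEY_RE = re.compile(r'API_KEY\s*=\s*"([0-9a-f]{32})"')


def count_matches(pattern: Pattern[str], inputs: Iterable[str]) -> Tuple[int, int]:
    """Return (distinct matches, total matches) for one rule."""
    total = 0
    distinct = set()
    for text in inputs:
        for match in pattern.finditer(text):
            total += 1                     # every hit counts toward the total
            distinct.add(match.groups())   # identical groups collapse to one
    return len(distinct), total


FILES = [
    'API_KEY = "951bc382db9abad29c68634761dd6e19"',
    'API_KEY = "951bc382db9abad29c68634761dd6e19"  # copied config',
    'API_KEY="0123456789abcdef0123456789abcdef"',
]

distinct, total = count_matches(API_KEY_RE, FILES)
print(distinct, total)  # prints "2 3"
```

The same key appearing in two files adds two to the total but only one to the distinct count, which is why the distinct count can never exceed the total.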

Coruscant11 commented 1 year ago

Oh sorry, maybe I explained it badly.

I think an image will be clearer :laughing:

[Image: Screenshot from 2023-02-20 20-59-33]

Between two runs, the total match counts are not the same in very large repositories.

bradlarsen commented 1 year ago

@Coruscant11 That is surprising.

If you run `noseyparker summarize --datastore np.vegas` multiple times, does it always report the same numbers, or do those change from run to run?

Coruscant11 commented 1 year ago

It seems that it is related to the scan:

I will create an issue for this later :smile:

bradlarsen commented 1 year ago

Yeah that doesn't look right! Thanks for reporting that. A separate issue would be perfect.

bradlarsen commented 1 year ago

@Coruscant11 I created a new issue for the strange behavior you see: #32

bradlarsen commented 1 year ago

I've heard it would also be useful to have an option to skip digging into Git history altogether. Noting that here.

praetorian-inc / noseyparker

Add commits range for `scan` in Git repositories #29