Open Coruscant11 opened 1 year ago
Hi @Coruscant11. This is a good use case and a feature that would be nice to have.
One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)
To add this feature to Nosey Parker, we would need to add some alternative Git enumeration mechanism that would walk commit-by-commit and only select blobs reachable from the desired set of commits. The current source for Git repo enumeration is here.
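A rough sketch of what such a commit-range enumeration could look like, driving the stock `git` CLI from Python rather than Nosey Parker's own Rust enumerator. The function names and the two-step approach (list reachable objects, then filter by type) are illustrative assumptions, not the project's implementation:

```python
import subprocess


def parse_rev_list_objects(output: str) -> list:
    """Extract object IDs from `git rev-list --objects` output.

    Each non-empty line is "<oid>" or "<oid> <path>".
    """
    return [line.split(" ", 1)[0] for line in output.splitlines() if line.strip()]


def parse_batch_check(output: str) -> list:
    """Keep only blob IDs from `git cat-file --batch-check` output.

    The default line format is "<oid> <type> <size>"; missing objects
    are reported in a different shape and are skipped here.
    """
    blobs = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "blob":
            blobs.append(parts[0])
    return blobs


def blobs_in_range(repo: str, rev_range: str) -> list:
    """List blobs reachable from the commits in `rev_range` (e.g. "dae86e19..HEAD").

    Hypothetical helper: shells out to git instead of reading the object
    database directly.
    """
    listed = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", rev_range],
        check=True, capture_output=True, text=True,
    ).stdout
    oids = parse_rev_list_objects(listed)
    if not oids:
        return []
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file", "--batch-check"],
        input="\n".join(oids) + "\n",
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked)
```

Note the trade-off mentioned above: blobs that are not reachable from any commit in the range (for example, dangling objects) would not be seen by this approach.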
Another thing to consider is the CLI for this added feature. `noseyparker scan` currently takes a list of paths as inputs; these paths can be files or directories. Would the `--since_commits COMMIT` and `--until_commits COMMIT` options apply to all the specified paths? It might be better to extend the newly-added `--git-url URL` input specifier to accept not just an HTTPS URL but also an optional Git revision specifier. So it might look something like `--git-url https://github.com/praetorian-inc/noseyparker.git@dae86e19..HEAD`.
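To make the proposed `URL@REVSPEC` form concrete, here is a minimal parsing sketch. The function name `split_git_url` and the rule "the first `@` in the URL path starts the revision specifier" are assumptions for illustration; no such syntax exists in Nosey Parker yet:

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit


def split_git_url(spec: str) -> Tuple[str, Optional[str]]:
    """Split "URL@REVSPEC" into (URL, REVSPEC); REVSPEC is None if absent.

    Only an "@" inside the path component counts, so credentials such as
    "https://user@host/repo.git" are left untouched.
    """
    parts = urlsplit(spec)
    path, sep, rev = parts.path.partition("@")
    if not sep or not rev:
        return spec, None
    return urlunsplit(parts._replace(path=path)), rev


url, rev = split_git_url(
    "https://github.com/praetorian-inc/noseyparker.git@dae86e19..HEAD"
)
print(url)  # https://github.com/praetorian-inc/noseyparker.git
print(rev)  # dae86e19..HEAD
```

A repository path that itself contains `@` would be misparsed by this rule, so a real implementation would need an escaping convention or a separate option.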
Another related change to this that I'd like to make in Nosey Parker is to keep track of which inputs have already been scanned, and avoid rescanning them if possible.
Currently, `noseyparker scan -d DATASTORE INPUT` will completely enumerate and scan `INPUT` from scratch. Nosey Parker is fast, but for large repositories (like the Linux kernel, with 100+ GB of blobs), it still takes a couple of minutes. However, simply enumerating contents goes pretty quickly, especially in Git repositories (e.g., the Linux kernel repo can be enumerated in 13-25 seconds, depending on filesystem cache). If Nosey Parker kept track of which blobs it had scanned and with which set of rules, it could avoid re-scanning things.
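One way to picture that bookkeeping is a small ledger keyed by blob ID and a fingerprint of the rule set. This is a hypothetical sketch: the `ScanLedger` class, its table layout, and the fingerprinting scheme are made up for illustration and say nothing about how the Nosey Parker datastore is actually organized:

```python
import hashlib
import sqlite3


def rules_fingerprint(rules: list) -> str:
    """Order-independent SHA-256 fingerprint of a set of rule patterns."""
    h = hashlib.sha256()
    for rule in sorted(rules):
        h.update(rule.encode("utf-8"))
        h.update(b"\0")  # separator so ["ab"] and ["a", "b"] differ
    return h.hexdigest()


class ScanLedger:
    """Remembers which blobs were scanned with which rule set."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS scanned ("
            "blob_id TEXT NOT NULL, rules TEXT NOT NULL, "
            "PRIMARY KEY (blob_id, rules))"
        )

    def pending(self, blob_ids: list, rules: list) -> list:
        """Return the blobs not yet scanned with this exact rule set."""
        fp = rules_fingerprint(rules)
        query = "SELECT 1 FROM scanned WHERE blob_id = ? AND rules = ?"
        return [
            b for b in blob_ids
            if self.db.execute(query, (b, fp)).fetchone() is None
        ]

    def mark_scanned(self, blob_ids: list, rules: list) -> None:
        """Record that these blobs were scanned with this rule set."""
        fp = rules_fingerprint(rules)
        with self.db:  # commits on success
            self.db.executemany(
                "INSERT OR IGNORE INTO scanned VALUES (?, ?)",
                [(b, fp) for b in blob_ids],
            )
```

With this shape, enumeration still runs on every scan, but only the blobs returned by `pending` need to go through the matcher; changing the rules changes the fingerprint and forces a rescan.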
I'm going to make a separate issue for this.
> One challenge with implementing this in Nosey Parker is that Git repos are not scanned commit-by-commit, but instead, all blobs found in the repository are scanned. (This Git scanning technique uncovers more things than going commit-by-commit.)
That is what I thought. With other scanners that go commit by commit, some repos can take a few hours to scan, while noseyparker took only 15 seconds. The purpose of this issue is to save time, but if the scanner is that fast, implementing it is not necessarily urgent.
But even so, a feature to scan a specific revision would be very nice, I think! And why not use a datastore, as you said, to avoid duplicate scans, for the rare people who work on insanely huge repositories :smile:
For the Git revision scan, here is my personal use case:
Datastores are nice, but I also think that in some cases you do not want to rely on them too much, for example in CI/CD, when you do not know where your program will run. That is what I am doing at work: I have a very tiny API whose only role is to save the commit scan history, not the secrets.
But this scanner seems so fast that this becomes a much smaller problem.
I had a question: in some repositories, the scanner finds the same number of distinct matches at every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.
Either way, the scan method of noseyparker seems awesome: very fast and, as you said, it discovers way more things. :smile:
> I had a question: in some repositories, the scanner finds the same number of distinct matches at every run, but not the same number of total matches. Do you know why? I do not think it is an issue, but I was wondering why.
I think you're talking about the summary table? For example, from scanning Nosey Parker's repo itself, you get something like this:
Rule                                       Distinct Matches   Total Matches
───────────────────────────────────────────────────────────────────────────
PEM-Encoded Private Key                                  76             276
bcrypt Hash                                              32             226
Generic API Key                                          25             131
md5crypt Hash                                            23             953
Generic Secret                                           23             245
AWS API Key                                              17              95
Microsoft Teams Webhook                                  12              12
Credentials in PsExec                                    11              12
Azure App Configuration Connection String                10              36
The numbers here for each rule indicate how many times that rule matched across all the scanned inputs.
`Distinct Matches` is the number of distinct groups extracted from the rule's regex (e.g., `951bc382db9abad29c68634761dd6e19` from the input `'API_KEY = "951bc382db9abad29c68634761dd6e19"'` for `Generic API Key`). This number is more representative of the number of unique things found from scanning.
`Total Matches`, in contrast, is simply the total number of times that rule matched across all the scanned inputs, without any concern for the content of regex groups. If some secret appears in 10 different files, those will all be included in `Total Matches`, even though they are all the same.
`Distinct Matches` will never be greater than `Total Matches`.
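The difference between the two columns can be reproduced in a few lines. This is a toy illustration only: the regex below is a simplified stand-in invented for this example, not the actual `Generic API Key` rule, and `summarize` is a hypothetical helper:

```python
import re

# Simplified stand-in pattern; the capture group holds the secret itself.
TOY_API_KEY_RULE = re.compile(r'API_KEY\s*=\s*"([0-9a-f]{32})"')


def summarize(inputs: list) -> tuple:
    """Return (distinct_matches, total_matches) for the toy rule."""
    total = 0
    distinct_groups = set()
    for text in inputs:
        for match in TOY_API_KEY_RULE.finditer(text):
            total += 1                            # every hit counts
            distinct_groups.add(match.group(1))   # identical secrets collapse
    return len(distinct_groups), total


# The same secret in three inputs, plus one different secret.
inputs = ['API_KEY = "951bc382db9abad29c68634761dd6e19"'] * 3
inputs.append('API_KEY = "00000000000000000000000000000000"')
print(summarize(inputs))  # (2, 4)
```

Because every distinct group comes from at least one match, the first number can never exceed the second.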
Oh sorry, maybe I explained it badly.
I think an image will be clearer :laughing:
Across two runs, the total match counts are not the same in very large repositories.
@Coruscant11 That is surprising.
If you run `noseyparker summarize --datastore np.vegas` multiple times, does it always report the same numbers, or do those change from run to run?
It seems to be related to the scan:
I will create an issue for this later :smile:
Yeah that doesn't look right! Thanks for reporting that. A separate issue would be perfect.
@Coruscant11 I created a new issue for the strange behavior you see: #32
I've heard it would also be useful to have an option to skip digging into Git history altogether. Noting that here.
Hi :wave:
A great option in a secret scanner is the ability to scan a range of commits, for example by adding an option to `scan`.
In my case, we use scanners on very large repositories. Once findings are reported, future runs do not need to rescan previously scanned commits; only new commits are relevant. This saves a lot of time in large repositories.
Gitleaks has this feature, and Trufflehog too.
For example, a `since_commits` option, scanning between a specific commit and `HEAD`. And why not an `until_commits` option.
Do you see any blocking issues for this enhancement?
:smile: