Open rgmz opened 1 month ago
I do like the {valid,invalid} bits in the name; that would make it easier to spot in the test results when we get a false positive. :ok_hand:
I was thinking that instead of grouping by rule-id we'd group by platform (Slack, Kubernetes, npm, PyPI, etc.) so that I can keep the different types of secrets together. I'm also intentionally keeping my rule IDs as just random characters, so it might get hard to tell what's what at a glance in the repo.
Yeah, I think the structure like this would be good.
fake-leaks
├── aws
│   ├── invalid
│   │   ├── iam-unique-identifier
│   │   └── secret-key
│   └── valid
│       ├── iam-unique-identifier
│       └── secret-key
└── github
    ├── invalid
    │   ├── fine-grained-personal-access-token
    │   ├── personal-access-token
    │   └── refresh-token
    └── valid
        ├── fine-grained-personal-access-token
        ├── personal-access-token
        └── refresh-token
I also don't want to tie it specifically to the patterns tests. I want it to be a useful example repo for anyone that wants to use it and for other tools to test against, so I think that's generic enough to support that.
@thewizzy && @abutcher does that seem legit to y'all too?
Or actually it might make sense to do this structure instead:
fake-leaks
├── aws
│   ├── iam-unique-identifier
│   │   ├── invalid
│   │   └── valid
│   └── secret-key
│       ├── invalid
│       └── valid
└── github
    ├── fine-grained-personal-access-token
    │   ├── invalid
    │   └── valid
    ├── personal-access-token
    │   ├── invalid
    │   └── valid
    └── refresh-token
        ├── invalid
        └── valid
That'd group the types of secrets together at least.
Anyone here have a preference?
Or actually it might make sense to do this structure instead:
I think grouping all secrets into a single file could make it harder to identify false negatives.
I think grouping all secrets into a single file could make it harder to identify false negatives.
It'd group them by folder, not by file. They'd still be in separate valid/invalid files. The benefit of the second structure is just that all github refresh tokens would be in the same folder instead of spread across two folders, etc.
So if you're adding a new type of secret you could do it all from within the same folder.
Ah, so valid/invalid are folders here and it's just grouped differently? That makes sense — I thought they were files.
fake-leaks
├── aws
│   ├── iam-unique-identifier
│   │   ├── invalid
│   │   └── valid
Heh, sorry for the back and forth, I may have misunderstood this:
I think grouping all secrets into a single file could make it harder to identify false negatives.
So like aws/iam-unique-identifier/valid would be a file with things like:
# Access key
AKIAI44QH8DHBGXFM9LE
# Managed policy
ANPA44QH8DHBGXFM9LE
...
So like aws/iam-unique-identifier/invalid would be a file with things like:
# Example Access key
AKIAI44QH8DHBEXAMPLE
# Managed policy
ANPA44QH8DHBEXAMPLE
...
But now I do see what you mean about it being harder to tell if both keys were matched in aws/iam-unique-identifier/valid. It would be handy to have a script that checks whether all valid keys were accounted for in the test results.
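A rough sketch of what such a check could look like, assuming the sample files use the format shown above (a `#` label line followed by one secret per line) and that the scanner's matches are available as a plain set of strings. The function names here are hypothetical and not part of any existing tool:

```python
def expected_secrets(text: str) -> list[str]:
    """Return the secrets listed in a sample file, one per line."""
    secrets = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, "# ..." labels, and a literal "..." placeholder.
        if not line or line.startswith("#") or line == "...":
            continue
        secrets.append(line)
    return secrets


def missing_secrets(text: str, matched: set[str]) -> list[str]:
    """Secrets present in the file that the scanner did not report."""
    return [s for s in expected_secrets(text) if s not in matched]


sample = """\
# Access key
AKIAI44QH8DHBGXFM9LE
# Managed policy
ANPA44QH8DHBGXFM9LE
"""

# Pretend the scanner only reported the access key.
print(missing_secrets(sample, {"AKIAI44QH8DHBGXFM9LE"}))
# prints ['ANPA44QH8DHBGXFM9LE']
```

This works, but it depends on parsing each file's contents, which is part of why several secrets per file are awkward to verify.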
Maybe something like
fake-leaks
├── aws
│   ├── normal-access-key.valid
│   ├── normal-access-key.invalid
│   ├── quoted-access-key-in-json.valid
│   ├── b64-containing-a-segment-that-looks-like-an-access-key.invalid
│   └── <short-description>.{valid,invalid}
...
With only one item per file. It'd keep it pretty flat and be easy to automate checking that everything was found.
Thoughts?
Or actually we wouldn't want the valid/invalid to be the extension because the file name matters in some cases.
H'okay, how about:
fake-leaks
└── {type-or-service}
    └── {valid,invalid}
        └── {short-description}
where {short-description} can be a file OR a folder, depending on whether the path is important, but each {short-description} only contains one item?
I think that should cover the use cases.
Like I could have
fake-leaks/htpasswd/valid/normal-htpasswd-file/.htpasswd
fake-leaks/htpasswd/invalid/htpasswd-docs/htpasswd.md
fake-leaks/github/valid/fine-grained-personal-access-token
(this one is just a file and not a folder)
I think grouping them by platform will help to organise it long term. Your last suggestion would work well.
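For what it's worth, with one item per {short-description} the "was everything found" check becomes a path comparison. Here is a minimal sketch, assuming the scanner can give us the file paths it flagged, relative to fake-leaks/. The names `case_of`, `all_cases`, and `check` are made up for illustration:

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, Optional, Set, Tuple

# (type-or-service, "valid" or "invalid", short-description)
Case = Tuple[str, str, str]


def case_of(rel_path: str) -> Optional[Case]:
    """Map a path relative to fake-leaks/ to the case that owns it."""
    parts = PurePosixPath(rel_path).parts
    if len(parts) < 3 or parts[1] not in ("valid", "invalid"):
        return None  # not part of the agreed layout (README, .git, ...)
    return (parts[0], parts[1], parts[2])


def all_cases(root: Path) -> Set[Case]:
    """Every {short-description} entry under root, whether file or folder."""
    cases = set()
    for entry in root.glob("*/*/*"):
        case = case_of(entry.relative_to(root).as_posix())
        if case is not None:
            cases.add(case)
    return cases


def check(cases: Set[Case], reported: Iterable[str]):
    """Return (false_negatives, false_positives) as sorted lists of cases."""
    hit = {c for c in map(case_of, reported) if c is not None}
    false_negatives = sorted(c for c in cases if c[1] == "valid" and c not in hit)
    false_positives = sorted(c for c in cases if c[1] == "invalid" and c in hit)
    return false_negatives, false_positives


# The three example paths from this thread, written out as cases.
cases = {
    ("htpasswd", "valid", "normal-htpasswd-file"),
    ("htpasswd", "invalid", "htpasswd-docs"),
    ("github", "valid", "fine-grained-personal-access-token"),
}
# Hypothetical scanner output: it missed the GitHub token and flagged the docs.
reported = [
    "htpasswd/valid/normal-htpasswd-file/.htpasswd",
    "htpasswd/invalid/htpasswd-docs/htpasswd.md",
]
fn, fp = check(cases, reported)
print(fn)  # [('github', 'valid', 'fine-grained-personal-access-token')]
print(fp)  # [('htpasswd', 'invalid', 'htpasswd-docs')]
```

In a real run `all_cases(Path("fake-leaks"))` would supply `cases`. Because a case is identified by its first three path components, it doesn't matter whether {short-description} is a single file or a folder with a path-sensitive file inside it.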
Then during the next round of tuning I'll start chipping away at it some.
Any PRs on it are totally welcome 👌
Another thought from #6:
Should this be nested under jvm/ so that JVM/Java-related cases don't pollute the top level?
If we want to group related cases by a larger category (e.g., Maven, Gradle, Spring Boot, Java are all "jvm"), would that be:
A)
jvm/
├─ valid/
│  ├─ maven-settings-password.xml
├─ invalid/
│  ├─ spring-config-credentials.yaml
or B)
jvm/
├─ maven/
│  ├─ invalid/
│  ├─ valid/
│  │  ├─ maven-settings-password.xml
├─ spring/
│  ├─ valid/
│  ├─ invalid/
│  │  ├─ spring-config-credentials.yaml
Cons of both:
Originally posted by @bplaxco in https://github.com/leaktk/fake-leaks/issues/3#issuecomment-2358511113
What did you have in mind for organization? In my test suite I organized things by rule-id/{valid,invalid}/*, which made it easy to detect FP & FN.