Repository organization

rgmz commented 1 month ago

I'm slowly working on restructuring that repo to be more organized. Most of it was dumped from the automatic output of the old test framework, which generated a lot of the cases on the fly.

Any interest in adding things and/or organizing is totally welcome!

Originally posted by @bplaxco in https://github.com/leaktk/fake-leaks/issues/3#issuecomment-2358511113

What did you have in mind for organization? In my test suite I organized things by rule-id/{valid,invalid}/* which made it easy to detect FP & FN.

bplaxco commented 1 month ago

I do like the {valid,invalid} bits in the name, that would make it easier to spot in the test results when we got a false positive. :ok_hand:

I was thinking that instead of organizing by rule-id we'd organize by platform (Slack, Kubernetes, npm, PyPI, etc.) so that I can group the different types of secrets together. I'm also intentionally keeping my rule ids as just random characters, so with rule-id folders it might get hard to tell what's what at a glance in the repo.

Yeah, I think a structure like this would be good.

fake-leaks
├── aws
│   ├── invalid
│   │   ├── iam-unique-identifier
│   │   └── secret-key
│   └── valid
│       ├── iam-unique-identifier
│       └── secret-key
└── github
    ├── invalid
    │   ├── fine-grained-personal-access-token
    │   ├── personal-access-token
    │   └── refresh-token
    └── valid
        ├── fine-grained-personal-access-token
        ├── personal-access-token
        └── refresh-token

I also don't want to tie it specifically to the patterns tests. I want it to be a useful example repo for anyone that wants to use it and for other tools to test against, so I think that's generic enough to support that.

@thewizzy && @abutcher does that seem legit to y'all too?

bplaxco commented 1 month ago

Or actually it might make sense to do this structure instead:

fake-leaks
├── aws
│   ├── iam-unique-identifier
│   │   ├── invalid
│   │   └── valid
│   └── secret-key
│       ├── invalid
│       └── valid
└── github
    ├── fine-grained-personal-access-token
    │   ├── invalid
    │   └── valid
    ├── personal-access-token
    │   ├── invalid
    │   └── valid
    └── refresh-token
        ├── invalid
        └── valid

That'd group the types of secrets together at least.

Anyone here have a preference?

rgmz commented 1 month ago

Or actually it might make sense to do this structure instead:

I think grouping all secrets into a single file could make it harder to identify false negatives.

bplaxco commented 1 month ago

I think grouping all secrets into a single file could make it harder to identify false negatives.

It'd group them by folder, not by file. They'd still be in separate valid/invalid files. The benefit of the second structure is just that all github refresh tokens would be in the same folder instead of spread across two folders, etc.

So if you're adding a new type of secret you could do it all from within the same folder.

rgmz commented 1 month ago

Ah, so valid/invalid are folders here and it's just grouped differently? That makes sense — I thought they were files.

fake-leaks
├── aws
│   ├── iam-unique-identifier
│   │   ├── invalid
│   │   └── valid

bplaxco commented 1 month ago

Heh, sorry for the back and forth, I may have misunderstood this:

I think grouping all secrets into a single file could make it harder to identify false negatives.

So like aws/iam-unique-identifier/valid would be a file with things like:

# Access key
AKIAI44QH8DHBGXFM9LE

# Managed policy
ANPA44QH8DHBGXFM9LE

...

So like aws/iam-unique-identifier/invalid would be a file with things like:

# Example Access key
AKIAI44QH8DHBEXAMPLE

# Managed policy
ANPA44QH8DHBEXAMPLE

...

But now I do see what you mean about it being harder to tell if both keys were matched in aws/iam-unique-identifier/valid. It would be handy to be able to have a script check to tell if all valid keys were accounted for in the test results.
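A rough sketch of what such a check could look like for the multi-entry file format shown above. It assumes one secret per non-blank, non-comment line and that the scanner's results can be reduced to a set of matched strings; the function names are made up for illustration and are not part of any existing tool.

```python
def expected_secrets(fixture_text: str) -> list[str]:
    """Return the non-blank, non-comment lines of a fixture file."""
    secrets = []
    for line in fixture_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            secrets.append(stripped)
    return secrets


def missed_secrets(fixture_text: str, reported: set[str]) -> list[str]:
    """List fixture entries the scanner did not report (false negatives)."""
    return [s for s in expected_secrets(fixture_text) if s not in reported]


fixture = """\
# Access key
AKIAI44QH8DHBGXFM9LE

# Managed policy
ANPA44QH8DHBGXFM9LE
"""

# Hypothetical scanner output: only the access key was found.
print(missed_secrets(fixture, {"AKIAI44QH8DHBGXFM9LE"}))
# prints ['ANPA44QH8DHBGXFM9LE']
```

This works, but it depends on parsing the fixture contents, which is part of why one item per file may be simpler to automate.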

Maybe something like

fake-leaks
├── aws
│   ├── normal-access-key.valid
│   ├── normal-access-key.invalid
│   ├── quoted-access-key-in-json.valid
│   ├── b64-containing-a-segment-that-looks-like-an-access-key.invalid
│   └── <short-description>.{valid,invalid}
...

With only one item per file. It'd keep it pretty flat and be easy to automate checking that everything was found.

Thoughts?

bplaxco commented 1 month ago

Or actually we wouldn't want the valid/invalid to be the extension because the file name matters in some cases.

bplaxco commented 1 month ago

H'okay, how about:

fake-leaks
   {type-or-service}
       {valid,invalid}
           {short-description}

where {short-description} can be a file OR a folder depending on whether the path is important, but each {short-description} contains only one item?

I think that should cover the use cases.

Like I could have

fake-leaks/htpasswd/valid/normal-htpasswd-file/.htpasswd
fake-leaks/htpasswd/invalid/htpasswd-docs/htpasswd.md
fake-leaks/github/valid/fine-grained-personal-access-token (this one is just a file and not a folder)
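
With one item per {short-description}, checking that everything was found could be fairly mechanical. Here is a minimal sketch, assuming the layout above and assuming the scanner's findings are available as paths relative to the repo root. `list_cases` and `score` are hypothetical helpers, not something that exists in the repo today.

```python
import tempfile
from pathlib import Path


def list_cases(root: Path) -> dict[str, bool]:
    """Map each case path (relative to root) to True if it should be detected."""
    cases = {}
    for type_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for label in ("valid", "invalid"):
            label_dir = type_dir / label
            if not label_dir.is_dir():
                continue
            # Each child is one case, whether it is a file or a folder.
            for case in sorted(label_dir.iterdir()):
                cases[case.relative_to(root).as_posix()] = label == "valid"
    return cases


def score(cases: dict[str, bool], reported_paths: list[str]) -> tuple[list[str], list[str]]:
    """Return (false_negatives, false_positives) as lists of case paths."""
    hit = set()
    for path in reported_paths:
        # A finding at <type>/<label>/<case>/... counts for <type>/<label>/<case>.
        hit.add("/".join(path.split("/")[:3]))
    false_negatives = [c for c, valid in cases.items() if valid and c not in hit]
    false_positives = [c for c, valid in cases.items() if not valid and c in hit]
    return false_negatives, false_positives


# Build a throwaway copy of the three example paths to try it out.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "htpasswd/valid/normal-htpasswd-file/.htpasswd",
        "htpasswd/invalid/htpasswd-docs/htpasswd.md",
        "github/valid/fine-grained-personal-access-token",
    ):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")
    cases = list_cases(root)

# Hypothetical findings: both htpasswd cases flagged, the GitHub token missed.
reported = [
    "htpasswd/valid/normal-htpasswd-file/.htpasswd",
    "htpasswd/invalid/htpasswd-docs/htpasswd.md",
]
print(score(cases, reported))
```

In this made-up run the GitHub token case would come back as a false negative and the htpasswd docs case as a false positive.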

thewizzy commented 1 month ago

I think grouping them by platform will help to organise it long term. Your last proposal would work well.

bplaxco commented 1 month ago

Then during the next round of tuning I'll start chipping away at it some.

Any PRs on it are totally welcome 👌

rgmz commented 1 month ago

Another thought from #6:

Should this be nested under jvm/ so that JVM/Java-related cases don't pollute the top level?

If we want to group related cases by a larger category (e.g., Maven, Gradle, Spring Boot, Java are all "jvm"), would that be:

A)

jvm/
├─ valid/
│  ├─ maven-settings-password.xml
├─ invalid/
│  ├─ spring-config-credentials.yaml

or B)

jvm/
├─ maven/
│  ├─ invalid/
│  ├─ valid/
│  │  ├─ maven-settings-password.xml
├─ spring/
│  ├─ valid/
│  ├─ invalid/
│  │  ├─ spring-config-credentials.yaml

Cons of both:

A) could become messy and difficult to manage if you have a large number of sub-categories, each with a large number of valid/invalid cases
B) could create a needless amount of folders/nesting
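
For what it's worth, tooling might not need to care which of the two we pick. As a sketch, assuming a test harness keys off the first valid/invalid folder in the path rather than a fixed depth, both A and B parse the same way. `Case` and `parse_case` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class Case(NamedTuple):
    category: str  # everything before the valid/invalid folder, e.g. "jvm/maven"
    valid: bool    # True if the case sits under a "valid" folder
    name: str      # everything after it, e.g. "maven-settings-password.xml"


def parse_case(path: str) -> Optional[Case]:
    """Split a repo-relative path at its first valid/invalid folder.

    Returns None when the path has no such folder with something on both
    sides of it. Assumes no category is itself named "valid" or "invalid".
    """
    parts = path.strip("/").split("/")
    for i, part in enumerate(parts):
        if part in ("valid", "invalid") and 0 < i < len(parts) - 1:
            return Case("/".join(parts[:i]), part == "valid", "/".join(parts[i + 1:]))
    return None


# Option A (flat) and option B (nested) both resolve to a category and a label.
print(parse_case("jvm/valid/maven-settings-password.xml"))
print(parse_case("jvm/spring/invalid/spring-config-credentials.yaml"))
```

So the choice between A and B could be made on readability alone, and could even differ per category.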

Repository organization #4