Using regex to scan for specific credentials

ericwb commented 5 years ago

Is your feature request related to a problem? Please describe. See https://github.com/Yelp/detect-secrets for reference, but it would be useful for Precaution to also warn on secrets, Confidential code headers, etc. So level of detection is done through GitHub token scanning, but not all.

Note that also detect-secrets operates on all files (json, txt, etc), and not just languages files. There are definitely cases where API keys are stuffed into shell scripts, json files, yaml, etc.

Describe the solution you'd like A new, slightly different type of scanner. Not really a linter. And it would scan all types of file extensions, not just language code, so the download of files would need to include everything.

Describe alternatives you've considered Token scanning provided by GitHub does some things by not all. I don't think it would find a Confidential header whereas detect-secrets could be configured to do so. But it would be good to research and do a comparison analysis.

There are also probably many other CLI tools to do similar scans. This was one recommended from a blog article.

Additional context

ericwb commented 5 years ago

This integration could also add some differentiation as compared to similar competing Apps

ericwb commented 5 years ago

Note; detect-secrets overlaps with language specific linters somewhat. It's KeywordDetector plugin will detect variables named password or secrets, same as Bandit.

And many of the other plugins it has could be integrated into language specific linters, since they would be better at detecting using an AST rather than regex.

However, non-language files (.py, .go, .js, .ts) would be best detected by detect-secrets. And there are examples of people putting secret keys in json files, yaml files, etc.

ericwb commented 5 years ago

So should detect-secrets only be used on files not scanned by other linters?

joshuagl commented 5 years ago

I like the idea of detect-secrets only scanning files not scanned by other linters, but I don't think it's as cut and dry as not scanning files scanned by other linters: 1) linters may not support secret detection (i.e. I don't think tslint does) 2) a linters secret detection may not be as complete as detect-secret's , for example I don't think many language security linters do high entropy string detection?

There's a few approaches we could take here:

try to ensure (through contributions, where necessary) that all linters we use have comprehensive secret detection: i. we could ramp up to this by having detect-secrets exclude not run against a list of file types for which we have confidence the language linter handles secret detection. I think this list would start out at zero and increase as we work with the upstream linters?
have some logic to prevent reporting the same/similar issues for the same line of code. This might mean: a. having a model to map similar functionality across linters/scanners which will be run against the same code b. assigning priorities to the linters c. filtering out duplicate reports from lower priority scanners based on the functionality map and priorities

MVrachev commented 5 years ago

There are a few particularities in detect-secrets:

it doesn't provide "severity" and "confidence" level. We should discuss what approach we should use
it doesn't have good documentation for every plugin it uses. The only documentation I could find is this: https://github.com/Yelp/detect-secrets#currently-supported-plugins
it doesn't provide detail information for every issue like the other linters. That means that because the "message" attribute is mandatory in the annotation object https://developer.github.com/v3/checks/runs/#annotations-object we should figure out what detail message we would provide for the different issues it finds.

We should discuss what to do with those.

MVrachev commented 5 years ago

I think that we can try and see how an option 2

have some logic to prevent reporting the same/similar issues for the same line of code.

implemented with the logic from subpoint c

c. filtering out duplicate reports from lower priority scanners based on the functionality map and priorities

MVrachev commented 5 years ago

Here is the list of the pull requests/features we should add to fully integrate detect-secrets into Precaution:

[ ] Spawn functionality + unit tests for it
[ ] Report functionality + unit tests for it
[ ] Add detect-secrets in the README.md and in the doc pages:
- initial_setup
- false-positives
- setting up for manual deployment?

MVrachev commented 5 years ago

Currently, detect-secrets doesn't give you the option to use it with multiple parameters like that: detect-secrets scan test.js test.yaml https://github.com/Yelp/detect-secrets/issues/180

That's why to achieve scanning multiple parameters we have to call it like that: detect-secrets scan .

The problem is that currently, we are using two folders for our temporary files which will be processed cache/go/src/<repo_id>/<pr_id>/ for all Go files and cache/<repo_id>/<pr_id>/ for all other files. If you want to read why we use two folders read the discussions here:

What happens because of that is that when we start detect-secrets from the cache/go/src/<repo_id>/<pr_id>/ directory it will scan all go files with ease but it will miss all other files in the cache/<repo_id>/<pr_id>/ folder.

If we start detect-secrets from cache/<repo_id>/<pr_id>/ folder we would scan all Go files below and if there are not deleted files from an old scan we would report for their errors too.

Because of all those problems, I believe that first, we should create a Docker image which will allow us to use Go modules in Gosec (as described here: https://github.com/vmware/precaution/issues/227) and then we would continue with the detect-secret spawn functionality.

MVrachev commented 5 years ago

I made a research for other tools besides detect-secrets and here are my findings:

Tools using the Shannon entropy for minimizing the false-positives:

There are a lot of tools focused on finding secrets which are working on Git repositories. Those tools won't work for us because we are scanning files under the cache/ folder. But cache/ is in our gitignore file so those files will be missed.

nishakm commented 5 years ago

How about scanning one at a time?

MVrachev commented 5 years ago

How about scanning one at a time?

detect-secrets doesn't support multiple parameters right now: https://github.com/Yelp/detect-secrets/issues/180 which means that we should give one parameter - a folder to detect-secrets which contains all files to be scanned

As I mentioned in this comment https://github.com/vmware/precaution/issues/209#issuecomment-494351377 we have a problem with that which will be resolved if we have Dockerfile.

MVrachev commented 5 years ago

We reevaluated the benefits of using detect-secrets HighEntropyString plugin. We found out that this plugin won't work for us because we used the default entropy configuration and then detect-secrets didn't catch an obvious hardcoded credentials. @joshuagl decided to configure it a little bit and he got a lot of false positives.

It's meaningless to use detect-secrets without the HighEntropyString and that's why we decided that we would write our own implementation using regular expressions to scan for specific tokens. Because that would be a nontrivial feature to implement we decided to move this feature to the Near future milestone.

MVrachev commented 5 years ago

It's important when implementing the regex scan how we would store the files for scan in the memory because Heroku provides us with limited RAM https://devcenter.heroku.com/articles/limits#dyno-memory. One of the ways to process the files will be by fixing the maximum amount of memory we want our files to use and then before opening a file:

check the size of the file
if the currentUsedMemory + the size of the file > maximumRAMusage then we have to use 4096 bytes (or another size) of blocks to process that file
if we can load the file in memory and use less or equal memory of the defined maximumRAMusage memory then load the whole file in the memory

PS: Closed the issue by mistake

securesauce / precaution-beta

Using regex to scan for specific credentials #209