Faster include/exclude parsing

kiwiz commented 1 year ago

Is your feature request related to a problem? Please describe.

I'm running Semgrep within a large (~98K files) monorepo. When scanning individual projects, I can pass them to semgrep as targets. However, for coverage reasons, I'd also like to scan all files that are not part of any project. I currently do this by passing a large number (~850) of --exclude arguments, which is really slow. I've done some profiling, and the bulk of the execution time is spent in TargetManager.globfilter. I've made some minor optimizations on my fork (I'll open a PR once I've fully tested them). However, I'd also like to discuss alternatives or additions to the include/exclude functionality.

Describe the solution you'd like

Perhaps something like a --include-dir/--exclude-dir that does exact prefix matches?
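To illustrate why exact prefix matching could be cheaper than glob matching, here is a minimal sketch. It is not Semgrep code: `build_excluded_dirs` and `is_excluded` are hypothetical helper names, and the sketch assumes POSIX-style relative paths. Each file is checked by looking up its ancestor directories in a set, so the cost per file depends on path depth rather than on the number of excluded directories.

```python
from pathlib import PurePosixPath


def build_excluded_dirs(dirs):
    """Normalize directory prefixes (hypothetical --exclude-dir values) into a set."""
    return {PurePosixPath(d) for d in dirs}


def is_excluded(path, excluded_dirs):
    """Return True if any ancestor directory of `path` is in the excluded set.

    Only whole path components are compared, so "services/api" does not
    exclude "services/api_v2/...".
    """
    p = PurePosixPath(path)
    return any(parent in excluded_dirs for parent in p.parents)


excluded = build_excluded_dirs(["services/api", "tools/"])
targets = [
    "services/api/handlers/main.py",
    "services/api_v2/main.py",
    "tools/lint.py",
    "README.md",
]
kept = [t for t in targets if not is_excluded(t, excluded)]
```

With ~850 excluded directories, each file still needs only a handful of set lookups, whereas glob-based filtering tests each file against the patterns themselves.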

Describe alternatives you've considered

Optimizing the regexp objects generated by wcmatch
Not having a large number of --exclude arguments 😄
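As a rough sketch of the first alternative, the exclude patterns can be merged into a single compiled regular expression so that each path is tested once instead of once per pattern. This example uses the standard library's `fnmatch.translate` rather than wcmatch, so its glob semantics differ from Semgrep's (for instance, `*` also matches `/` here); `compile_excludes` and `filter_paths` are hypothetical names. Whether this is actually faster would need profiling, since Python's `re` engine still tries the alternatives one by one.

```python
import fnmatch
import re


def compile_excludes(patterns):
    """Combine several glob patterns into one compiled regex (an alternation)."""
    combined = "|".join(f"(?:{fnmatch.translate(p)})" for p in patterns)
    return re.compile(combined)


def filter_paths(paths, patterns):
    """Return the paths that match none of the exclude patterns."""
    if not patterns:
        # An empty alternation would match every path, so short-circuit.
        return list(paths)
    rx = compile_excludes(patterns)
    return [p for p in paths if rx.match(p) is None]


remaining = filter_paths(
    ["a/b.py", "vendor/x.js", "src/app.min.js"],
    ["vendor/*", "*.min.js"],
)
```

The main saving in this sketch is avoiding a Python-level loop over hundreds of separate pattern objects for every file.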

Use case What will this feature enable for you?

The (somewhat niche) use case of being able to specify a large number of exclude directives.

Additional context

Relates to:

stale[bot] commented 1 year ago

This issue is being marked stale because there hasn't been any activity in 14 days and either it wasn't prioritized or its priority is high. Please apply the appropriate priority:* label before removing the stale label.

stale[bot] commented 1 year ago

Stale-bot has closed this stale item. Please reopen it if this is in error.

kiwiz commented 1 year ago

Could this be reopened (if it is deemed useful to discuss)?

kiwiz commented 1 year ago

@emjin :wave: Could this issue be reopened?

ievans commented 1 year ago

cc @aryx @mjambon perhaps this can be addressed in the osemgrep port

aryx commented 1 year ago

cc @mjambon. Maybe this is similar to the issue we are currently experiencing in osemgrep. I wonder whether improving Gitignore would also improve this use case.

mjambon commented 1 year ago

This is an interesting problem that is now critical, since the new semgrepignore mechanism in osemgrep no longer relies on the optimizations used by git. Presumably, git (git ls-files) relies on an index of all the files under version control to quickly produce a list, rather than consulting the whole file tree and gitignore filters. git status is also pretty fast despite having to scan the file tree for new files. This is consistent with the performance bottleneck being the checking of many files against many glob patterns. I haven't thought about this issue more deeply than this yet.

Possible families of solutions:

make glob pattern matching faster (although it seems unlikely to improve much)
use some form of caching between semgrep runs. The question is where to keep this cache such that it works for all users.
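One way the caching idea could look, purely as a sketch: key the filtered result on a digest of the patterns and the candidate file list, and store it as JSON. Everything here is an assumption for illustration (`cache_key`, `cached_filter`, the `compute` callback, and the `cache_dir` location); it does not answer the open question of where the cache should live, and it still requires listing the file tree in order to compute the key.

```python
import hashlib
import json
from pathlib import Path


def cache_key(patterns, paths):
    """Digest of the patterns and the candidate paths, independent of input order."""
    h = hashlib.sha256()
    for item in sorted(patterns):
        h.update(b"p\0" + item.encode() + b"\0")
    for item in sorted(paths):
        h.update(b"f\0" + item.encode() + b"\0")
    return h.hexdigest()


def cached_filter(patterns, paths, compute, cache_dir):
    """Return compute(patterns, paths), reusing a stored result when the key matches."""
    cache_file = Path(cache_dir) / f"{cache_key(patterns, paths)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = compute(patterns, paths)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result
```

Any change to the pattern set or the file list produces a different key, so stale entries are never returned, although they would accumulate on disk unless something cleans them up.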

mjambon commented 1 year ago

semgrep / semgrep

Faster include/exclude parsing #6556