semgrep / semgrep

Lightweight static analysis for many languages. Find bug variants with patterns that look like source code.
https://semgrep.dev
GNU Lesser General Public License v2.1
10.65k stars 624 forks source link

Semgrep tries to pull registry when `--validate` is on #4620

Open khanhldt opened 2 years ago

khanhldt commented 2 years ago

Describe the bug Trying to set up a pre-commit hook like below:

hooks:
-   repo: https://github.com/returntocorp/semgrep
    rev: 'v0.80.0'
    -   id: semgrep
        name: Validate Semgrep rule files
        args: [
            '--validate',
            '--metrics', 'off',
            '--config', 'PATH/TO/CONFIG',
            '--disable-version-check'
        ]

It failed to run in our CI because Semgrep attempted to pull from registry even though the documentation said it should not (and our CI blocks network call).

So i tried to remove --validate flag and it looks like it doesn't try to pull from registry anymore. Also, looking at the change log, i believe this was changed from 0.71.0 (here).

To Reproduce

Get semgrep version 0.80.0 (i believe any version after 0.70.0 has this issue) and run this command:

semgrep --validate --metrics off --config=path/to/config --disable-version-check --debug

You will see that semgrep tries to pull from the registry.

Expected behavior

There must be a way to configure Semgrep to not reach out to network while running. Either by the existing flags or a new flag.

Screenshots

(let me know if you need)

What is the priority of the bug to you?

Environment

local buid

Use case

Running Semgrep in network-restricted environment like CI.

r2c-demo commented 2 years ago

This issue is synced in Linear at https://linear.app/r2c/issue/PA-823/semgrep-tries-to-pull-registry-when-validate-is-on.

IagoAbal commented 2 years ago

cc @brendongo @underyx

brendongo commented 2 years ago

semgrep pulls from registry to pull the extra validation rules https://github.com/returntocorp/semgrep/blob/develop/semgrep/semgrep/core_runner.py#L454

brendongo commented 2 years ago

Maybe we need to update documentation. @khanhldt where did you find that --validate says it doesn't pull rules from registry?

khanhldt commented 2 years ago

I think I read it in a couple of places, but one is from the semgrep --help saying This will check YAML files for errors and run 'p/semgrep-rule-lints' on the YAML files. No search is performed.

Another thing is that I've been using version 0.65.0 in our CI and its been working fine. With this change in newer versions, I'm not able to get Semgrep to run in our CI anymore.

How can I turn off network calls from Semgrep?

ievans commented 2 years ago

As far as I understand the use case, I suspect it's semi-related to https://github.com/returntocorp/semgrep/issues/3147

E.g., if we bundled p/semgrep-rule-lints with release, --validate would be robust to network requirements.

@khanhldt is this blocking you? can you remove the --validate step for now?

khanhldt commented 2 years ago

Not really blocking as I switched to use version 0.70.0.

I'm using this hook to validate my Semgrep rules so --validate is necessary.

brendongo commented 2 years ago

@emjin was there a particular reason we went with not embedding the rule lints or is it fine to proceed with embedding them per release?

emjin commented 2 years ago

Embedding them should be fine

ievans commented 2 years ago

Also, duplicate of https://github.com/returntocorp/semgrep/issues/4454

khanhldt commented 2 years ago

There are few new features in the new versions that I'd like to use. Would be great if we can have a quick way of unblocking this.

yeongzhiwei commented 1 year ago

Checking in to see if there's any update on this. Would love to --validate in network-less environment

fopina commented 4 months ago

💯 ability to run validate without network (or with pre-downloaded version of semgrep-rule-lints file) strongly desired.

Which validations are done in --validate that are not done by simply running a search with the rule / loading the rule?

Is https://github.com/semgrep/semgrep/blob/6fbd7aebf45410a4d3d8167ab63b9d3e8e4c0989/src/osemgrep/cli_scan/Validate_subcommand.ml#L11-L22 still accurate?

If so, which are NOT done by a search command? First one (YAML parser) for sure is. And I believe schema validation as well.
But is that semgrep-core ... --check_rules also called?

emjin commented 4 months ago

The reason why --validate pulls rules from the registry is because we have wanted to be able to dynamically update the validation rules without waiting for a release. At the moment, we don't plan to change that.

fopina commented 4 months ago

@emjin and that makes sense to me, just there should be an option not to, otherwise we cannot use validate in a container/VM without public internet access (common in many CIs)...

But if:

semgrep --validate -f my-rule.yaml

Runs the same checks done by:

semgrep -f my-rules.yaml bogys_file.txt
semgrep -f p/semgrep-rule-lints my-rules.yaml

I'm happy to not use --validate at all, as the validation (except lints) would be done already by the test scans performed with the rules

But I'm unable to be sure of that by looking at the code, could you confirm?

Thanks

emjin commented 4 months ago

There is one check that --validate runs which semgrep -f p/semgrep-rule-lints would not (it checks for metavariables that referenced out of scope). However, we have had to relax that check as we changed how metavariables behave, so at this point I think the p/semgrep-rule-lints run is the real value.

fopina commented 4 months ago

Awesome, thank you @emjin !

Is there any way to call that check alone at the moment to cover future versions? Is it sempre-core --check-rules?