use form of rule schema to deny broken rules

It is possible to use Semgrep's own validation: https://semgrep.dev/docs/writing-rules/testing-rules#validating-rules

semgrep scan --metrics=off --validate --config /app/rules --json -o broken_rules.json

Broken rules would result in the following broken_rules.json:

{
  "errors": [
    {
      "code": 4,
      "level": "error",
      "long_msg": "One of these properties is missing: 'languages'",
      "short_msg": "Invalid rule schema",
      "spans": [
        {
          "end": {
            "col": 1,
            "line": 27,
            "offset": -1
          },
          "file": "rules/deny-default-namespace.yaml",
          "source_hash": "c16ac57d9db7bb7c762e3775cf1982c20eb2161542acf11b32b04edc26730dea",
          "start": {
            "col": 3,
            "line": 2,
            "offset": -1
          }
        }
      ],
      "type": "InvalidRuleSchemaError"
    },
    {
      "code": 2,
      "level": "error",
      "message": "Semgrep match found at line ./rules/deny-default-namespace.yaml:2:\n Please include a 'languages' field for your rule $RULEID!",
      "path": "./rules/deny-default-namespace.yaml",
      "type": "Semgrep match found"
    },
    {
      "code": 2,
      "level": "error",
      "message": "Rule parse error in rule restrict-image-registry:\n Missing required field regex",
      "rule_id": "restrict-image-registry",
      "type": "Rule parse error"
    }
  ],
  "paths": {
    "scanned": []
  },
  "results": [],
  "skipped_rules": [],
  "version": "1.68.0"
}

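When several rules are scanned at once, there is no reliable way to map every error back to the file of the broken rule: in the output above, the schema error carries a file inside spans and the lint match carries a path, but the rule parse error reports only a rule_id.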
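To make the inconsistency concrete, here is a minimal sketch that collects whatever location hints each error entry offers. The helper name locate_errors and the trimmed sample report are illustrative assumptions; only the field names (spans, file, path, rule_id, type) are taken from the output above.

```python
def locate_errors(report: dict) -> list[dict]:
    """Collect the location hints (file paths, rule id) of each error entry."""
    located = []
    for err in report.get("errors", []):
        # Schema errors list the file inside each entry of "spans".
        files = [span["file"] for span in err.get("spans", []) if "file" in span]
        # Lint-style matches carry a top-level "path" instead.
        if "path" in err:
            files.append(err["path"])
        located.append(
            {
                "type": err.get("type"),
                "files": files,
                # Rule parse errors only name the rule, not its file.
                "rule_id": err.get("rule_id"),
            }
        )
    return located


# Trimmed copy of the broken_rules.json shown above (hypothetical helper input).
sample_report = {
    "errors": [
        {
            "type": "InvalidRuleSchemaError",
            "spans": [{"file": "rules/deny-default-namespace.yaml"}],
        },
        {
            "type": "Semgrep match found",
            "path": "./rules/deny-default-namespace.yaml",
        },
        {
            "type": "Rule parse error",
            "rule_id": "restrict-image-registry",
        },
    ]
}

located = locate_errors(sample_report)
for entry in located:
    print(entry)
```

For the third entry the list of files stays empty, so the rule file would have to be found by other means, for example by searching the rule files for the reported rule_id.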

In summary, a few points need to be considered:

Broken rule handling: if a rule fails validation, should all changes be dropped or only the failing rule?
Local validation: validation can only run on rules that have been written to files (see updater).
Update race conditions: writing to files in /app/rules/ and validating afterwards might create race conditions in which some deployments fail. A 3-step process might be required: write the files to /app/dummy-rules, run validation, then copy them to /app/rules (including removal of old rules).
Performance: running Semgrep validation on all rules at once might make it impossible to identify the failing rule, as noted above. Running it on each rule separately might put load on Semgrep and cause deployments to fail due to race conditions. One solution might be to identify diffs (rule changes) and validate only those.
Remote availability: according to the docs, the rules used for validation are pulled from p/semgrep-rule-lints (running those directly, as in semgrep scan --config p/semgrep-rule-lints rules, does NOT yield the same result). Validation fails if the Semgrep API is unavailable. Air-gapped systems cannot perform validation, which might create the need to make validation configurable.
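The 3-step process from the race-condition point could look roughly like the sketch below. It is not the updater's actual code: the names semgrep_validate and promote_rules are hypothetical, the staging directory is created next to the live one instead of at /app/dummy-rules, and it assumes that the validation command exits with a non-zero code when a rule is broken. The validator is passed in as a parameter so that validation can be replaced or switched off, for example on air-gapped systems.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def semgrep_validate(rules_dir: Path) -> bool:
    """Run the validation command from this issue on a directory.

    Assumption: a non-zero exit code means at least one rule is broken.
    """
    proc = subprocess.run(
        ["semgrep", "scan", "--metrics=off", "--validate", "--config", str(rules_dir)],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0


def promote_rules(
    rules: dict[str, str],
    live_dir: Path,
    validate: Callable[[Path], bool] = semgrep_validate,
) -> bool:
    """Write rules to a staging dir, validate them, then swap them into live_dir.

    Returns False and leaves live_dir untouched if validation fails.
    """
    live_dir = Path(live_dir)
    # Stage next to the live directory so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix="rules-staging-", dir=live_dir.parent))
    staging.chmod(0o755)  # mkdtemp creates the directory with mode 0700
    try:
        # Step 1: write all rules to the staging directory.
        for name, content in rules.items():
            (staging / name).write_text(content)
        # Step 2: validate the staged rules; drop all changes on failure.
        if not validate(staging):
            return False
        # Step 3: swap directories. Two renames keep the gap short, but the
        # swap is not atomic.
        old = live_dir.with_name(live_dir.name + ".old")
        shutil.rmtree(old, ignore_errors=True)
        if live_dir.exists():
            live_dir.rename(old)
        staging.rename(live_dir)
        shutil.rmtree(old, ignore_errors=True)
        return True
    finally:
        # Remove the staging directory if it was not promoted.
        shutil.rmtree(staging, ignore_errors=True)
```

A call such as promote_rules({"my-rule.yaml": rule_text}, Path("/app/rules")) would implement the "drop all changes" variant of broken rule handling; dropping only the failing rule would require validating rules individually, with the performance cost described above.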

semgr8ns / semgr8s

use form of rule schema to deny broken rules #120