[KF 1.0 Compliance] Vulnerability Scanning

Bobgy commented 4 years ago

Part of https://github.com/kubeflow/pipelines/issues/2884

Docker images must be scanned for vulnerabilities and known vulnerabilities published

@jlewi Do you know how other images share vulnerability issues?

I did a quick investigation: gcr.io provides vulnerability scanning, but the results are not visible to external visitors even if the image is public.

We can export the generated YAML report with a command like:

gcloud beta container images describe --show-package-vulnerability gcr.io/ml-pipeline/api-server:1.0.0-test-5

Documented in https://cloud.google.com/container-registry/docs/get-image-vulnerabilities

Do you think that's good enough?

Bobgy commented 4 years ago

@jbottum Do you have any ideas about this?

jlewi commented 4 years ago

kubeflow/kubeflow#3907 is tracking how we publish a list of vulnerabilities in our images.

A related issue is minimizing vulnerabilities e.g. by using distroless images. There is documentation at https://github.com/krishnadurai/community/blob/b1669588d785455a1e4e4cab456e03c08a05af7c/guidelines/creating_dockerfiles.md

Note that the use of distroless images is recommended, not required.

kubeflow/kubeflow#4590 is a related issue about promoting the use of distroless in Kubeflow to minimize vulnerabilities.

To satisfy the vulnerability scanning requirement I think you just need to turn on vulnerability scanning in whatever GCR registry you are hosting your images in.

You might want to repurpose this issue or file a new one for reducing vulnerabilities if relevant.

Bobgy commented 4 years ago

@jlewi As reported in kubeflow/kubeflow#3907, if we enable GCR vulnerability scanning, the results are not visible to external viewers. So in addition to that, we'd still need to dump a YAML report for each KFP release. Does that sound reasonable?
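
A minimal sketch of what dumping one report per image for a release could look like, assuming the gcloud command quoted earlier in this thread. The helper names `describe_command` and `dump_reports`, the output file naming, and the injectable `run` parameter are illustrative assumptions, not existing KFP tooling.

```python
import subprocess
from pathlib import Path


def describe_command(image, tag):
    """Build the gcloud invocation quoted earlier in this thread."""
    return [
        "gcloud", "beta", "container", "images", "describe",
        "--show-package-vulnerability", f"{image}:{tag}",
    ]


def dump_reports(images, tag, out_dir, run=subprocess.run):
    """Write one report file per image and return the paths written.

    `run` is injectable so the function can be exercised without gcloud.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image in images:
        result = run(
            describe_command(image, tag),
            capture_output=True, text=True, check=True,
        )
        # Hypothetical naming scheme:
        # gcr.io/ml-pipeline/api-server -> gcr.io_ml-pipeline_api-server.yaml
        path = out / (image.replace("/", "_") + ".yaml")
        path.write_text(result.stdout)
        written.append(path)
    return written
```

With the default `run`, calling `dump_reports(["gcr.io/ml-pipeline/api-server"], "1.0.0-test-5", "reports")` shells out to gcloud, so it needs an installed and authenticated gcloud CLI.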

Bobgy commented 4 years ago

Thanks for the link on reducing vulnerabilities. I'll create a separate issue about it.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bobgy commented 4 years ago

/lifecycle frozen

Bobgy commented 3 years ago

An example of fixing some vulnerability issues: https://github.com/kubeflow/pipelines/issues/4531

some related readings:

My takeaways:

It's impossible to fix all vulnerabilities, and some (or probably many) can actually be false positives, so we need to:
constantly update base images to get upstream fixes
if there is no upstream fix yet, review the vulnerability to see if it really matters to us, then act accordingly

Going forward, we should:

utilize distroless images as much as possible (because they have near-zero vulnerabilities from the base image)
when that is not feasible, constantly update the base image to get vulnerability fixes
act ad hoc on the remaining high/critical vulnerabilities that people care about

Bobgy commented 3 years ago

AIs:

[x] formalize a vulnerability management process
[ ] understand current image vulnerability status and triage urgent fixes
[ ] build the needed vulnerability scanning automation that flags High and Critical issues (p1) before release and sends vulnerability reports (p2) for each KFP release.

Bobgy commented 3 years ago

Requests to reduce vulnerabilities come more often than before, so I'm taking some time to continue this.

Bobgy commented 3 years ago

Formalize a vulnerability management process

I think the process should come with two parts:

1. Set up a process to update dependencies/base images more frequently. This is already being addressed in https://github.com/kubeflow/pipelines/issues/4682
2. Add an automated vulnerability policy check step to our CI/CD pipelines. In the pipeline, we'll unavoidably need to allowlist many CVEs (maybe even of high/critical level), because a fix may not have been released, the CVE may not be exploitable in the KFP use case, or the risk may be tolerable. We should add comments on this allowlist explaining the reasons, and mark some of the entries as TODOs.

I'll focus on 2. in this issue.
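
One way to keep the reasons and TODO markers next to each allowlisted CVE is to store them as data and validate them in CI. This is only a sketch: the `AllowlistEntry` type, the `validate` and `todos` helpers, and the reasons shown are hypothetical (the CVE IDs are borrowed from the Kritis policy sample in this thread purely for illustration).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllowlistEntry:
    cve: str
    reason: str         # why the CVE is tolerated
    todo: bool = False  # True when the entry should be revisited later


# Hypothetical entries; the reasons are invented for illustration.
ALLOWLIST = [
    AllowlistEntry("CVE-2020-10543", "no upstream fix released yet", todo=True),
    AllowlistEntry("CVE-2020-10878", "not exploitable in the KFP use case"),
]


def validate(entries):
    """Return a list of problems: missing reasons or duplicate CVE IDs."""
    problems = []
    seen = set()
    for entry in entries:
        if not entry.reason.strip():
            problems.append(f"{entry.cve}: missing reason")
        if entry.cve in seen:
            problems.append(f"{entry.cve}: duplicate entry")
        seen.add(entry.cve)
    return problems


def todos(entries):
    """Return the CVE IDs that are marked for follow-up."""
    return [entry.cve for entry in entries if entry.todo]
```

A CI step could fail when `validate(ALLOWLIST)` is non-empty, so nobody can add an entry without recording why.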

Bobgy commented 3 years ago

Research of tools suitable for this need:

Google Cloud provides vulnerability scanning in its Container Analysis service, but it only provides the information we need; it lacks the tooling required to integrate into a CI/CD pipeline. https://cloud.google.com/container-analysis/docs/vulnerability-scanning

Kritis is a nice tool built by GCP: https://cloud.google.com/binary-authorization/docs/creating-attestations-kritis#check-only. It supports vulnerability policies like the following and integrates with data from Google Container Analysis:

apiVersion: kritis.grafeas.io/v1beta1
kind: VulnzSigningPolicy
metadata:
  name: my-vsp
spec:
  imageVulnerabilityRequirements:
    maximumFixableSeverity: MEDIUM
    maximumUnfixableSeverity: MEDIUM
    allowlistCVEs:
    - projects/goog-vulnz/notes/CVE-2020-10543
    - projects/goog-vulnz/notes/CVE-2020-10878
    - projects/goog-vulnz/notes/CVE-2020-14155

Using the two together seems to meet our basic needs.
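
To make the policy semantics concrete, here is a rough Python sketch of how such a policy could be evaluated. It is not Kritis's implementation: the shape of each finding (`cve`, `severity`, `fixable`) is an assumption, and only the four named severity levels are handled.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Mirrors spec.imageVulnerabilityRequirements from the policy above.
POLICY = {
    "maximumFixableSeverity": "MEDIUM",
    "maximumUnfixableSeverity": "MEDIUM",
    "allowlistCVEs": [
        "projects/goog-vulnz/notes/CVE-2020-10543",
        "projects/goog-vulnz/notes/CVE-2020-10878",
        "projects/goog-vulnz/notes/CVE-2020-14155",
    ],
}


def exceeds(severity, maximum):
    """True when `severity` is strictly above the allowed `maximum`."""
    return SEVERITY_ORDER.index(severity) > SEVERITY_ORDER.index(maximum)


def violations(findings, policy):
    """Return the CVE IDs that break the policy.

    Each finding is assumed to be a dict with the keys
    "cve", "severity" and "fixable" (a hypothetical shape).
    """
    # Allowlist entries are note paths; compare on the trailing CVE ID.
    allowed = {note.rsplit("/", 1)[-1] for note in policy["allowlistCVEs"]}
    result = []
    for finding in findings:
        if finding["cve"] in allowed:
            continue
        if finding["fixable"]:
            limit = policy["maximumFixableSeverity"]
        else:
            limit = policy["maximumUnfixableSeverity"]
        if exceeds(finding["severity"], limit):
            result.append(finding["cve"])
    return result
```

An image would pass the check when `violations(...)` returns an empty list.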

Bobgy commented 3 years ago

There seem to be similar open source tools like https://github.com/arminc/clair-scanner, but they require running your own vulnerability server. It's more convenient to use the GCP container analysis service directly.

Bobgy commented 3 years ago

A bit more research led me to https://github.com/aquasecurity/trivy. It seems to be the leading open source option. There are some extra nice features:

a local CLI for exploration -- it can group CVEs by library type:

$ trivy image knqyf263/vuln-image:1.2.3
2019-05-16T12:59:03.150+0900    INFO    Detecting Alpine vulnerabilities...
2019-05-16T12:59:04.941+0900    INFO    Detecting bundler vulnerabilities...
2019-05-16T12:59:05.967+0900    INFO    Detecting cargo vulnerabilities...
2019-05-16T12:59:07.834+0900    INFO    Detecting composer vulnerabilities...
2019-05-16T12:59:10.285+0900    INFO    Detecting npm vulnerabilities...
2019-05-16T12:59:11.487+0900    INFO    Detecting pipenv vulnerabilities...

knqyf263/vuln-image:1.2.3 (alpine 3.7.1)
========================================
Total: 26 (UNKNOWN: 0, LOW: 3, MEDIUM: 16, HIGH: 5, CRITICAL: 2)

+---------+------------------+----------+-------------------+---------------+----------------------------------+
| LIBRARY | VULNERABILITY ID | SEVERITY | INSTALLED VERSION | FIXED VERSION |              TITLE               |
+---------+------------------+----------+-------------------+---------------+----------------------------------+
| curl    | CVE-2018-14618   | CRITICAL | 7.61.0-r0         | 7.61.1-r0     | curl: NTLM password overflow     |
|         |                  |          |                   |               | via integer overflow             |
+         +------------------+----------+                   +---------------+----------------------------------+
|         | CVE-2018-16839   | HIGH     |                   | 7.61.1-r1     | curl: Integer overflow leading   |
|         |                  |          |                   |               | to heap-based buffer overflow in |
|         |                  |          |                   |               | Curl_sasl_create_plain_message() |
+         +------------------+          +                   +---------------+----------------------------------+
|         | CVE-2019-3822    |          |                   | 7.61.1-r2     | curl: NTLMv2 type-3 header       |
|         |                  |          |                   |               | stack buffer overflow            |
+         +------------------+          +                   +---------------+----------------------------------+
|         | CVE-2018-16840   |          |                   | 7.61.1-r1     | curl: Use-after-free when        |
|         |                  |          |                   |               | closing "easy" handle in         |
|         |                  |          |                   |               | Curl_close()                     |
+         +------------------+----------+                   +               +----------------------------------+
|         | CVE-2018-16842   | MEDIUM   |                   |               | curl: Heap-based buffer          |
|         |                  |          |                   |               | over-read in the curl tool       |
|         |                  |          |                   |               | warning formatting               |
+         +------------------+          +                   +---------------+----------------------------------+
|         | CVE-2018-16890   |          |                   | 7.61.1-r2     | curl: NTLM type-2 heap           |
|         |                  |          |                   |               | out-of-bounds buffer read        |
+         +------------------+          +                   +               +----------------------------------+
|         | CVE-2019-3823    |          |                   |               | curl: SMTP end-of-response       |
|         |                  |          |                   |               | out-of-bounds read               |
+---------+------------------+----------+-------------------+---------------+----------------------------------+
| git     | CVE-2018-17456   | HIGH     | 2.15.2-r0         | 2.15.3-r0     | git: arbitrary code execution    |
|         |                  |          |                   |               | via .gitmodules                  |
+         +------------------+          +                   +               +----------------------------------+
|         | CVE-2018-19486   |          |                   |               | git: Improper handling of        |
|         |                  |          |                   |               | PATH allows for commands to be   |
|         |                  |          |                   |               | executed from...                 |
+---------+------------------+----------+-------------------+---------------+----------------------------------+
...

there are existing github actions that use trivy: https://github.com/Azure/container-scan

Bobgy commented 3 years ago

For reference, vulnerability vector description: https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator

Bobgy commented 3 years ago

An experimental feature of Trivy is using user-defined Open Policy Agent (OPA) policies as the checker for vulnerabilities. It can be used to filter based on the vulnerability vector. Examples include:

ignore all vulnerabilities that cannot be exploited via network
ignore those that cannot be exploited with root permission
...

So it can reduce the number of vulnerabilities we need to check, based on our specific environment requirements.
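
As an illustration of vector-based filtering, the sketch below parses a CVSS v3 vector string and keeps only vulnerabilities with a network attack vector (`AV:N`). It is a plain Python analogue, not a Trivy or OPA policy, and the `vector` field on each vulnerability is an assumed input shape.

```python
def parse_cvss_vector(vector):
    """Split a CVSS v3 vector such as 'CVSS:3.1/AV:N/AC:L/...' into a dict."""
    metrics = {}
    for part in vector.split("/"):
        key, _, value = part.partition(":")
        metrics[key] = value
    return metrics


def network_exploitable(vector):
    """True when the attack vector metric is Network (AV:N)."""
    return parse_cvss_vector(vector).get("AV") == "N"


def needs_review(vulns):
    """Keep only vulnerabilities that can be exploited over the network.

    Each item is assumed to be a dict with a "vector" key (hypothetical shape).
    """
    return [v for v in vulns if network_exploitable(v["vector"])]
```

The same parsed dict exposes the other base metrics (for example `PR` for privileges required), so further rules can be added in the same style.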

References:

Bobgy commented 3 years ago

EDIT: what's described below doesn't work well, because the result of gcloud beta container images describe --show-package-vulnerability gcr.io/ml-pipeline/api-server:1.0.0-test-5 --format=json does not provide information on vulnerability vector.

Open Policy Agent is in fact a generic tool:

inputs: "JSON" and "Policy" output: "pass?"

So we could just use it with GCR vulnerability scanning to get both flexibility and a GCP-managed service.

==

or alternatively we can just write a script to check the vulnerability JSON as our own policy.

Bobgy commented 3 years ago

Analysis of Options

Trivy

Onboarding cost: low (download a binary and run it)
Vulnerability DB confidence: unknown (it's a third party maintained DB, although it claims its sources are the common ones like NVD etc)
Configuration flexibility: high (especially with OPA)
Momentum: high (6k stars, 18 PRs merged and 12 issues closed last month -- at time of evaluation)

Kritis

Onboarding cost: low (there're official docs for using it in Cloud Build, it's a container)
Vulnerability DB confidence: very high (it uses GCP image scanning)
Configuration flexibility: medium (allowlist + filter by [fixable, severity])
Momentum: low (the repo has had no recent activity)

Other options look obviously worse than the two, so I'm leaving them out.

Note that OPA seems to have a learning curve because there's a new language to learn, so I'd prefer we stay away from it initially. Without OPA, Trivy's major advantage does not apply to us.

I think we can start with Kritis. If it works as is, we can defer further customization until we really need it. If we discover blocking bugs, we can revisit Trivy as a backup plan.

shawnzhu commented 3 years ago

I'm interested in this issue. Speaking of Trivy, it supports filtering vulnerabilities with a number of options besides OPA:

--severity - https://github.com/aquasecurity/trivy#filter-the-vulnerabilities-by-severities
.trivyignore (ignore specific vulnerabilities) - https://github.com/aquasecurity/trivy#ignore-the-specified-vulnerabilities
--skip-files - https://github.com/aquasecurity/trivy#skip-traversal-of-the-specific-files
--skip-dirs - https://github.com/aquasecurity/trivy#skip-traversal-in-the-specific-directory

The lack of activity on Kritis might be a problem, but I'm willing to give it a try since I haven't used it before.

Bobgy commented 3 years ago

@shawnzhu You are right.

I didn't make it clear that my main reason for preferring Kritis is that it uses GCP container scanning as its data source (in fact, it reads GCP container scanning results directly, so you cannot use it outside GCP).

Bobgy commented 3 years ago

Some notes after experimenting with Kritis:

Although the official sample is in Cloud Build, I found it much faster, in terms of development speed, to write a KFP pipeline that runs vulnerability checks using Kritis.
Kritis does not output structured information for vulnerability check results; we can only look at its logs, like:

E0201 01:43:02.099893 1 main.go:211] found fixable CVE <redacted> in gcr.io/<redacted>, which has severity HIGH exceeding max fixable severity MEDIUM
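
Since the logs are the only output, one workaround is to recover structure with a regular expression. The sketch below is based solely on the single log line shown above; the pattern, the `parse_violation` helper, and the sample CVE ID and image name are assumptions, and other message shapes (for example for unfixable CVEs) may differ.

```python
import re

# Pattern derived from the one log line quoted above; the real format may vary.
VIOLATION = re.compile(
    r"found (?P<kind>\w+) CVE (?P<cve>\S+) in (?P<image>[^\s,]+), "
    r"which has severity (?P<severity>\w+) "
    r"exceeding max \w+ severity (?P<maximum>\w+)"
)


def parse_violation(line):
    """Return a dict describing the violation, or None if the line does not match."""
    match = VIOLATION.search(line)
    return match.groupdict() if match else None


# Hypothetical sample line with placeholder CVE ID and image name.
SAMPLE = (
    "E0201 01:43:02.099893 1 main.go:211] found fixable CVE CVE-2020-0000 "
    "in gcr.io/example/image, which has severity HIGH "
    "exceeding max fixable severity MEDIUM"
)
```

Parsing `SAMPLE` yields the kind, CVE ID, image, severity, and configured maximum as separate fields, which could then be collected into a per-release report.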

Bobgy commented 3 years ago

I built a KFP pipeline that runs Kritis: https://github.com/kubeflow/pipelines/pull/5066. For now, this is a one-off pipeline I use to verify existing released images.

P1 The next step would be to maintain a long-running KFP test cluster and run that pipeline as one of the post-submit tests.

davidspek commented 3 years ago

There seems to be similar open source tools like https://github.com/arminc/clair-scanner, but it requires running your own vulnerability server. It's more convenient to use GCP container analysis service directly.

@Bobgy I think this is a better link: https://github.com/quay/clair. Clair is what Amazon ECR uses: https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-scanning.html.

kubeflow / pipelines

[KF 1.0 Compliance] Vulnerability Scanning #3857