goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.

ScanAll not triggering for all images #18455

Closed dioguerra closed 12 months ago

dioguerra commented 1 year ago

Edit: Harbor version v2.5.2. I know, I know, I should update, but please read on...

So, I have a daily ScanAll job activated, which does not seem to be scanning all objects. I investigated the source code on my own but could not reach a conclusion. First and foremost, for which projects does the scanner trigger scan jobs?

  1. ScanAll command scans only artifacts on projects that have the Automatically scan images on push option enabled?
  2. ScanAll command scans all artifacts independent of option selected by user?

Personally, I expect option 1. The documentation is not explicit about this: https://goharbor.io/docs/2.7.0/administration/vulnerability-scanning/scan-all-artifacts/

A while back, our production image CVE scanner was not working (https://github.com/aquasecurity/trivy/issues/3894, fixed now). While newly pushed images are being actively scanned, the daily scanner does not pick up the images with a Vulnerabilities error state: [image]

This happens even with Automatically scan images on push enabled. In this case, it is a proxy-cache repository.

So, if we check Interrogation Service -> Vulnerabilities, I see ~2000 images scanned. [image]

Now, for the fun part. I went through the source code, starting with the scan_all.createOrUpdateScanAllSchedule call, and arrived somewhere around:

From what I managed to find out, there doesn't seem to be any filter for a specific artifact type, which I find quite suspicious. I would expect the scan to use the manifest artifact, and then the scanner itself (trivy) would pull all the dependent layers for scanning. The vulnerabilities would then be aggregated and the vulnerability report would match the manifest only: [image]

Here is an example of a query where one image is scanned and other is not:

SELECT a.*, sr.*
    FROM public.artifact AS a
    LEFT JOIN public.scan_report AS sr
        ON a.digest = sr.digest
    WHERE a.digest IN ('sha256:b78baa730828d2f64e2d6a41f0314124147c3405c3890a444529bab6b1cebb6c',
                       'sha256:9f8141df387e60ae7774e4dbaca0fbc1e6149b333a14688ee19027e499a92cc5')
    LIMIT 10;
"id"    "project_id"    "repository_name"   "digest"    "type"  "pull_time" "push_time" "repository_id" "media_type"    "manifest_media_type"   "size"  "extra_attrs"   "annotations"   "icon"  "id-2"  "uuid"  "digest-2"  "registration_uuid" "mime_type" "report"
23868   9   "grafana/grafana-image-renderer"    "sha256:9f8141df387e60ae7774e4dbaca0fbc1e6149b333a14688ee19027e499a92cc5"   "IMAGE" "2023-03-30 16:13:50.039272"    "2022-07-27 13:30:30.28302" 797 "application/vnd.docker.container.image.v1+json"    "application/vnd.docker.distribution.manifest.v2+json"  283498100   "{""architecture"":""amd64"",""author"":""Grafana team \u003chello@grafana.com\u003e"",""config"":{""User"":""grafana"",""ExposedPorts"":{""8081/tcp"":{}},""Env"":[""PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"",""NODE_VERSION=14.20.0"",""YARN_VERSION=1.22.19"",""CHROME_BIN=/usr/bin/chromium-browser"",""PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true"",""GF_PATHS_HOME=/usr/src/app"",""NODE_ENV=production""],""Entrypoint"":[""dumb-init"",""--""],""Cmd"":[""node"",""build/app.js"",""server"",""--config=config.json""],""WorkingDir"":""/usr/src/app"",""Labels"":{""maintainer"":""Grafana team \u003chello@grafana.com\u003e""}},""created"":""2022-07-18T12:22:04.992730607Z"",""os"":""linux""}"          2398754 "86a5c3ff-791b-46b3-902f-75d301d25b44"  "sha256:9f8141df387e60ae7774e4dbaca0fbc1e6149b333a14688ee19027e499a92cc5"   "6809b473-11fb-11eb-93ee-e6a720a2df22"  "application/vnd.security.vulnerability.report; version=1.1"    
33089   9   "grafana/grafana-image-renderer"    "sha256:b78baa730828d2f64e2d6a41f0314124147c3405c3890a444529bab6b1cebb6c"   "IMAGE" "2023-03-30 16:41:38.392286"    "2022-09-16 15:31:52.982544"    797 "application/vnd.docker.container.image.v1+json"    "application/vnd.docker.distribution.manifest.v2+json"  300672683   "{""architecture"":""amd64"",""author"":""Grafana team \u003chello@grafana.com\u003e"",""config"":{""User"":""grafana"",""ExposedPorts"":{""8081/tcp"":{}},""Env"":[""PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"",""NODE_VERSION=16.17.0"",""YARN_VERSION=1.22.19"",""CHROME_BIN=/usr/bin/chromium-browser"",""PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true"",""GF_PATHS_HOME=/usr/src/app"",""NODE_ENV=production""],""Entrypoint"":[""dumb-init"",""--""],""Cmd"":[""node"",""build/app.js"",""server"",""--config=config.json""],""WorkingDir"":""/usr/src/app"",""Labels"":{""maintainer"":""Grafana team \u003chello@grafana.com\u003e""}},""created"":""2022-08-30T17:49:17.724490416Z"",""os"":""linux""}"          3527096 "81c7db46-b9a4-4e52-85d3-c57a25ae27d4"  "sha256:b78baa730828d2f64e2d6a41f0314124147c3405c3890a444529bab6b1cebb6c"   "6809b473-11fb-11eb-93ee-e6a720a2df22"  "application/vnd.security.vulnerability.report; version=1.1"    "{""generated_at"":""2023-03-30T16:41:50.418786335Z"",""scanner"":{""name"":""Trivy"",""vendor"":""Aqua Security"",""version"":""v0.37.2""},""severity"":""High"",""vulnerabilities"":[]}"

NOTE: If the image's first artifact is an index, it seems that the scanner reports are tied to the underlying manifests. So I infer that this should always be the case.

So I have two cases. Even though they are in the same repository and belong to the same image, the daily scan will not pick one of them up. I would like to know what kind of query is made here, so I can compare how many images would be scanned versus how many should be scanned. It seems that some filtering is applied, but I didn't manage to find it.

For reference, these are my total artifact counts:

SELECT count(*)
    FROM public.artifact AS a
    LEFT JOIN public.scan_report AS sr
        ON a.digest = sr.digest


SELECT count(*)
    FROM public.artifact AS a
    LEFT JOIN public.scan_report AS sr
        ON a.digest = sr.digest
    WHERE
--      a.media_type LIKE 'application/vnd.docker.container.image.v1+json'
        a.manifest_media_type LIKE 'application/vnd.docker.distribution.manifest.v2+json';

Note: the number of artifacts obtained with the commented-out condition is very similar.

Can you help?

zyyw commented 1 year ago

@dioguerra To your question:

First and foremost, for which projects does the scanner trigger scan jobs?

The answer is

2. ScanAll command scans all artifacts independent of option selected by user?

zyyw commented 1 year ago

To your question:

while new pushed images are being actively scanned, the daily scanner does not pick up the Images with a Vulnerabilities error state: Even with Automatically scan images on push enabled.

The daily scanner (scanAll) runs independently of Automatically scan images on push. If it doesn't pick up the images with vulnerabilities, there may be a reason:

  1. Has aquasecurity/trivy#3894 been fully addressed? Could you please scale the trivy-adapter down to 1 pod and see whether the issue still occurs?
  2. Please share the scan log if possible.

dioguerra commented 1 year ago

Yes. 1. was fixed by using a stateless .cache (I manually dropped the volumeMount from the StatefulSet).

For 2, I cannot do this: doing so and activating some sort of global scanning would start scheduling a lot of jobs, which would affect (and sometimes halt) normal registry operation. Our experience shows that any overload of the jobService runners affects registry image-serving operations. https://github.com/goharbor/harbor/issues/17607

Do you have any other recommendations?

Note: I still need to validate this code, which seems to schedule scan jobs for the retrieved artifacts. Maybe some filtering is happening here? https://github.com/goharbor/harbor/blob/f21b1481bb5ba3efb9e3c1dd8c4e704d9dcc44a1/src/controller/scan/base_controller.go#L387

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dioguerra commented 1 year ago

Still happening. The automatic daily scan job keeps scanning a random number of images. See below: [image]

dioguerra commented 1 year ago

I also noticed that scan_reports with no vulnerabilities have null reports (report_vulnerability_record is also empty):

SELECT id, uuid, digest, registration_uuid, mime_type, report
    FROM public.scan_report
    WHERE report->>'generated_at' LIKE '2023-06-30%';

This shows 112 entries, although most probably this also accounts for recently pushed images that were pushed to projects with the scan-on-push option enabled. If I limit this to the autoscan start time (midnight), the number drops to 39 reports ('2023-06-30T00%'). How is the Vulnerability Scan All report generated?

dioguerra commented 1 year ago

Continuing from https://github.com/goharbor/harbor/issues/18455#issue-1647983689

The scan is created with the executionID associated with ExecutionTriggerSchedule. Since the last time I posted here, my artifact count has gone up. So for each artifact a scan call is created.

SELECT count(id)
    FROM public.artifact;

Inside the scan call we do some things:

At this point the artifacts variable should hold all artifacts (multiple if referenced) that are not Accessories or ImageIndex.

Observation. Shouldn't this be:

supported := hasCapability(r, a)
if supported {
    // If the artifact is not supported, why do we add it to the list of artifacts?
    artifacts = append(artifacts, a)
    scannable = true
    return ar.ErrSkip // this artifact is supported by the scanner, skip walking its children
}

// What reason would require adding all the other artifacts that the scanner does not support? Just to record a `not supported`? We could add them from here.
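To make the question concrete, here is a minimal, self-contained Go sketch of the two collection strategies. It is not Harbor's code: the `Artifact` type, the `supportedTypes` map, the `collect` function, and the `application/x.example.unsupported` media type are hypothetical stand-ins, and the walk is simplified to skip image indexes and visit their children.

```go
package main

import "fmt"

// Artifact is a hypothetical stand-in for Harbor's artifact model.
type Artifact struct {
	Digest    string
	MediaType string // manifest media type
	IsIndex   bool
	Children  []*Artifact
}

// supportedTypes is an assumed list of media types the scanner can consume.
var supportedTypes = map[string]bool{
	"application/vnd.oci.image.manifest.v1+json":           true,
	"application/vnd.docker.distribution.manifest.v2+json": true,
}

// collect walks root and returns the artifacts that would be handed to the scanner.
// With onlySupported=false every non-index artifact is appended, supported or not.
// With onlySupported=true only supported artifacts are appended (the proposed change).
func collect(root *Artifact, onlySupported bool) (artifacts []*Artifact, scannable bool) {
	var walk func(a *Artifact)
	walk = func(a *Artifact) {
		if a.IsIndex {
			// Indexes are never appended; only their children are visited.
			for _, c := range a.Children {
				walk(c)
			}
			return
		}
		supported := supportedTypes[a.MediaType]
		if supported || !onlySupported {
			artifacts = append(artifacts, a)
		}
		if supported {
			scannable = true
		}
	}
	walk(root)
	return artifacts, scannable
}

func main() {
	index := &Artifact{
		Digest:  "sha256:index",
		IsIndex: true,
		Children: []*Artifact{
			{Digest: "sha256:amd64", MediaType: "application/vnd.oci.image.manifest.v1+json"},
			{Digest: "sha256:arm64", MediaType: "application/vnd.oci.image.manifest.v1+json"},
			{Digest: "sha256:other", MediaType: "application/x.example.unsupported"},
		},
	}
	all, _ := collect(index, false)
	filtered, ok := collect(index, true)
	fmt.Println(len(all), len(filtered), ok) // prints: 3 2 true
}
```

In this toy model the only difference between the two variants is whether unsupported artifacts reach the later stage, where they would presumably be counted as unsupported rather than submitted.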

So, image indexes are not supported by trivy. If we filter for the remaining artifacts we get:

SELECT count(id)
    FROM public.artifact
    WHERE manifest_media_type IN ('application/vnd.oci.image.manifest.v1+json', 'application/vnd.docker.distribution.manifest.v2+json');

dioguerra commented 1 year ago

I think this might also be causing the ScanAll report to be incorrectly aggregated. Does it make sense to exit here? Shouldn't all reports be aggregated?


for _, group := range groupReports {
    if len(group) != 0 {
        reports = append(reports, group...)
    }
    // else {
    //  // NOTE: If the artifact is OCI image, this happened when the artifact is not scanned,
    //  // but its children artifacts may scanned so return empty report
    //  return nil, nil
    // }
}
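The difference between exiting early and skipping empty groups can be sketched in isolation. This is only an illustration under assumptions: `Report`, `aggregateStrict`, and `aggregateLenient` are hypothetical names, not Harbor's types or functions.

```go
package main

import "fmt"

// Report is a hypothetical stand-in for a stored scan report.
type Report struct{ Digest string }

// aggregateStrict mirrors the early-return shape: a single empty group
// makes the whole aggregation return nothing.
func aggregateStrict(groups [][]Report) []Report {
	var reports []Report
	for _, group := range groups {
		if len(group) == 0 {
			return nil
		}
		reports = append(reports, group...)
	}
	return reports
}

// aggregateLenient mirrors the variant with the else branch commented out:
// empty groups are skipped and the remaining reports are kept.
func aggregateLenient(groups [][]Report) []Report {
	var reports []Report
	for _, group := range groups {
		if len(group) != 0 {
			reports = append(reports, group...)
		}
	}
	return reports
}

func main() {
	groups := [][]Report{
		{{Digest: "sha256:aaa"}},
		{}, // one artifact without any report yet
		{{Digest: "sha256:bbb"}},
	}
	fmt.Println(len(aggregateStrict(groups)), len(aggregateLenient(groups))) // prints: 0 2
}
```

Whether dropping everything on the first empty group is intended depends on what the caller does with a nil result, which this sketch does not model.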

dioguerra commented 1 year ago

I'm seeing some errors when trivy tries to pull images from private repositories when triggered with scan_all (manually).

It also seems that in version v2.7.x the job count is bigger (I'm not sure whether this is because trivy fails when the S3 data is not there).

Does this fail because of some sort of timeout?

dioguerra commented 1 year ago

It seems that this problem is fixed in version v2.7.x? I will need to actively scan the images to make sure.

This is how I'm validating:

  1. Create a new clone instance of the production database.
  2. Point a new Harbor staging instance (v2.7) at the clone database.
  3. Manually and automatically trigger SCAN_ALL (without the S3 data, trivy will fail to pull the images and return an error). See below:
    2023-08-17T00:21:56Z [DEBUG] [/pkg/scan/job.go:376]: registration:
    2023-08-17T00:21:56Z [INFO] [/pkg/scan/job.go:387]: {
    "uuid": "6809b473-11fb-11eb-93ee-e6a720a2df22",
    "name": "Trivy",
    "description": "The Trivy scanner adapter",
    "url": "http://something:8080",
    "disabled": false,
    "is_default": true,
    "health": "healthy",
    "auth": "",
    "access_credential": "[HIDDEN]",
    "skip_certVerify": false,
    "use_internal_addr": true,
    "adapter": "Trivy",
    "vendor": "Aqua Security",
    "version": "v0.40.0",
    "create_time": "2020-10-19T13:08:24.486231Z",
    "update_time": "2023-08-16T11:50:05.497478Z"
    }
    2023-08-17T00:21:56Z [DEBUG] [/pkg/scan/job.go:376]: scanRequest:
    2023-08-17T00:21:56Z [INFO] [/pkg/scan/job.go:387]: {
    "registry": {
    "url": "http://something:80",
    "authorization": "[HIDDEN]"
    },
    "artifact": {
    "namespace_id": 19,
    "repository": "dtomasgu/gg/server",
    "tag": "latest",
    "digest": "sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec",
    "mime_type": "application/vnd.docker.distribution.manifest.v2+json"
    }
    }
    2023-08-17T00:21:56Z [INFO] [/pkg/scan/job.go:167]: Report mime types: [application/vnd.security.vulnerability.report; version=1.1]
    2023-08-17T00:21:56Z [INFO] [/pkg/scan/job.go:224]: Get report for mime type: application/vnd.security.vulnerability.report; version=1.1
    2023-08-17T00:21:58Z [DEBUG] [/pkg/scan/job.go:237]: check scan report for mime application/vnd.security.vulnerability.report; version=1.1 at 2023/08/17 00:21:58
    2023-08-17T00:21:58Z [INFO] [/pkg/scan/job.go:245]: Report with mime type application/vnd.security.vulnerability.report; version=1.1 is not ready yet, retry after 5 seconds
    2023-08-17T00:22:03Z [DEBUG] [/pkg/scan/job.go:237]: check scan report for mime application/vnd.security.vulnerability.report; version=1.1 at 2023/08/17 00:22:03
    2023-08-17T00:22:03Z [INFO] [/pkg/scan/job.go:245]: Report with mime type application/vnd.security.vulnerability.report; version=1.1 is not ready yet, retry after 5 seconds
    2023-08-17T00:22:08Z [DEBUG] [/pkg/scan/job.go:237]: check scan report for mime application/vnd.security.vulnerability.report; version=1.1 at 2023/08/17 00:22:08
    2023-08-17T00:22:08Z [INFO] [/pkg/scan/job.go:245]: Report with mime type application/vnd.security.vulnerability.report; version=1.1 is not ready yet, retry after 5 seconds
    2023-08-17T00:22:13Z [DEBUG] [/pkg/scan/job.go:237]: check scan report for mime application/vnd.security.vulnerability.report; version=1.1 at 2023/08/17 00:22:13
    2023-08-17T00:22:13Z [ERROR] [/pkg/scan/job.go:294]: check scan report with mime type application/vnd.security.vulnerability.report; version=1.1: running trivy wrapper: running trivy: exit status 1: 2023-08-17T00:22:10.256Z INFO   Vulnerability scanning is enabled
    2023-08-17T00:22:10.347Z    FATAL  image scan error: scan error: unable to initialize a scanner: unable to initialize a docker scanner: 5 errors occurred:
    * unable to inspect the image (something:80/dtomasgu/gg/server@sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
    * unable to initialize Podman client: no podman socket found: stat podman/podman.sock: no such file or directory
    * containerd socket not found: /run/containerd/containerd.sock
    * GET http://something:80/v2/dtomasgu/gg/server/manifests/sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec: MANIFEST_UNKNOWN: manifest unknown; map[Name:dtomasgu/gg/server Revision:sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec]
    * GET http://something:80/v2/dtomasgu/gg/server/manifests/sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec: MANIFEST_UNKNOWN: manifest unknown; map[Name:dtomasgu/gg/server Revision:sha256:ff6e6d5e2235adf383e184584a2bf9e881015f7ac49c9844bf2d505a674951ec]

: general response handler: unexpected status code: 500, expected: 200

Here is some information on the scan jobs:
The database (which is not changing) contains 32255 artifacts:
SELECT count(*)
    FROM public.artifact

Of these artifacts, 28639 should be scannable by trivy:

SELECT count(*)
    FROM public.artifact AS a
    LEFT JOIN public.scan_report AS sr
        ON a.digest = sr.digest
    WHERE a.manifest_media_type IN ('application/vnd.oci.image.manifest.v1+json', 'application/vnd.docker.distribution.manifest.v2+json');

Below are the last SCAN_ALL executions triggered (manual and scheduled, in the following order):

SELECT id, vendor_type, vendor_id, status, status_message, trigger, extra_attrs, start_time, end_time, revision, update_time
    FROM public.execution
    WHERE vendor_type = 'SCAN_ALL'
    LIMIT 1000;

Harbor v2.7 - automatic scan, Harbor v2.7 - manual scan, Harbor v2.5 - automatic scan, Harbor v2.5 - automatic scan, Harbor v2.5 - automatic scan

7148930 "SCAN_ALL"  0   "Running"       "SCHEDULE"  "{""summary"":{""total_count"":24745,""submit_count"":18727,""conflict_count"":945,""precondition_count"":0,""unsupport_count"":5073,""unknow_count"":0}}"  "2023-08-17 00:00:01.962214"        60360   "2023-08-17 09:02:32"
7148901 "SCAN_ALL"  0   "Error"     "MANUAL"    "{""summary"":{""total_count"":24855,""submit_count"":5508,""conflict_count"":408,""precondition_count"":0,""unsupport_count"":18939,""unknow_count"":0}}"  "2023-08-16 13:44:17.284105"    "2023-08-16 17:13:56"   16613   "2023-08-16 17:13:56"
7148727 "SCAN_ALL"  0   "Error"     "SCHEDULE"  "{""summary"":{""total_count"":1996,""submit_count"":1898,""conflict_count"":35,""precondition_count"":0,""unsupport_count"":61,""unknow_count"":2}}"   "2023-08-10 00:00:01.754736"    "2023-08-10 00:58:43"   4153    "2023-08-10 00:58:43"
7144700 "SCAN_ALL"  0   "Error"     "SCHEDULE"  "{""summary"":{""total_count"":147,""submit_count"":129,""conflict_count"":0,""precondition_count"":0,""unsupport_count"":17,""unknow_count"":1}}"  "2023-08-09 00:00:03.366405"    "2023-08-09 00:02:22"   260 "2023-08-09 00:02:22"
7140768 "SCAN_ALL"  0   "Error"     "SCHEDULE"  "{""summary"":{""total_count"":1339,""submit_count"":1276,""conflict_count"":31,""precondition_count"":0,""unsupport_count"":32,""unknow_count"":0}}"   "2023-08-08 00:00:03.142852"    "2023-08-08 00:39:22"   2835    "2023-08-08 00:39:22"

As we can observe, total_count approaches the number of artifacts that trivy supports (~25k vs ~29k). However, I would expect total_count to equal the total number of artifacts and submit_count to equal the number of artifacts scanned by trivy.
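One way to sanity-check these rows is to parse the summary JSON from extra_attrs and compare total_count with the sum of the other counters. The sketch below is a standalone helper, not part of Harbor; the `Summary` struct and the function names are assumptions, and only the JSON field names come from the rows above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary mirrors the counters found in the execution's extra_attrs column.
type Summary struct {
	Total        int `json:"total_count"`
	Submit       int `json:"submit_count"`
	Conflict     int `json:"conflict_count"`
	Precondition int `json:"precondition_count"`
	Unsupport    int `json:"unsupport_count"`
	Unknow       int `json:"unknow_count"`
}

// extraAttrs is the outer JSON object that wraps the summary.
type extraAttrs struct {
	Summary Summary `json:"summary"`
}

// parseSummary decodes the extra_attrs JSON of a SCAN_ALL execution row.
func parseSummary(raw string) (Summary, error) {
	var attrs extraAttrs
	if err := json.Unmarshal([]byte(raw), &attrs); err != nil {
		return Summary{}, err
	}
	return attrs.Summary, nil
}

// Accounted returns the sum of every counter except total_count.
func (s Summary) Accounted() int {
	return s.Submit + s.Conflict + s.Precondition + s.Unsupport + s.Unknow
}

func main() {
	// extra_attrs of execution 7148930, copied from the table above.
	raw := `{"summary":{"total_count":24745,"submit_count":18727,"conflict_count":945,"precondition_count":0,"unsupport_count":5073,"unknow_count":0}}`
	s, err := parseSummary(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("total=%d accounted=%d submitted=%.1f%%\n",
		s.Total, s.Accounted(), 100*float64(s.Submit)/float64(s.Total))
}
```

For all five rows above, the five counters add up exactly to total_count, so every artifact the loop visited ends up in exactly one bucket. What varies wildly between runs is the split between submit_count and unsupport_count.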

I will try and test patch https://github.com/goharbor/harbor/pull/18931 and https://github.com/goharbor/harbor/pull/18943 to see the difference

dioguerra commented 1 year ago

total_count now seems much more consistent, but even with the patches the reported scans cover only a fraction of the total artifacts.

In the table below, the last 2 reports were run with core v2.7 plus the respective patches above.

7148968 "SCAN_ALL"  0   "Error"     "SCHEDULE"  "{""summary"":{""total_count"":24682,""submit_count"":2658,""conflict_count"":84,""precondition_count"":0,""unsupport_count"":21940,""unknow_count"":0}}"   "2023-08-18 00:00:02.970861"    "2023-08-18 00:55:58"   5864    "2023-08-18 00:55:59.049953"
7148942 "SCAN_ALL"  0   "Error"     "MANUAL"    "{""summary"":{""total_count"":24745,""submit_count"":2761,""conflict_count"":89,""precondition_count"":0,""unsupport_count"":21895,""unknow_count"":0}}"   "2023-08-17 16:29:48.349199"    "2023-08-17 17:28:37"   6106    "2023-08-17 17:28:37"
7148941 "SCAN_ALL"  0   "Error"     "MANUAL"    "{""summary"":{""total_count"":24745,""submit_count"":10284,""conflict_count"":110,""precondition_count"":0,""unsupport_count"":14351,""unknow_count"":0}}" "2023-08-17 09:13:46.026438"    "2023-08-17 13:14:01"   26974   "2023-08-17 13:14:01"
7148930 "SCAN_ALL"  0   "Error"     "SCHEDULE"  "{""summary"":{""total_count"":24745,""submit_count"":18727,""conflict_count"":945,""precondition_count"":0,""unsupport_count"":5073,""unknow_count"":0}}"  "2023-08-17 00:00:01.962214"    "2023-08-17 09:13:17"   61668   "2023-08-17 09:13:17"
7148901 "SCAN_ALL"  0   "Error"     "MANUAL"    "{""summary"":{""total_count"":24855,""submit_count"":5508,""conflict_count"":408,""precondition_count"":0,""unsupport_count"":18939,""unknow_count"":0}}"  "2023-08-16 13:44:17.284105"    "2023-08-16 17:13:56"   16613   "2023-08-16 17:13:56"

NOTE: I disabled the GC, but it seems that some artifacts still got dropped? (24745 -> 24682)

Still, this doesn't tell us anything about what is failing. Checking the tasks for the current execution ID shows the same number of tasks and reports as the scan_all dashboard results (as we had confirmed before).

SELECT count(*)
    FROM public.task
    WHERE execution_id = 7148968;

Is this an issue with submitting the task jobs from the SCAN_ALL artifact loop? Also, one last observation: the submit_count and the actual number of tasks assigned to the (latest) SCAN_ALL execution ID don't match:


You might say that this is because of the patches, but the same happens with the original, unpatched v2.7 version. For

7148901 "SCAN_ALL"  0   "Error"     "MANUAL"    "{""summary"":{""total_count"":24855,""submit_count"":5508,""conflict_count"":408,""precondition_count"":0,""unsupport_count"":18939,""unknow_count"":0}}"  "2023-08-16 13:44:17.284105"    "2023-08-16 17:13:56"   16613   "2023-08-16 17:13:56"


And for

7148930 "SCAN_ALL"  0   "Running"       "SCHEDULE"  "{""summary"":{""total_count"":24745,""submit_count"":18727,""conflict_count"":945,""precondition_count"":0,""unsupport_count"":5073,""unknow_count"":0}}"  "2023-08-17 00:00:01.962214"        60360   "2023-08-17 09:02:32"

which I stopped because the total number of scheduled tasks was not increasing.


dioguerra commented 1 year ago

My logs show 22k occurrences of this error https://github.com/goharbor/harbor/blob/f21b1481bb5ba3efb9e3c1dd8c4e704d9dcc44a1/src/controller/scan/base_controller.go#L392 for the latest scan: [image]

which seems about right. However, the full error is:

2023-08-18T00:55:59Z [ERROR] [/controller/scan/base_controller.go:391]: failed to scan artifact someartifact@sha, error the configured scanner Trivy does not support scanning artifact with mime type application/vnd.oci.image.manifest.v1+json

This is totally wrong. So what is the error exactly?

dioguerra commented 1 year ago

I think the error comes from here. It points to a context timeout.

2023-08-18T00:55:59Z [ERROR] [/controller/scanner/base_controller.go:299][error="v1 client: get metadata: Get "http://trivy:8080/api/v1/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"]: failed to ping scanner
2023-08-18T00:55:59Z [ERROR] [/controller/scanner/base_controller.go:265]: api controller: get project scanner: scanner controller: ping: v1 client: get metadata: Get "trivy:8080/api/v1/metadata": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

[299] points to https://github.com/goharbor/harbor/blob/main/src/controller/scanner/base_controller.go#L301
[265] points to https://github.com/goharbor/harbor/blob/main/src/controller/scanner/base_controller.go#L265C2-L265C30
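For context, the "(Client.Timeout exceeded while awaiting headers)" suffix is what Go's net/http reports when an `http.Client` with a `Timeout` set does not receive response headers in time. The sketch below reproduces that class of error against a deliberately slow local test server. It says nothing about how Harbor configures its scanner client: `pingMetadata`, `isTimeout`, and `slowServerPing` are hypothetical helpers, and the `/api/v1/metadata` path is only borrowed from the log lines above.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// pingMetadata issues a GET request with a client-side timeout.
func pingMetadata(url string, timeout time.Duration) error {
	client := &http.Client{Timeout: timeout}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
	}
	return nil
}

// isTimeout reports whether err is a network timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// slowServerPing starts a local server that waits delayMs before answering
// and pings it with a client timeout of timeoutMs.
func slowServerPing(delayMs, timeoutMs int) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	return pingMetadata(srv.URL+"/api/v1/metadata", time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	// The server answers after 300 ms, the client gives up after 50 ms.
	err := slowServerPing(300, 50)
	fmt.Println(isTimeout(err), err)
}
```

If the scanner adapter is busy and answers its metadata endpoint slower than the caller's timeout, a ping implemented along these lines fails even though the adapter is otherwise healthy.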

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

github-actions[bot] commented 12 months ago

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.