goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

GC fails after deleting many tags from same repository #15807

Open dkulchinsky opened 2 years ago

dkulchinsky commented 2 years ago

Expected behavior and actual behavior: We deleted over 20,000 tags from a repository (these tags are auto-generated by our periodic CI job to test the registry and CI). We expected a GC run to clean up the related blobs and manifests, but GC now fails consistently with the following error:

2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:236]: 46566 blobs and 23266 manifests eligible for deletion
2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:237]: The GC could free up 4094 MB space, the size is a rough estimation.
2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>/demo-go-app, application/vnd.docker.distribution.manifest.v2+json, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8
2021-10-18T13:22:10Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, <project>/<repo>/demo-go-app, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8, retry timeout: http status code: 500, body: {"errors":[{"code":"UNKNOWN","message":"unknown error","detail":{}}]}
2021-10-18T13:22:10Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: <project>/<repo>/demo-go-app, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8: retry timeout: http status code: 500, body: {"errors":[{"code":"UNKNOWN","message":"unknown error","detail":{}}]}

We caught the following logs in the registry.

DELETE request:

time="2021-10-18T12:59:19.582953425Z" level=info msg="authorized request" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=3c90b874-a989-41c1-b60f-e8c72c447002 http.request.method=DELETE http.request.remoteaddr="127.0.0.1:38784" http.request.uri="/v2/<project>/<repo>/demo-go-app/manifests/sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" http.request.useragent=harbor-registry-client vars.name="<project>/<repo>/demo-go-app" vars.reference="sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8"

and the 500 error ~23 minutes later:

time="2021-10-18T13:22:10.756233562Z" level=error msg="response completed with error" auth.user.name="harbor_registry_user" err.code=unknown err.message="invalid checksum digest format" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=3c90b874-a989-41c1-b60f-e8c72c447002 http.request.method=DELETE http.request.remoteaddr="127.0.0.1:38784" http.request.uri="/v2/<project>/<repo>/demo-go-app/manifests/sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" http.request.useragent=harbor-registry-client http.response.contenttype="application/json; charset=utf-8" http.response.duration=22m51.241153516s http.response.status=500 http.response.written=70 vars.name="<project>/<repo>/demo-go-app" vars.reference="sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" 

Steps to reproduce the problem:

  1. Use GCS for registry backend storage
  2. generate many (thousands?) of tags for the same repository (see the sketch after this list)
  3. delete all/most tags (we use a retention policy)
  4. run GC and observe the above errors
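
For step 2, one quick way to create thousands of tags is to re-PUT an existing manifest under new tag names via the registry v2 API instead of pushing full images. A rough sketch; the host, project and credentials are placeholders, and depending on your auth setup a bearer token from /service/token may be needed instead of basic auth:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

const (
	// Placeholders: adjust to your Harbor host, project/repo and credentials.
	base      = "https://harbor.example.com/v2/<project>/<repo>/demo-go-app/manifests/"
	mediaType = "application/vnd.docker.distribution.manifest.v2+json"
	user      = "robot$ci"
	pass      = "<secret>"
)

func main() {
	// Fetch the manifest of an existing tag once.
	get, _ := http.NewRequest(http.MethodGet, base+"latest", nil)
	get.Header.Set("Accept", mediaType)
	get.SetBasicAuth(user, pass)
	resp, err := http.DefaultClient.Do(get)
	if err != nil {
		log.Fatal(err)
	}
	manifest, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("get manifest: status %d: %s", resp.StatusCode, manifest)
	}

	// Re-PUT the same manifest under thousands of new tag names.
	for i := 0; i < 20000; i++ {
		tag := fmt.Sprintf("ci-%05d", i)
		put, _ := http.NewRequest(http.MethodPut, base+tag, bytes.NewReader(manifest))
		put.Header.Set("Content-Type", mediaType)
		put.SetBasicAuth(user, pass)
		r, err := http.DefaultClient.Do(put)
		if err != nil {
			log.Fatal(err)
		}
		io.Copy(io.Discard, r.Body)
		r.Body.Close()
		if r.StatusCode != http.StatusCreated {
			log.Fatalf("tag %s: unexpected status %d", tag, r.StatusCode)
		}
	}
}
```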

Versions:

  • harbor version: v2.3.3
  • docker engine version: N/A
  • docker-compose version: N/A

Additional context:

  • We use GCS for registry backend storage

dkulchinsky commented 2 years ago

This is potentially related to https://github.com/goharbor/harbor/issues/12948, which we already hit in the past, but it looks like there's no progress there, so I'm hoping there's some new insight.

wy65701436 commented 2 years ago

The performance issue happens on the distribution side when looking up and removing tags. We will do some investigation on this, but there is no specific plan so far.

dkulchinsky commented 2 years ago

The performance issue happens on the distribution side when looking up and removing tags. We will do some investigation on this, but there is no specific plan so far.

Thanks @wy65701436, is there a workaround? I was thinking of deleting these artifacts from the artifact_trash table so GC won't pick them up. I realize we would end up with these manifests and blobs orphaned in GCS, but given the situation that seems better than the alternative, where GC is completely broken.
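
For reference, the workaround I have in mind is roughly the following, assuming the default Harbor Postgres schema where artifact_trash carries a repository_name column (please double-check table and column names on your version; this only hides the rows from GC, the blobs stay orphaned in GCS):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Placeholder DSN: point at the Harbor core database (named "registry" by default).
	db, err := sql.Open("postgres",
		"postgres://harbor:<password>@harbor-database:5432/registry?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Drop the trashed artifacts of the noisy repository so GC no longer
	// tries to delete their manifests via the registry v2 API.
	res, err := db.Exec(
		`DELETE FROM artifact_trash WHERE repository_name = $1`,
		"<project>/<repo>/demo-go-app",
	)
	if err != nil {
		log.Fatal(err)
	}
	n, _ := res.RowsAffected()
	fmt.Printf("removed %d rows from artifact_trash\n", n)
}
```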

I realize this is sort of a corner case, but I can imagine that others may end up in a similar situation, so hopefully you folks can find some cycles soon to take a look at fixing this šŸ™šŸ¼

dkulchinsky commented 2 years ago

@wy65701436 on another Harbor instance we run, we noticed that deleting a single manifest from a repository with ~4,000 tags takes about 2 minutes.

It seems to me like there's a significant performance issue during GC for repositories that have several thousand tags or more.

for example:

2021-10-20T11:10:52Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>, application/vnd.docker.distribution.manifest.v2+json, sha256:fe582557fdb5eb00ca114e263784be44661f35ba1f7f15c764f0f43567a69939
2021-10-20T11:12:36Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:273]: delete manifest from storage: sha256:fe582557fdb5eb00ca114e263784be44661f35ba1f7f15c764f0f43567a69939

2021-10-20T11:12:37Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>, application/vnd.docker.distribution.manifest.v2+json, sha256:c7faa9c6517dd640432b9172b832284b19a10324cde9782c1f16a793d8a9d041
2021-10-20T11:14:20Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:273]: delete manifest from storage: sha256:c7faa9c6517dd640432b9172b832284b19a10324cde9782c1f16a793d8a9d041

It also appears that these operations run sequentially; perhaps some form of parallelism could be introduced to speed this up? Though I think the root constraint needs to be addressed to support any sizeable deployment.
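
Just to illustrate the parallelism idea (this is not Harbor's actual GC code; deleteManifest below is a hypothetical stand-in for the per-manifest delete in the sweep phase), a bounded worker pool along these lines would let several deletes run concurrently:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

// deleteManifest is a hypothetical stand-in for the per-manifest delete
// the GC sweep phase performs via the registry v2 API.
func deleteManifest(ctx context.Context, repo, digest string) error {
	time.Sleep(100 * time.Millisecond) // simulate a slow backend delete
	fmt.Println("deleted", repo, digest)
	return nil
}

func main() {
	manifests := []struct{ repo, digest string }{
		{"<project>/<repo>", "sha256:fe582557..."}, // digests truncated for brevity
		{"<project>/<repo>", "sha256:c7faa9c6..."},
		// ... thousands more
	}

	g, ctx := errgroup.WithContext(context.Background())
	g.SetLimit(8) // cap concurrent deletes so the registry/backend is not overwhelmed

	for _, m := range manifests {
		m := m // capture loop variable
		g.Go(func() error {
			return deleteManifest(ctx, m.repo, m.digest)
		})
	}

	if err := g.Wait(); err != nil {
		fmt.Println("sweep failed:", err)
	}
}
```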

heww commented 2 years ago

One option to resolve this problem: keep the artifacts untagged in the distribution, and manage the tags only on the Harbor core side.

wy65701436 commented 2 years ago

Since v2.5, we have introduced skipping deletion failures during GC. This could work around the timeout.

github-actions[bot] commented 2 years ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 2 years ago

still relevant

github-actions[bot] commented 2 years ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 2 years ago

still an issue

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 year ago

definitely not stale

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 year ago

not stale

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 year ago

not stale

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 year ago

not stale

github-actions[bot] commented 1 year ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 year ago

not stale

github-actions[bot] commented 11 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 11 months ago

not stale

github-actions[bot] commented 9 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 9 months ago

not stale

github-actions[bot] commented 7 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 7 months ago

not stale

github-actions[bot] commented 5 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 5 months ago

not stale

kingnarmer commented 5 months ago

I am getting the same error with 2.10.2.

github-actions[bot] commented 3 months ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 3 months ago

not stale

kingnarmer commented 2 months ago

not stale.

github-actions[bot] commented 2 weeks ago

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

dkulchinsky commented 1 week ago

not stale
