goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

how to re-sync database and s3 contents #21093

Open mdavid01 opened 4 weeks ago

mdavid01 commented 4 weeks ago

Hi team: We believe the artifact contents of our PostgreSQL database are out of sync with what is actually in the S3 bucket. Our S3 bucket size is 143TB. From our users' standpoint, the Harbor UI, Swagger, and docker push/pull are all functioning as expected. But there's no way we have 143TB of active projects/repos/artifacts.

  1. Is there a tool or method available that we can use to identify disconnects between the S3 content and the database content? If not, can you suggest how you might go about finding the disconnects?
  2. Does the S3 bucket contain image scan results? If so, it could explain our growth.
  3. What does the table 'Artifacts_trash' contain? Are its records input to any Harbor process or job?
  4. In the 'blob' table, we have ~6600 records with status 'delete'. What are those records? Are they input to any Harbor process or job?

Thanks.

Vad1mo commented 4 weeks ago

This is rather unusual, but I think it might have happened, for example, when the GC cannot delete the files.

  1. No such tool exists. IMO it has to be created so that it iterates over the Harbor DB and S3 and finds layers, blobs, and manifests that are on S3 but not in Harbor.
  2. I am not sure. We had some functionality that stored data in S3, but you would see it in the bucket, as it is top-level next to docker.

Did you run the GC? What was the outcome?
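
A minimal sketch of the core comparison such a tool would need, in Python. It assumes the registry's standard storage layout, in which each blob lives under a key ending in `blobs/sha256/<first two hex chars>/<full hex digest>/data`, and that the set of digests known to Harbor has already been read from the database (for example from the `digest` column of the `blob` table). The function names are hypothetical, and listing the bucket and querying PostgreSQL are deliberately left out.

```python
import re
from typing import Iterable, Optional, Set

# Assumed key layout: .../blobs/sha256/<2 hex chars>/<64 hex chars>/data
_BLOB_KEY = re.compile(r"blobs/sha256/([0-9a-f]{2})/([0-9a-f]{64})/data$")


def digest_from_key(key: str) -> Optional[str]:
    """Return 'sha256:<hex>' for a blob data key, or None for any other key."""
    match = _BLOB_KEY.search(key)
    if match is None or not match.group(2).startswith(match.group(1)):
        return None
    return "sha256:" + match.group(2)


def find_orphans(s3_keys: Iterable[str], db_digests: Set[str]) -> Set[str]:
    """Digests that have a data object on S3 but no row in the database."""
    on_s3 = {d for d in map(digest_from_key, s3_keys) if d is not None}
    return on_s3 - db_digests
```

Digests returned by `find_orphans` are only candidates for cleanup. A push that is still in progress could plausibly show up here, so they should be re-checked before anything is deleted.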

mdavid01 commented 4 weeks ago

Howdy Vadim. First, thank you for the very quick response. RE GC: our GC is scheduled to run daily and cannot complete within 24 hours, maybe not even within 48 hours. We are just now working out how the DB can tell us how long a single GC run takes. As of yesterday, we had 30 GC jobs pending. We also had 139 EXECUTION_SWEEP jobs pending. Since our operation is 24x7, we cannot stop the service, say, on the weekends. We significantly upsized our K8s pods on Tuesday. That provided tremendous improvement for teams performing push/pull/scan functions on large images. However, there was no observable relief for GC. Looking in the K8s pods, we could not tell if GC or EXECUTION_SWEEP jobs were actually running, so we stopped the queues and let those functions reschedule themselves.

I believe GC may be impacted by the disconnect between the S3 bucket and the DB. It would save us a lot of time if you could map for us where the S3 keys for blobs, layers, manifests, and artifacts are found in the database. I know that’s a lot to ask, but our AWS costs are drawing a lot of attention because they proportionately impact AWS backup, CloudWatch, and events costs. We’ve discussed creating the tool and contributing it to GoHarbor.
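
For blobs, one direction of that mapping can be sketched in Python. This assumes the registry component uses the standard distribution storage layout, and that a digest string such as the one in the `digest` column of the `blob` table is the starting point; the key itself is derived from the digest rather than stored. `blob_data_key` and its `root` parameter (the bucket's configured root directory, if any) are hypothetical names used only for illustration.

```python
import string


def blob_data_key(digest: str, root: str = "") -> str:
    """Build the assumed S3 key of a blob's data object from its digest."""
    algo, _, hex_digest = digest.partition(":")
    is_hex = all(c in string.hexdigits for c in hex_digest)
    if algo != "sha256" or len(hex_digest) != 64 or not is_hex:
        raise ValueError(f"unsupported digest: {digest!r}")
    hex_digest = hex_digest.lower()
    # Assumed layout: blobs are sharded by the first two hex characters.
    key = f"docker/registry/v2/blobs/{algo}/{hex_digest[:2]}/{hex_digest}/data"
    prefix = root.strip("/")
    return f"{prefix}/{key}" if prefix else key
```

Checking whether the resulting key exists in the bucket would show, per database row, whether the object is still present.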

Our leadership has asked about the possibility of a collaborative session with your SME(s) on this subject. If that’s possible, please let me know. Email is the best communications conduit for the moment. Any info or direction you can provide is greatly appreciated.

Thanks.

wy65701436 commented 4 weeks ago

@mdavid01

Basically, these are pretty technical questions. From an open-source perspective, the best way to find answers is to check the source code and the design documents.

  1. The Harbor DB doesn't connect directly to S3. We don't have such a tool, but you can look at the logs to help troubleshoot issues.
  2. No, the results are stored in the database.

Questions 3 and 4 are about non-blocking garbage collection. For more details, please check the design doc and the source code.

By the way, we have regular community meetings. You’re welcome to join us if you want to discuss anything related to Harbor!

mdavid01 commented 3 weeks ago

Thanks, we have stepped into the code just a bit. Where are the design docs located?

stonezdj commented 3 weeks ago

Is it possible that another registry uses the same S3 bucket?

jan-kantert commented 2 weeks ago

We see very similar behavior in our installation. If you look at historic issues this seems to be quite common. Currently, the common workaround seems to be: https://github.com/goharbor/harbor/issues/20606#issuecomment-2403367842