goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.
https://goharbor.io
Apache License 2.0

how to re-sync database and s3 contents #21093

Open mdavid01 opened 3 hours ago

mdavid01 commented 3 hours ago

Hi team: We believe the artifact contents of our PostgreSQL database are out of sync with what is actually in the S3 bucket. Our S3 bucket size is 143TB. From our users' standpoint, the Harbor UI, Swagger, and docker push/pull all function as expected. But there's no way we have 143TB of active projects/repos/artifacts.

  1. Is there a tool or method we can use to identify disconnects between the S3 content and the database content? If not, can you suggest how you would go about finding the disconnects?
  2. Does the S3 bucket contain image scan results? (If so, that could explain our growth.)
  3. What does the table 'Artifacts_trash' contain? Is it input to any Harbor process or job?
  4. In the 'blob' table, we have ~6600 records with status 'delete'. What are those records? Are they input to any Harbor process or job?

Thanks.

Vad1mo commented 3 hours ago

This is rather unusual, but I think it might have happened, for example, when the GC can't delete the files.

  1. No such tool exists. IMO it has to be created so that it iterates over Harbor (the database) and S3 and finds layers, blobs, and manifests that are on S3 but not in Harbor.
  2. I am not sure. We had some functionality that stored data in S3, but you would see it in the bucket, as it's at the top level, next to docker ..
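
A minimal sketch of the comparison described in point 1, not an existing Harbor tool. It assumes the bucket follows the standard registry storage layout, where blob data lives under `docker/registry/v2/blobs/sha256/<xx>/<digest>/data`, and that the database digests come from the `digest` column of the 'blob' table (an assumption about the schema). The function names `digest_from_key` and `find_disconnects` are hypothetical. Listing the bucket and querying PostgreSQL are left out; the sketch only works on the two resulting lists.

```python
import re

# Assumed layout of blob data keys in the registry storage backend:
#   docker/registry/v2/blobs/sha256/<first two hex chars>/<64 hex chars>/data
# An optional root prefix in front of "docker/" is tolerated.
BLOB_KEY = re.compile(
    r"(?:^|/)docker/registry/v2/blobs/sha256/([0-9a-f]{2})/([0-9a-f]{64})/data$"
)


def digest_from_key(key):
    """Return 'sha256:<hex>' for a blob data key, or None for any other key."""
    match = BLOB_KEY.search(key)
    if match is None:
        return None
    prefix, hex_digest = match.groups()
    # The two-character directory must match the start of the digest.
    if not hex_digest.startswith(prefix):
        return None
    return "sha256:" + hex_digest


def find_disconnects(s3_keys, db_digests):
    """Compare blob digests found on S3 with digests known to the database."""
    on_s3 = {d for d in map(digest_from_key, s3_keys) if d is not None}
    in_db = set(db_digests)
    return {
        # On S3 but unknown to the database: candidates for wasted space.
        "orphaned_on_s3": sorted(on_s3 - in_db),
        # In the database but missing on S3: pulls of these would fail.
        "missing_on_s3": sorted(in_db - on_s3),
    }


if __name__ == "__main__":
    # Hypothetical digests, for illustration only.
    a, b = "a" * 64, "b" * 64
    keys = [
        f"docker/registry/v2/blobs/sha256/aa/{a}/data",
        f"docker/registry/v2/blobs/sha256/bb/{b}/data",
    ]
    print(find_disconnects(keys, ["sha256:" + a]))
```

With 143TB in the bucket, the key listing would have to be streamed (for example, page by page) rather than held in one list. A digest reported as orphaned is only a candidate, since a push that is still in progress could look the same.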

Did you run the GC? What was the outcome?