project-zot / zot

zot - A scale-out production-ready vendor-neutral OCI-native container image/artifact registry (purely based on OCI Distribution Specification)
https://zotregistry.dev
Apache License 2.0

[Bug]: GC and scrub plugin do not cleanup malformed blobs #2598

Closed shoneefd closed 1 month ago

shoneefd commented 2 months ago

zot version

2.0.4

Describe the bug

We have a registry server with several storage directories in a malformed state: they contain no live blobs, yet large files are stuck permanently in the .uploads subdirectory. Here is an example of the observed directory structure:

root@registry-na-01:/storage/zot/sce/076d3606-33c3-41ec-95c4-9805c6be1f3c# ls -a1R
.:
.
..
blobs
index.json
oci-layout
.uploads

./blobs:
.
..
sha256

./blobs/sha256:
.
..

./.uploads:
.
..
117ab70a-14cf-408f-9ad0-ca7ab73505b2
dd6167a3-d1f0-4e3c-b0ac-d8605eba5399

In total, eleven blocks of this malformed type are taking up nearly 100 GB of junk space. We have enabled both the GC and the scrub plugin, and neither appears capable of clearing out these blocks. Here are the relevant parts of our config.json:

"distSpecVersion": "1.1.0-dev",
"storage": {
  "rootDirectory": "/storage/zot",
  "dedupe": "true",
  "gc": "true",
  "gcDelay": "1h",
  "gcInterval": "24h"
},
"scrub": {
  "enable": true,
  "interval": "2h"
},
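
For anyone who wants to measure the problem on their own storage, here is a minimal read-only sketch that lists files sitting in any .uploads directory for longer than the configured gcDelay. It is not zot's GC logic. The helper names (parse_duration, find_stale_uploads) are made up for illustration, and the duration parser only handles simple values such as "1h" or "30m". It assumes upload sessions are plain files directly inside .uploads, as the listing above suggests.

```python
import os
import time
from pathlib import Path


def parse_duration(text):
    """Parse a simple duration such as '1h', '30m' or '45s' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def find_stale_uploads(root, delay_seconds, now=None):
    """Return sorted (path, size_bytes) pairs for files in any .uploads
    directory under root whose mtime is older than delay_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if Path(dirpath).name != ".uploads":
            continue
        for name in filenames:
            path = Path(dirpath) / name
            info = path.stat()
            if now - info.st_mtime > delay_seconds:
                stale.append((str(path), info.st_size))
    return sorted(stale)


if __name__ == "__main__":
    # Values taken from the config fragment above; nothing is deleted.
    found = find_stale_uploads("/storage/zot", parse_duration("1h"))
    for path, size in found:
        print(f"{size:>14}  {path}")
    print(f"{len(found)} stale upload files, {sum(s for _, s in found)} bytes")
```

The script only reports; whether it is safe to remove a given file depends on whether a client could still resume that upload session.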

To reproduce

We suspect this issue occurs when an in-progress blob upload is canceled, but we have not yet reproduced it.

Expected behavior

The GC should clean up these blocks.

rchincha commented 2 months ago

@shoneefd thanks for reporting this. How exactly are you using zot in your setup?

shoneefd commented 2 months ago

@rchincha We run two separate zot servers in each of our development and production environments. One is the primary: an OCI registry serving images to containerized workloads that, in production, run on thousands of machines or more. Until now, all of our direct OCI operations have gone to this primary server, so all of the malformed blobs are on it. The second server is a mirror, kept synchronized with the primary by the sync plugin, which does not retrieve these malformed blobs. (As a result, at least in the development environment, the primary currently holds almost 100 GB more data than the mirror, even though the two should be in sync.)

Originally, the mirror was only intended as a storage backup of the contents of the primary server. However, as traffic has scaled up we have decided to start using both servers as endpoints for the registry. This makes it doubly important that GC cleanup happens properly, as otherwise this junk data will not only pile up, it will bring our servers even farther out of sync, making it more difficult to tell which inconsistencies are caused by junk data and which are cause for more concern.
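
To separate the two kinds of inconsistency, one option is to compare the committed blobs of a repository on both servers and report leftover upload data separately. The sketch below assumes both storage trees are reachable as local paths and follow the OCI layout shown earlier (blobs/sha256 plus .uploads). The function names are hypothetical and this is not part of zot or its sync plugin.

```python
from pathlib import Path


def blob_digests(repo_dir):
    """Return the set of blob file names under <repo_dir>/blobs/sha256."""
    blob_dir = Path(repo_dir) / "blobs" / "sha256"
    if not blob_dir.is_dir():
        return set()
    return {p.name for p in blob_dir.iterdir() if p.is_file()}


def upload_bytes(repo_dir):
    """Return the total size of files directly inside <repo_dir>/.uploads."""
    upload_dir = Path(repo_dir) / ".uploads"
    if not upload_dir.is_dir():
        return 0
    return sum(p.stat().st_size for p in upload_dir.iterdir() if p.is_file())


def compare_repo(primary_repo, mirror_repo):
    """Compare committed blobs between two copies of one repository and
    report leftover upload data on the primary as a separate figure."""
    primary = blob_digests(primary_repo)
    mirror = blob_digests(mirror_repo)
    return {
        "only_on_primary": sorted(primary - mirror),
        "only_on_mirror": sorted(mirror - primary),
        "primary_upload_bytes": upload_bytes(primary_repo),
    }
```

A repository whose blob sets match but whose primary_upload_bytes is large would point at junk upload data; a difference in the blob sets themselves would be the more worrying kind of drift.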

rchincha commented 2 months ago

@shoneefd https://github.com/project-zot/zot/pull/2599 has been merged; please go ahead and build zot from main and try it. The fix will be included in the next release.

If you confirm this is fixed, pls close the issue.

rchincha commented 1 month ago

@shoneefd closing this issue since the PR is now merged and we haven't heard anything back.

Pls. re-open if you disagree.