goharbor / harbor

An open source trusted cloud native registry project that stores, signs, and scans content.

Large Garbage Collections Slow and Can Easily Fail Indefinitely #19991

Open isugimpy opened 8 months ago

isugimpy commented 8 months ago

Apologies for making this a general Issue type, but this isn't exactly a bug, and it isn't exactly a proposal for a feature either.

On very large, high-churn Harbor instances, garbage collection can become very slow or fail indefinitely due to OOM, requiring human intervention, either by adjusting resources or by purging data from the backing datastore. Additionally, garbage collection cycles are all-or-nothing: a full marking pass must complete before any deletion occurs.

Incremental garbage collection, which purges chunks at a time rather than requiring a full mark cycle every run, would likely be a decent solution here, but may not be in line with the goals of the project. There are various ways this could be implemented, but I would lean toward a concurrent mark/sweep approach, as that would be the closest to the current implementation. The run step could begin the mark in a goroutine, push chunked results onto a channel, and feed those chunks to a pool of workers that execute deletions in near-realtime, allowing deletions to occur without first collecting the full mark set. This would be beneficial from a speed perspective, and it would also let an incomplete GC be retried against a smaller total dataset, eventually getting fully caught up.
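
To make the shape of that concrete, here is a rough, purely illustrative sketch of the pipeline I have in mind. The markCandidates and deleteBlob helpers are hypothetical placeholders, not Harbor code or APIs:

// Purely illustrative sketch of a concurrent mark/sweep GC pipeline.
// markCandidates and deleteBlob are hypothetical placeholders, not Harbor APIs.
package main

import (
    "fmt"
    "sync"
)

// markCandidates stands in for the mark phase: instead of buffering the full
// mark set in memory, it streams chunks of deletable blob digests as it finds them.
func markCandidates(out chan<- []string) {
    defer close(out)
    for i := 0; i < 10; i++ { // pretend each iteration scans one slice of the catalog
        out <- []string{fmt.Sprintf("sha256:dummy-%d", i)}
    }
}

// deleteBlob stands in for the sweep step (e.g. a registry or S3 delete).
func deleteBlob(digest string) error {
    fmt.Println("deleted", digest)
    return nil
}

func main() {
    chunks := make(chan []string, 8)
    go markCandidates(chunks) // mark runs concurrently and feeds the sweepers

    const sweepers = 5
    var wg sync.WaitGroup
    for i := 0; i < sweepers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for chunk := range chunks {
                for _, d := range chunk {
                    if err := deleteBlob(d); err != nil {
                        // A failed chunk simply stays behind and is retried on the next GC run.
                        fmt.Println("sweep error:", err)
                    }
                }
            }
        }()
    }
    wg.Wait()
}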

I'm not sure if the maintainers have something else in mind, or if I'm the only one dealing with the level of scale here, but this has been a real pain point that I'd love to see a solution to.

stonezdj commented 8 months ago

What is the scale of your Harbor instance? How many artifacts are there in total?

isugimpy commented 8 months ago

The instance that spawned this has ~150k artifacts right now across 648 repositories in one project. The backing datastore is AWS S3; metadata is in AWS Aurora Postgres and AWS ElastiCache Redis.

wy65701436 commented 8 months ago

Have you tried performing multiple deletions in parallel, for example by setting the number of workers to 5? BTW, what version of Harbor are you using?

isugimpy commented 8 months ago

Apologies, I didn't see the reply here until just now. This is on 2.8.4. I could likely see about updating us to 2.9.0+ to get the GC changes there, but looking at the code it's not clear whether that would solve the issue we're having. Edit: Just to be clear, we never even reach the point where deletion happens. The mark phase of the garbage collector runs for days (or, in the case I describe below, over two weeks) before failing for any number of reasons, and no deletions are performed. This is why I'm suggesting a move to a more incremental model, where deletions can happen much earlier in the process, because that would eventually allow catching up.

Can you clarify what you mean by performing multiple deletions? Our jobservice pods run 10 workers each, but exactly one garbage collection task is running at any given time.

Since I last posted, we actually let a garbage collection task run for 18 days, only to have it fail with the following error about an hour ago, leaving nothing deleted from the bucket. We've got ~10TB of data in our actual registry and ~75TB in the bucket, because the GC failures just compound over time.

ERROR [/jobservice/runner/redis.go:167]: Run job GARBAGE_COLLECTION:c4bc6b26560c68ef94b423e0 error: runtime error: interface conversion: interface {} is nil, not []uint8; stack: goroutine 74 [running]:
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run.func2()
    /harbor/src/jobservice/runner/redis.go:165 +0x7b
panic({0x14c0320, 0xc045059710})
    /usr/local/go/src/runtime/panic.go:890 +0x263
github.com/goharbor/harbor/src/jobservice/job.getStatus({0x1933138?, 0xc04517c318?}, {0xc0450cab40?, 0x18?})
    /harbor/src/jobservice/job/tracker.go:619 +0x194
github.com/goharbor/harbor/src/jobservice/job.(*basicTracker).Status(0xc03691c3c0)
    /harbor/src/jobservice/job/tracker.go:193 +0xcf
github.com/goharbor/harbor/src/jobservice/job/impl.(*Context).OPCommand(0xc038b2e3d0?)
    /harbor/src/jobservice/job/impl/context.go:165 +0x26
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).shouldStop(0xc04d648fa0?, {0x1934e90?, 0xc038a4ec80?})
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:644 +0x27
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).mark(0xc045164f30, {0x1934e90, 0xc038a4ec80})
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:224 +0x41e
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).Run(0xc045164f30, {0x1934e90, 0xc038a4ec80}, 0x3?)
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:156 +0x70
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run(0xc0008e6e10, 0xc00b5d6600)
    /harbor/src/jobservice/runner/redis.go:231 +0x11d8
github.com/goharbor/harbor/src/jobservice/worker/cworker.(*basicWorker).registerJob.func2(0x23ddda0?)
    /harbor/src/jobservice/worker/cworker/c_worker.go:439 +0x25
github.com/gocraft/work.runJob.func1()
    /harbor/src/vendor/github.com/gocraft/work/run.go:31 +0x3f0
github.com/goharbor/harbor/src/jobservice/worker/cworker.(*workerContext).logJob(0x2?, 0xc00b5d6600, 0xc03691c370)
    /harbor/src/jobservice/worker/cworker/c_worker.go:79 +0x136
reflect.Value.call({0x1489320?, 0x1743370?, 0xc00b5d6600?}, {0x169544e, 0x4}, {0xc05b611d48, 0x3, 0x461ea5?})
    /usr/local/go/src/reflect/value.go:586 +0xb0b
reflect.Value.Call({0x1489320?, 0x1743370?, 0xc05b611cd8?}, {0xc05b611d48?, 0x16?, 0x12a8e0?})
    /usr/local/go/src/reflect/value.go:370 +0xbc
github.com/gocraft/work.runJob.func1()
    /harbor/src/vendor/github.com/gocraft/work/run.go:23 +0x215
github.com/gocraft/work.runJob(0xc00b5d6600, {0x1945ac8, 0x14ab100}, {0xc000014270, 0x1, 0x1}, 0xc000528240)
    /harbor/src/vendor/github.com/gocraft/work/run.go:51 +0x22b
github.com/gocraft/work.(*worker).processJob(0xc000792900, 0xc00b5d6600)
    /harbor/src/vendor/github.com/gocraft/work/worker.go:198 +0x20c
github.com/gocraft/work.(*worker).loop(0xc000792900)
    /harbor/src/vendor/github.com/gocraft/work/worker.go:119 +0x1cd
created by github.com/gocraft/work.(*worker).start
    /harbor/src/vendor/github.com/gocraft/work/worker.go:78 +0x5a
ERROR [/jobservice/runner/redis.go:123]: Job 'GARBAGE_COLLECTION:c4bc6b26560c68ef94b423e0' exit with error: runtime error: interface conversion: interface {} is nil, not []uint8; stack: goroutine 74 [running]:
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run.func2()
    /harbor/src/jobservice/runner/redis.go:165 +0x7b
panic({0x14c0320, 0xc045059710})
    /usr/local/go/src/runtime/panic.go:890 +0x263
github.com/goharbor/harbor/src/jobservice/job.getStatus({0x1933138?, 0xc04517c318?}, {0xc0450cab40?, 0x18?})
    /harbor/src/jobservice/job/tracker.go:619 +0x194
github.com/goharbor/harbor/src/jobservice/job.(*basicTracker).Status(0xc03691c3c0)
    /harbor/src/jobservice/job/tracker.go:193 +0xcf
github.com/goharbor/harbor/src/jobservice/job/impl.(*Context).OPCommand(0xc038b2e3d0?)
    /harbor/src/jobservice/job/impl/context.go:165 +0x26
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).shouldStop(0xc04d648fa0?, {0x1934e90?, 0xc038a4ec80?})
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:644 +0x27
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).mark(0xc045164f30, {0x1934e90, 0xc038a4ec80})
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:224 +0x41e
github.com/goharbor/harbor/src/jobservice/job/impl/gc.(*GarbageCollector).Run(0xc045164f30, {0x1934e90, 0xc038a4ec80}, 0x3?)
    /harbor/src/jobservice/job/impl/gc/garbage_collection.go:156 +0x70
github.com/goharbor/harbor/src/jobservice/runner.(*RedisJob).Run(0xc0008e6e10, 0xc00b5d6600)
    /harbor/src/jobservice/runner/redis.go:231 +0x11d8
github.com/goharbor/harbor/src/jobservice/worker/cworker.(*basicWorker).registerJob.func2(0x23ddda0?)
    /harbor/src/jobservice/worker/cworker/c_worker.go:439 +0x25
github.com/gocraft/work.runJob.func1()
    /harbor/src/vendor/github.com/gocraft/work/run.go:31 +0x3f0
github.com/goharbor/harbor/src/jobservice/worker/cworker.(*workerContext).logJob(0x2?, 0xc00b5d6600, 0xc03691c370)
    /harbor/src/jobservice/worker/cworker/c_worker.go:79 +0x136
reflect.Value.call({0x1489320?, 0x1743370?, 0xc00b5d6600?}, {0x169544e, 0x4}, {0xc05b611d48, 0x3, 0x461ea5?})
    /usr/local/go/src/reflect/value.go:586 +0xb0b
reflect.Value.Call({0x1489320?, 0x1743370?, 0xc05b611cd8?}, {0xc05b611d48?, 0x16?, 0x12a8e0?})
    /usr/local/go/src/reflect/value.go:370 +0xbc
github.com/gocraft/work.runJob.func1()
    /harbor/src/vendor/github.com/gocraft/work/run.go:23 +0x215
github.com/gocraft/work.runJob(0xc00b5d6600, {0x1945ac8, 0x14ab100}, {0xc000014270, 0x1, 0x1}, 0xc000528240)
    /harbor/src/vendor/github.com/gocraft/work/run.go:51 +0x22b
github.com/gocraft/work.(*worker).processJob(0xc000792900, 0xc00b5d6600)
    /harbor/src/vendor/github.com/gocraft/work/worker.go:198 +0x20c
github.com/gocraft/work.(*worker).loop(0xc000792900)
    /harbor/src/vendor/github.com/gocraft/work/worker.go:119 +0x1cd
created by github.com/gocraft/work.(*worker).start
    /harbor/src/vendor/github.com/gocraft/work/worker.go:78 +0x5a

jwojnarowicz commented 7 months ago

@isugimpy We also have problems with the S3 backend (MinIO in our case). The GC runs very slowly even though it stresses MinIO's CPUs. We're running 2.10 on Kubernetes, so I guess it's still not resolved.

isugimpy commented 6 months ago

@stonezdj @wy65701436 Any further thoughts on this? It's been quite a while with no response and this is still an issue. Willing to try upgrading to resolve this, but given that our GC runs never make it past the mark stage to actually begin deleting, it doesn't look like an increase in worker counts will actually do anything. I'd love to find out I'm wrong here, of course, but I'm not seeing any way that the additional workers could help.

Would this project be open to a change that makes the mark phase concurrent with sweeping? I'm sure there are other approaches that could work as well, but that seems like the least intrusive way to deal with this.

xianlubird commented 5 months ago

Is there any other way to clean it manually first? We have the same problem now.

AllenXu93 commented 4 months ago

Same problem in v2.6.0.

wy65701436 commented 4 months ago

@isugimpy can you show me the output of those keys?

hmget {harbor_job_service_namespace}:job_stats:{gc_job_id} status
ttl {harbor_job_service_namespace}:job_stats:{gc_job_id}

AllenXu93 commented 4 months ago

@isugimpy can you show me the output of those keys?

hmget {harbor_job_service_namespace}:job_stats:{gc_job_id} status
ttl {harbor_job_service_namespace}:job_stats:{gc_job_id}

I'm hitting the same issue in my Harbor cluster.

hmget {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71 status
1) "Running"
ttl {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71
(integer) -1

Now the GC task is marked Error in Redis and the job_stats key has a TTL, but in the jobservice the GC task is still running:

> hmget {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71 status
1) "Error"
> ttl  {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71
(integer) 594374

AllenXu93 commented 4 months ago

I think I found the reason, at least in my case.

The jobservice reaper marks a job as failed once it has been running for longer than 1 day (by default):

https://github.com/goharbor/harbor/blob/f86f1cebc3a1af8c5c14c0a94d687fff04ebc6eb/src/jobservice/worker/cworker/reaper.go#L173-L182

It then sets a 7-day TTL on the job_stats value in Redis:

https://github.com/goharbor/harbor/blob/f86f1cebc3a1af8c5c14c0a94d687fff04ebc6eb/src/jobservice/common/rds/scripts.go#L91-L94

So after the job has been running for 8 (1 + 7) days, the Redis value for the job stats has expired, and the job panics because it gets nil here:

https://github.com/goharbor/harbor/blob/f86f1cebc3a1af8c5c14c0a94d687fff04ebc6eb/src/jobservice/job/tracker.go#L619-L622
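
To show why that nil turns into the interface-conversion panic in the log above, here is a small standalone sketch (not Harbor's actual code; the nil value just simulates the reply for a field of an expired key):

// Standalone illustration of the crash above, not Harbor's code.
package main

import "fmt"

func main() {
    // Simulate HMGET against an expired job_stats key: the reply for the
    // status field comes back as a nil interface{}.
    var status interface{}

    // Guarded assertion: the defensive pattern that would avoid the panic.
    if b, ok := status.([]uint8); ok {
        fmt.Println("status:", string(b))
    } else {
        fmt.Println("status missing; key has probably expired")
    }

    // Unguarded assertion: panics with
    // "interface conversion: interface {} is nil, not []uint8",
    // the same error seen in the GC job log.
    _ = status.([]uint8)
}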

The resolution is to update DefaultConfig.ReaperConfig.MaxUpdateHour to more than 24 hours.

jcambronero commented 2 months ago

@AllenXu93 this is a great finding. We're running into this same issue. I was wondering if, as a workaround, we could connect to Redis and manually extend the key's TTL to prevent this issue, e.g.:

expire {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71 999999999

However, we're not sure if this could cause other unintended consequences. Any thoughts on this?

AllenXu93 commented 1 month ago

@AllenXu93 this is a great finding. We're running into this same issue. I was wondering if, as a workaround, we could connect to Redis and manually extend the key's TTL to prevent this issue, e.g.:

expire {harbor_job_service_namespace}:job_stats:1265a0cfa09fe700cc4a8c71 999999999

However, we're not sure if this could cause other unintended consequences. Any thoughts on this?

I'm not sure, but I think it won't cause any other negative impact. For me, I changed the MaxUpdateDuration and it works fine.