Open manics opened 1 year ago
I'm not sure about the rate limit - that seems odd.
But we are having registry issues, with frequent errors from the registry API and pulls taking upwards of 30 minutes.
Visiting the harbor UI gives:
bad request: pq: the database system is in recovery mode
which suggests there's something seriously wrong, but as a managed service, I'm not sure how much access we have or if it will fix itself.
The size limit also seems to be a problem. Our registry cache is often very large, and currently exceeds the large quota of 6 TiB. There doesn't appear to be any way to increase the size, and Harbor's garbage collection doesn't seem to be deleting images on the schedule it's supposed to. Running a manual GC job led to an apparent crash of Harbor itself.
@mael-le-gal any idea for how to address these issues?
@mael-le-gal is there a way to increase the quota on the private registry? Our Registry on Google is closer to 100TB, it's going to be hard to fit in only 5, which seems to be the max limit, if OVH wants to handle a significant fraction of mybinder traffic. I don't really understand why it's so low.
@mael-le-gal I've discovered at least part of the issue is that the Harbor GC job isn't running on its configured schedule (weekly) and hasn't run in a long time. Is that something you can figure out?
Hello @minrk, I just found this thread. I work on the OVHcloud Managed Private Registry product. We recently found the issue that kept the GC from running on its schedule and fixed it, so the next scheduled runs should proceed normally.
If you run into an issue with your OVHcloud managed private registry, you can contact us on Discord at https://discord.com/channels/850031577277792286/955385289545756712, or, if it's urgent, open a ticket with OVHcloud support.
About the issue:
bad request: pq: the database system is in recovery mode
Do you still see it?
@thcdrt I haven't seen the pq error recently, and GC has run on schedule. The policy enforcement also runs on schedule, but seems to encounter lots of errors with object not found:
{"errors":[{"code":"NOT_FOUND","message":"{\"code\":10010,\"message\":\"object is not found\",\"details\":\"2036d34b8aa771894fa08959\"}"}]}
It's unclear if this is just an issue of an already-deleted resource that's safely ignored, or if something needs to be pursued.
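For anyone triaging these, note that the `message` field in that response is itself a JSON-encoded string, so it has to be decoded twice before the inner code and details are readable. A minimal Python sketch, assuming every retention error has the same shape as the one quoted above (`unpack_errors` is a hypothetical helper name, not part of Harbor):

```python
import json

# Error body copied verbatim from the policy enforcement log quoted above.
RAW = r'{"errors":[{"code":"NOT_FOUND","message":"{\"code\":10010,\"message\":\"object is not found\",\"details\":\"2036d34b8aa771894fa08959\"}"}]}'


def unpack_errors(body: str) -> list[dict]:
    """Decode the outer error list and the JSON string nested in each message."""
    unpacked = []
    for err in json.loads(body).get("errors", []):
        message = err.get("message", "")
        try:
            inner = json.loads(message)
        except (TypeError, ValueError):
            inner = None
        # Fall back to the plain text if the message is not a JSON object.
        if not isinstance(inner, dict):
            inner = {"message": message}
        unpacked.append({"code": err.get("code"), "inner": inner})
    return unpacked


for entry in unpack_errors(RAW):
    print(entry["code"], entry["inner"].get("code"), entry["inner"].get("details"))
```

Collecting the `details` values across repositories might help show whether the same missing object keeps recurring, though this does not by itself say whether the errors are safe to ignore.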
Is the policy enforcement working despite the errors?
Another question, related to this incident: https://public-cloud.status-ovhcloud.com/incidents/9gpvj25b6t8m. Did you run into problems pushing Docker images (errors or latency) last month? If so, could you please send me a message on our Discord at https://discord.com/channels/850031577277792286/955385289545756712?
I'm not allowed to send messages on the discord server yet, but yes, we did have very slow and erroring image pulls last month.
I don't know if the policy enforcement is working perfectly, but it is deleting at least some images. Not all repositories show errors, but many do (it's hard to see since I only get 5 repos per page, out of 7000).
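Since paging through the UI 5 repositories at a time is impractical, one option might be to walk a paginated listing from a script instead. The sketch below shows only the pagination loop; `fetch_page` is a hypothetical callable that would wrap whatever paginated listing the registry exposes (the actual endpoint and its maximum page size are not shown here and would need checking), and `fake_fetch` is a stand-in with made-up repository names so the loop can be run offline.

```python
from typing import Callable, Iterator


def iter_all(fetch_page: Callable[[int, int], list], page_size: int = 100) -> Iterator[dict]:
    """Yield every item from a 1-indexed paginated listing.

    fetch_page(page, page_size) is a hypothetical callable returning one
    page of results as a list; an empty or short page ends the iteration.
    """
    page = 1
    while True:
        items = fetch_page(page, page_size)
        if not items:
            return
        yield from items
        if len(items) < page_size:
            return
        page += 1


# Stand-in data for illustration only: 7000 placeholder repositories.
FAKE_REPOS = [{"name": f"repo-{i}"} for i in range(7000)]


def fake_fetch(page: int, page_size: int) -> list:
    start = (page - 1) * page_size
    return FAKE_REPOS[start:start + page_size]


print(sum(1 for _ in iter_all(fake_fetch, page_size=100)))
```

Swapping `fake_fetch` for a real HTTP call would let the per-repository retention results be collected in one pass rather than page by page.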
OK, how can I contact you?
EDIT: I got you on Discord
The OVH2 deployment frequently deploys BinderHub, but the tests regularly fail, e.g. https://github.com/jupyterhub/mybinder.org-deploy/actions/runs/4234782302/jobs/7380258337
Attempting to launch a repo on https://ovh2.mybinder.org/ results in
Rate limit exceeded. Try again in 3600 seconds.