OVH2 deploy frequently fails

jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org

https://mybinder-sre.readthedocs.io/en/latest/index.html

BSD 3-Clause "New" or "Revised" License


OVH2 deploy frequently fails #2514

Open manics opened 1 year ago

manics commented 1 year ago

The OVH2 deployment frequently deploys BinderHub, but the tests regularly fail, e.g. https://github.com/jupyterhub/mybinder.org-deploy/actions/runs/4234782302/jobs/7380258337

Attempting to launch a repo on https://ovh2.mybinder.org/ results in the error "Rate limit exceeded. Try again in 3600 seconds."

minrk commented 1 year ago

I'm not sure about the rate limit - that seems odd.

But we are having registry issues, with frequent errors from the registry API and pulls taking upwards of 30 minutes.

Visiting the harbor UI gives:

bad request: pq: the database system is in recovery mode

which suggests there's something seriously wrong, but as a managed service, I'm not sure how much access we have or if it will fix itself.

The size limit also seems to be a problem. Our registry cache is often very large, and currently exceeds the large quota of 6TiB. There doesn't appear to be any way to increase the size, and harbor's garbage collection doesn't seem to be deleting images on the schedule it's supposed to. Running a manual GC job led to an apparent crash of harbor itself.

@mael-le-gal any idea for how to address these issues?

minrk commented 1 year ago

@mael-le-gal is there a way to increase the quota on the private registry? Our registry on Google is closer to 100TB, so it's going to be hard to fit in only 5, which seems to be the max limit, if OVH wants to handle a significant fraction of mybinder traffic. I don't really understand why it's so low.

minrk commented 1 year ago

@mael-le-gal I've discovered at least part of the issue is that the Harbor GC job isn't running on its configured schedule (weekly) and hasn't run in a long time. Is that something you can figure out?

thcdrt commented 1 year ago

Hello @minrk, I just found this thread. I work on the OVH Managed Private Registry product. We recently found the issue with the GC not running according to its schedule and fixed it, so the next scheduled runs should proceed normally.

thcdrt commented 1 year ago

If you have an issue with your OVHcloud managed private registry, you can contact us on Discord at https://discord.com/channels/850031577277792286/955385289545756712 or, if the issue is urgent, open a ticket with OVHcloud support.

thcdrt commented 1 year ago

About the issue: bad request: pq: the database system is in recovery mode

Do you still see it?

minrk commented 1 year ago

@thcdrt I haven't seen the pq error recently, and GC has run on schedule. The policy enforcement also runs on schedule, but seems to encounter lots of errors with object not found:

{"errors":[{"code":"NOT_FOUND","message":"{\"code\":10010,\"message\":\"object is not found\",\"details\":\"2036d34b8aa771894fa08959\"}"}]}

It's unclear if this is just an issue of an already-deleted resource that's safely ignored, or if something needs to be pursued.
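As an aside for anyone triaging these logs: the `message` field in the payload above is itself a JSON-encoded string, so it has to be decoded twice before the inner code and details are readable. The following is a minimal sketch, not Harbor tooling; the function name `unpack_harbor_error` is hypothetical, and treating `details` as an identifier for the missing object is an assumption.

```python
import json


def unpack_harbor_error(payload: str) -> list[dict]:
    """Decode an error payload whose 'message' field is itself JSON-encoded.

    Hypothetical helper: it only relies on the shape of the payload quoted
    in this thread, not on any documented Harbor schema.
    """
    decoded = []
    for err in json.loads(payload).get("errors", []):
        message = err.get("message", "")
        try:
            inner = json.loads(message)
        except (TypeError, ValueError):
            # The message was plain text rather than nested JSON.
            inner = {"message": message}
        if not isinstance(inner, dict):
            inner = {"message": message}
        decoded.append({"outer_code": err.get("code"), **inner})
    return decoded


# The exact payload quoted above, kept as a raw string so the escapes survive.
sample = r'''{"errors":[{"code":"NOT_FOUND","message":"{\"code\":10010,\"message\":\"object is not found\",\"details\":\"2036d34b8aa771894fa08959\"}"}]}'''

for entry in unpack_harbor_error(sample):
    print(entry["outer_code"], entry["code"], entry["details"])
```

Collecting the `details` values across many such errors might help show whether the same objects fail repeatedly, which would be one hint as to whether the errors are safe to ignore.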

thcdrt commented 1 year ago

Is the policy enforcement working despite the errors?

Another question, related to this incident https://public-cloud.status-ovhcloud.com/incidents/9gpvj25b6t8m: did you encounter issues with Docker image pushes (errors or latency) last month? If so, could you please send me a message on our Discord https://discord.com/channels/850031577277792286/955385289545756712?

minrk commented 1 year ago

I'm not allowed to send messages on the discord server yet, but yes, we did have very slow and erroring image pulls last month.

I don't know if the policy enforcement is working perfectly, but it is deleting at least some images. Not all repositories show errors, but many do (it's hard to see since I only get 5 repos per page, out of 7000).
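At 5 repositories per page, 7000 repositories means 1400 pages in the UI, so checking every repository is more practical through an API with a larger page size. The following is a rough sketch of a generic pagination loop, assuming the registry API accepts a page number and a page size; `iter_all` and `fake_fetch` are hypothetical names, and the fetch function here is a stand-in rather than a real Harbor call.

```python
import math
from typing import Callable, Iterator


def iter_all(fetch_page: Callable[[int, int], list], page_size: int = 100) -> Iterator:
    """Yield every item from a paginated listing, one page at a time.

    fetch_page(page, page_size) is assumed to return a list of items for
    the given 1-based page, and a short or empty list on the last page.
    """
    page = 1
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break  # a short page means there is nothing left to fetch
        page += 1


# Stand-in data source so the sketch runs without a registry (hypothetical).
fake_repos = [f"repo-{i}" for i in range(7000)]


def fake_fetch(page: int, page_size: int) -> list:
    start = (page - 1) * page_size
    return fake_repos[start:start + page_size]


all_repos = list(iter_all(fake_fetch, page_size=100))
pages_at_ui_size = math.ceil(len(all_repos) / 5)
print(len(all_repos), pages_at_ui_size)
```

In real use, `fake_fetch` would be replaced by a function that makes an authenticated request to the registry's repository listing and returns the parsed items.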

thcdrt commented 1 year ago

OK, how can I contact you?

EDIT: I got you on Discord