benmandrew opened 5 months ago
`opam-repo-ci-local` used to work with a cluster. It looks like you removed that feature in #258. The PR comment said it required access to a cluster to run jobs. This is true; however, the "cluster" can be the local machine (the `docker-compose.yml` shows how to do this).

Anyway, if you have access to the machine running the service, running `strace` on it would be useful to see what it's doing.
We talked about this in our team meeting today. I'll do my best to summarize from my notes, but it may need correction, as I am still onboarding and lack a lot of context:
@mtelvers noted that he reset the SQLite database late on the 17th, and @benmandrew pointed out that this roughly corresponds with the recovery and stabilization of the service since the 18th. Mark proposed the following hypothesis:
We discussed a couple of ways to test this hypothesis and, if correct, to ameliorate the problem:
In the discussion I think we agreed that (3) was probably worth doing in any case, since we won't care about the index after some point (3/6/12 months?), and so reducing the size of the tables after the cutoff will help with performance and memory needs with no downside.
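For reference, pruning past a cut-off can be sketched against SQLite. Everything below is hypothetical: the thread doesn't show the real schema of the index database, so the `jobs` table, the `finished_at` column, and the dates are stand-ins for illustration only.

```python
import sqlite3

# Hypothetical schema -- the real opam-repo-ci index tables are not
# shown in this thread, so "jobs" and "finished_at" are invented names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, finished_at TEXT)")
conn.executemany(
    "INSERT INTO jobs (finished_at) VALUES (?)",
    [("2023-01-01",), ("2024-06-01",), ("2024-07-01",)],
)

# Prune everything older than the retention cut-off; in production this
# would run as periodic maintenance rather than a one-off script.
cutoff = "2024-01-01"
conn.execute("DELETE FROM jobs WHERE finished_at < ?", (cutoff,))
conn.commit()

# VACUUM actually returns the freed pages, shrinking the database;
# DELETE alone only marks pages as reusable within the file.
conn.execute("VACUUM")

remaining = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(remaining)  # only rows newer than the cut-off survive
```

String-compared ISO dates stand in for whatever timestamp format the real tables use; the actual retention window (3/6/12 months) is the open question from the meeting.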
Looks like this has been resolved. No recurrence of the significant service degradation since the actions taken here:
I'll open an issue to track the DB pruning maintenance improvement, and close this. We can reopen it if the behavior recurs.
It looks like we had another period of degraded service for several hours today: https://status.ocaml.ci.dev/d/i1DyPsAGz/opam-repo-ci?orgId=1&refresh=5s&from=1720192947814&to=1720214547814
It could help to note any coincident activity that might have set this off.
Looks like we've been having more degraded service again...
The background rebuild of the RAID disk could be a contributing factor at the moment.
The `opam-repo-ci` service has had random and inconsistent downtime for a long time now, as exhibited in the three-month uptime graph below. When these outages occur, the service and web UI stop sending metrics and responding to requests. However, the machine still sends its own metrics, showing a single CPU core pinned to 100% utilisation (of which ~75% is user-thread and ~25% is system-thread), implying that the service is spinning on something.

When I run `opam-repo-ci-service` locally, I see lots of warnings logged. There seems to be one for each OCluster job that runs to completion. This reference leaking may be connected to the unreliability issues. After a certain point, I am also unable to Ctrl-C the program and have to force-kill it.

These issues are not seen in `opam-repo-ci-local`. The main difference between the two is that `-local` uses local Docker containers to run jobs, while `-service` uses a connection to an external cluster scheduler using OCluster. Thus, it is likely that the issue arises here.
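To make the "one warning per completed job" symptom concrete, here is a minimal Python sketch of the general failure shape: a refcounted handle that each job increments but never decrements, so per-job state only accumulates. This illustrates the pattern only; it is not opam-repo-ci's actual code, and all names are invented.

```python
class Capability:
    """Hypothetical refcounted handle, standing in for the kind of
    reference the cluster-scheduler connection hands to each job."""

    def __init__(self):
        self.refs = 0
        self.warnings = []

    def inc_ref(self):
        self.refs += 1

    def dec_ref(self):
        self.refs -= 1

    def job_finished(self, job_id, released):
        # A well-behaved job releases its reference on completion;
        # a leaky one doesn't, and the runtime logs a warning instead.
        if released:
            self.dec_ref()
        else:
            self.warnings.append(f"job {job_id}: reference leaked")


cap = Capability()
for job_id in range(100):
    cap.inc_ref()                             # each job takes a reference
    cap.job_finished(job_id, released=False)  # ...but never releases it

# One warning per completed job, and the refcount never returns to zero,
# so state grows without bound as jobs run.
print(len(cap.warnings), cap.refs)  # 100 100
```

In a refcounted RPC setting the usual fix is a with-style wrapper that guarantees the release on every code path, which is consistent with the leak showing up once per OCluster job at the service–scheduler boundary.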