ocurrent / opam-repo-ci

An OCurrent pipeline for testing submissions to opam-repository
Apache License 2.0

Service unreliability #293

Open · benmandrew opened this issue 5 months ago

benmandrew commented 5 months ago

The opam-repo-ci service has had random and inconsistent downtime for a long time now, as exhibited in the three-month uptime graph below.

[Screenshot: three-month uptime graph, 2024-04-18]

When these outages occur, the service and web-ui stop sending metrics and stop responding to requests. However, the machine itself still reports its own metrics, showing a single CPU core pinned at 100% utilisation (roughly 75% user time and 25% system time), which suggests the service is spinning on something.

When I run opam-repo-ci-service locally, I see many warnings like the one below, apparently one for each OCluster job that runs to completion. This reference leak may be connected to the unreliability issues. After a certain point I am also unable to Ctrl-C the program and have to force-kill it.

capnp-rpc [WARNING] Reference GC'd with rc=1!
                    switchable(4394) (unset, rc=1) -> far-ref(4393) -> i544
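
That warning generally means a capability was dropped and later collected by the GC while its reference count was still non-zero, i.e. nothing ever called dec_ref on it. As a point of comparison only, here is a minimal sketch of the release pattern with capnp-rpc-lwt; `submit` and `wait` are hypothetical stand-ins for whatever returns and consumes the OCluster job capability, not opam-repo-ci's actual code:

```ocaml
open Capnp_rpc_lwt

(* Sketch only: [submit] and [wait] are hypothetical. The point is that the
   job capability is released exactly once, even on failure, instead of
   being dropped and later GC'd with rc=1. *)
let run_to_completion ~submit ~wait =
  let job = submit () in
  Lwt.finalize
    (fun () -> wait job)
    (fun () -> Capability.dec_ref job; Lwt.return_unit)

(* Equivalent, letting the library helper manage the refcount: *)
let run_to_completion' ~submit ~wait =
  Capability.with_ref (submit ()) wait
```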

These issues are not seen in opam-repo-ci-local. The main difference between the two is that -local uses local Docker containers to run jobs, while -service uses a connection to an external cluster scheduler using OCluster. Thus, it is likely that the issue arises here.

talex5 commented 5 months ago

> These issues are not seen in opam-repo-ci-local. The main difference between the two is that -local uses local Docker containers to run jobs, while -service uses a connection to an external cluster scheduler using OCluster. Thus, it is likely that the issue arises here.

opam-repo-ci-local used to work with a cluster. It looks like you removed that feature in #258.

The PR comment said it required access to a cluster to run jobs. This is true; however, the "cluster" can be the local machine (the docker-compose.yml shows how to do this).

Anyway, if you have access to the machine running the service, running strace on it would be useful to see what it's doing.

shonfeder commented 5 months ago

We talked about this in our team meeting today. I'll do my best to summarize from my notes, but it may need correction, as I am still onboarding and lacking a lot of context:

@mtelvers noted that he reset the SQLite database late on the 17th, and @benmandrew pointed out that this roughly corresponds with the recovery and stabilization of the service since the 18th. Mark's hypothesis was that the SQLite database grows over time and that, past a certain size, some queries become expensive enough to make the service unresponsive.

We discussed a couple of ways to test this hypothesis and, if it is correct, to ameliorate the problem:

  1. Add logging before the sqlite queries are executed, so we can detect whether a particular query precedes unresponsive episodes (see the sketch after this list).
  2. Analyze the queries manually, in case we can spot any expensive, suspect joins
    • If found, consider optimizing these queries
  3. Add a maintenance process to periodically prune the database.
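
For (1), a rough sketch of the kind of query logging meant, assuming the service uses the `sqlite3` and `logs` opam packages (the wrapper and log source names are hypothetical):

```ocaml
(* Hypothetical sketch for (1): log each statement and how long it took, so a
   slow query that precedes an unresponsive episode shows up in the logs. *)
let src = Logs.Src.create "opam-repo-ci.db" ~doc:"Database query timing"
module Log = (val Logs.src_log src : Logs.LOG)

let exec_logged db sql =
  Log.info (fun f -> f "executing: %s" sql);
  let t0 = Unix.gettimeofday () in
  let rc = Sqlite3.exec db sql in
  Log.info (fun f ->
      f "finished (%s) in %.3fs" (Sqlite3.Rc.to_string rc)
        (Unix.gettimeofday () -. t0));
  rc
```

For (2), SQLite's `EXPLAIN QUERY PLAN` can show whether a suspect join is doing a full table scan.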

In the discussion I think we agreed that (3) was probably worth doing in any case, since we won't care about index entries past some cutoff (3/6/12 months?), so reducing the size of the tables beyond that cutoff should help with performance and memory needs with no real downside.
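
For reference, the pruning pass in (3) could be as simple as something like the following sketch; the table and column names here are invented for illustration, the real index schema will differ:

```ocaml
(* Hypothetical sketch for (3): drop index rows older than a cutoff.
   [job_index] and [finished_at] are invented names. *)
let prune db ~days =
  let sql =
    Printf.sprintf
      "DELETE FROM job_index WHERE finished_at < datetime('now', '-%d days')"
      days
  in
  match Sqlite3.exec db sql with
  | Sqlite3.Rc.OK -> ()
  | rc -> Logs.warn (fun f -> f "prune failed: %s" (Sqlite3.Rc.to_string rc))
```

Running `VACUUM` afterwards would reclaim the freed pages if on-disk size also matters.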

shonfeder commented 4 months ago

Looks like this has been resolved: there has been no recurrence of the significant service degradation since the actions taken here:

[Screenshot: uptime graph, 2024-05-15]

I'll open an issue to track the DB pruning maintenance improvement, and close this. We can reopen it if the behavior recurs.

shonfeder commented 2 months ago

It looks like we had another period of degraded service for several hours today: https://status.ocaml.ci.dev/d/i1DyPsAGz/opam-repo-ci?orgId=1&refresh=5s&from=1720192947814&to=1720214547814

It could help to note any coincident activity that might have set this off.

punchagan commented 2 months ago

Looks like we've been having more degraded service again...

[Screenshot: uptime graph showing degraded service]

https://status.ocaml.ci.dev/d/i1DyPsAGz/opam-repo-ci?orgId=1&refresh=5s&from=1720409462001&to=1720431062001

mtelvers commented 2 months ago

The background rebuild of the RAID disk could be a contributing factor at the moment.