Investigating/improving cluster resilience (ReadOnlyFileSystem)

mwinokan commented 2 months ago

A lot of stack resilience problems are caused by database/redis pods not being able to access volumes (ReadOnlyFileSystem errors).

ReadOnlyFileSystem

@alanbchristie: Possible Kubernetes or longhorn issue. Equally complicated services using Amazon don't seem to show the same problems indicating it is infrastructure networking/hardware or longhorn causing the issues. Alan says that the longhorn logging may reveal what exactly in the volume provisioning is the root cause.

Possible aid: prometheus would provide a dashboard to track these issues. @tdudgeon says it's 30 minutes per cluster to deploy prometheus. (Analysing the data will be the real time sink). One risk is that prometheus may put extra load on the networking

@tdudgeon says that the longhorn version is outdated which could be the issue.

@phraenquex says to focus on the diagnostics

mwinokan commented 1 month ago

@alanbchristie seems to have spotted a pattern for the ReadOnlyFileSystem errors, for the last five weeks there has been an outage every Sunday night.

It seems that at the moment, every Monday morning the database and redis pods might need bouncing. The root cause is still unknown

alanbchristie commented 1 month ago

it is clear that it is happening on Sunday night (at or around 0:00). There is a process called fstrim that "tinkers" with the filesystem at exactly that time every week. Also - we did see critical filesystem errors on one node that we think will have lead to the reboot of a node.

We will raise a ticket with STFC to see if that indicates something at the hardware level.

mwinokan commented 1 month ago

To confirm whether fstrim is causing the ReadOnlyFileSystem @tdudgeon / @alanbchristie to run it actively and see if the error is reproducible outside of the Sunday night schedule.

alanbchristie commented 1 month ago

It is clear that the issue occurs around midnight on Sunday.

Last week the DEV small-y1 node encountered critical file-system issues causing (we believe) the node to reboot. This week so did small-y3. So it's not a particular node but it is always appears to happen shortly after midnight on Sundays.

We have set the suspicious fstrim utility to run daily now to see if anything happens tonight or any night. If it turns out to be a "cause" we will consult STFC and seriously consider disabling the service.

To change fstrim to run daily, on each node: -

Edit /lib/systemd/system/fstrim.timer and change OnCalendar to OnCalendar=daily
Restart the fstrim service with systemctl reload-or-restart fstrim

This has been done on ALL DEV cluster nodes: -

large-y1
medium-y1
small-y1
small-y3 (!)
small-y4
small-y5
ctrl-y2
ctrl-y3
etcd-y1
etcd-y3
etcd-y5

alanbchristie commented 1 month ago

The DEV cluster deployments have been adjusted to reduce the number of longhorn volumes. We started the day with 35 and now just have 12 (a 65% reduction). Some volumes have been removed completely, others have moved to nfs.

This should reduce the replication stress on the longhorn drivers - and is another attempt to improve the cluster resilience (on the assumption that there’s something going on with longhorn or the cluster’s handling of them).

alanbchristie commented 4 weeks ago

We observed ROFS on the PROD cluster this week (10th June) but not on the DEV cluster. It's early days, but this might indicate that the weekly fstrim may have something to do with the cluster instability.

STFCCLOUD-6079 has been opened to track related SSD critical issues

mwinokan commented 2 weeks ago

A recent dev cluster outage suggests that fstrim may not be the cause of these problems. @alanbchristie & @tdudgeon are continuing to investigate.

alanbchristie commented 1 week ago

The issue is now being investigated by STFC after we gave them a detailed summary of our findings. Essentially: -

It always happens overnight on Sundays
Appears to be caused by hardware issues (SCSI) on l6.small nodes
Appears to have nothing to do with fstrim

m2ms / fragalysis-frontend

Investigating/improving cluster resilience (ReadOnlyFileSystem) #1429

ReadOnlyFileSystem