longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

Mariadb readonly error once a day #3598

Open Jonaswinz opened 2 years ago

Jonaswinz commented 2 years ago

Hi, my WordPress instances are hitting a "Could not connect to the database" error once a day. They are all connected to one MariaDB instance. When checking the MariaDB logs, I always see a "readonly error". After restarting the pods, everything is fine again for around one day.

I searched a lot about Longhorn volumes becoming read-only, but I didn't find anything helpful. After scanning through the logs, I noticed a lot of timeout and connection-lost messages, but these happen more often than the "readonly error". As discussed in other issues, I think my network is sufficient with 500 Mbit/s and sub-1 ms ping between the nodes.

Strangely, I have to scale the MariaDB StatefulSet down and up to resolve the error. The automatic restarts of the pods don't seem to solve the problem on their own.
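For reference, the manual workaround I do today is roughly the following (a sketch driving kubectl from Python; the StatefulSet name mariadb and the default namespace are just placeholders from my setup):

```python
import subprocess
import time

NAMESPACE = "default"     # placeholder namespace
STATEFULSET = "mariadb"   # placeholder StatefulSet name

def kubectl(*args):
    """Run a kubectl command in the target namespace and fail loudly."""
    subprocess.run(["kubectl", "-n", NAMESPACE, *args], check=True)

# Scale the StatefulSet to zero so the pod is fully recreated,
# then bring it back up again.
kubectl("scale", f"statefulset/{STATEFULSET}", "--replicas=0")
time.sleep(30)  # give the pod time to terminate and the volume to detach
kubectl("scale", f"statefulset/{STATEFULSET}", "--replicas=1")
```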

Please share some thoughts.

Log or Support bundle

Here are the engine (e) and replica (r) manager logs of the managers running on the same node as the MariaDB pod. I got the "readonly error" around 2022-02-08T10:13 - 2022-02-08T10:20.

manager e.txt manager r.txt

There is also the event logs of the volume:

events.txt

And also the support bundle:

https://drive.google.com/file/d/1OE-hMN7sa_R6qMIgHKRZ0ziT4H-34QJQ/view?usp=sharing

Environment

derekbit commented 2 years ago

Was the disk or network busy while the volume became read-only? From the longhorn-manager-e log, the amount of data transferred was small. It looks like the network connections between the engine and the replicas were not stable and were cut off sometimes.

Jonaswinz commented 2 years ago

I am trying to collect data about the disk and network load during the next failure. As I understand the logs, Longhorn recovers by itself after a network interruption, and only sometimes do I get the read-only problem. So what are the exact circumstances for the read-only problem? Does it happen when the workload pod is trying to use the volume while it is recovering?

Moreover, is it really a Longhorn problem, or a problem with the workload pod? I guess the remount went through successfully, but the workload pod unfortunately used the volume at the wrong time and fell into read-only mode. But on the other hand, the automatic restarts do not fix the problem; only a complete recreation of the pod does.

I don't know how to fix the issue. Is there some way to prevent it, or to recover from it automatically?

derekbit commented 2 years ago

In most cases, Longhorn can rebuild and recover after a replica disconnection. However, once the filesystem goes into read-only mode, Longhorn cannot automatically convert the volume back into read-write mode. The user must currently remount it with rw manually.

But on the other hand, the automatic restarts do not fix the problem; only a complete recreation of the pod does.

Do you mean remounting the volume inside the pod still can't fix it?
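For the manual remount, a rough sketch of doing it from outside the cluster looks like this (the pod name mariadb-0, the namespace, and the mount path /var/lib/mysql are assumptions about your setup, and the container needs enough privileges for mount to succeed):

```python
import subprocess

POD = "mariadb-0"              # assumed pod name
NAMESPACE = "default"          # assumed namespace
MOUNT_PATH = "/var/lib/mysql"  # assumed mount path of the Longhorn volume

# Remount the filesystem read-write from inside the workload pod.
subprocess.run(
    ["kubectl", "-n", NAMESPACE, "exec", POD, "--",
     "mount", "-o", "remount,rw", MOUNT_PATH],
    check=True,
)
```

If the container is not privileged enough for mount, recreating the pod so the volume is freshly mounted is the alternative.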

shuo-wu commented 2 years ago

Longhorn will reattach the volume automatically after the crash, but it cannot directly remount the volume. Therefore, Longhorn introduces a setting for this case. You can check this doc for details: https://longhorn.io/docs/1.2.3/high-availability/recover-volume/
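If you want to verify the setting from the cluster, it is stored as a Longhorn Setting custom resource; a small sketch to read it (the internal setting name below is an assumption on my side, please double-check it against the doc above):

```python
import subprocess

# Internal name of the "automatically delete workload pod when the volume
# is detached unexpectedly" setting (assumed; verify for your Longhorn version).
SETTING = "auto-delete-pod-when-volume-detached-unexpectedly"

# Read the current value from the Longhorn Setting CRD.
result = subprocess.run(
    ["kubectl", "-n", "longhorn-system", "get", "settings.longhorn.io",
     SETTING, "-o", "jsonpath={.value}"],
    check=True, capture_output=True, text=True,
)
print(f"{SETTING} = {result.stdout}")
```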

Jonaswinz commented 2 years ago

This feature is already enabled, and the workload pods (MariaDB) are managed by a StatefulSet, so it should work, but it does not.

derekbit commented 2 years ago

This feature is already enabled, and the workload pods (MariaDB) are managed by a StatefulSet, so it should work, but it does not.

Similar issue https://github.com/longhorn/longhorn/issues/3325#issuecomment-989562860

B1ue-W01f commented 2 years ago

I'm getting this same issue with postgres-operator clusters and SQLite databases. Longhorn isn't reporting any issues, but files end up read-only.
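In case it helps anyone, I check whether a volume has flipped to read-only with a small script like this, run inside the affected pod (the data path is just an example from my setup):

```python
import tempfile

MOUNT_PATH = "/var/lib/postgresql/data"  # example data path of the affected volume

def is_readonly(path):
    """Return True if the filesystem mounted at `path` rejects writes."""
    # Check the mount flags first.
    with open("/proc/mounts") as mounts:
        for line in mounts:
            _, mountpoint, _, options = line.split()[:4]
            if mountpoint == path and "ro" in options.split(","):
                return True
    # Fall back to an actual write attempt.
    try:
        with tempfile.TemporaryFile(dir=path):
            return False
    except OSError:
        return True

print("read-only:", is_readonly(MOUNT_PATH))
```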

shuo-wu commented 2 years ago

@B1ue-W01f Does the volume get re-attached (automatically) before encountering the issue? If YES, you can wait for the investigation in #3325.