openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

mayastor segfaults (double free or corruption) #1142

Closed Trolldemorted closed 2 years ago

Trolldemorted commented 2 years ago

Describe the bug
One of our mayastor containers crashed:

[2022-06-03T11:50:50.449746241+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:921] [nqn.2019-05.io.openebs:d9cb25e4-7985-46a8-9991-2b62868e756a] in failed state.
[2022-06-03T11:50:50.449750028+00:00 ERROR mayastor::spdk:nvme_fabric.c:119] Property Get failed
[2022-06-03T11:50:50.449753454+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:3071] [nqn.2019-05.io.openebs:d9cb25e4-7985-46a8-9991-2b62868e756a] Failed to read CC and CSTS in state 3
[2022-06-03T11:50:50.449756641+00:00 ERROR mayastor::spdk:nvme.c:710] Failed to initialize SSD: 172.19.61.23
[2022-06-03T11:50:50.449763053+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:949] [nqn.2019-05.io.openebs:d9cb25e4-7985-46a8-9991-2b62868e756a] get_cc() failed
[2022-06-03T11:50:50.449813840+00:00  INFO mayastor::subsys::nvmf::subsystem:subsystem.rs:222] added NS ID 1
double free or corruption (!prev)

To Reproduce
n/a. I'd rather never reproduce this ever again.

Expected behavior
mayastor should not segfault.

OS info: not provided.

Trolldemorted commented 2 years ago

As a side effect, the PVCs were remounted read-only.

dmesg output:

[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 1052672 starting block 2425072)
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424832
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424833
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424834
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424835
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424836
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424837
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424838
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424839
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424840
[Fri Jun  3 11:55:31 2022] Buffer I/O error on device nvme5n1, logical block 2424841
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 2101248 starting block 2425328)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 3149824 starting block 2425584)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 4198400 starting block 2425840)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 5246976 starting block 2426096)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 6295552 starting block 2426352)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 7344128 starting block 2426608)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 0 size 8388608 starting block 2426864)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 8388608 size 1052672 starting block 2427120)
[Fri Jun  3 11:55:31 2022] EXT4-fs warning (device nvme5n1): ext4_end_bio:309: I/O error 10 writing to inode 18 (offset 8388608 size 2101248 starting block 2427376)
[Fri Jun  3 11:55:34 2022] Aborting journal on device nvme5n1-8.
[Fri Jun  3 11:55:34 2022] Buffer I/O error on dev nvme5n1, logical block 3702784, lost sync page write
[Fri Jun  3 11:55:34 2022] JBD2: Error -5 detected when updating journal superblock for nvme5n1-8.
[Fri Jun  3 11:55:34 2022] Buffer I/O error on dev nvme5n1, logical block 0, lost sync page write
[Fri Jun  3 11:55:34 2022] EXT4-fs (nvme5n1): I/O error while writing superblock
[Fri Jun  3 11:55:34 2022] EXT4-fs error (device nvme5n1): ext4_journal_check_start:61: Detected aborted journal
[Fri Jun  3 11:55:34 2022] EXT4-fs (nvme5n1): Remounting filesystem read-only
[Fri Jun  3 11:58:26 2022] sr 8:0:0:0: [sr0] CDROM not ready.  Make sure there is a disc in the drive.
[Fri Jun  3 11:58:26 2022] sr 8:0:0:0: [sr0] CDROM not ready.  Make sure there is a disc in the drive.
[Fri Jun  3 11:58:26 2022] nvme_ns_head_make_request: 674 callbacks suppressed
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 4, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 5, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 6, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 7, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 4, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 5, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 6, async page read
[Fri Jun  3 11:58:26 2022] block nvme4n1: no available path - failing I/O
[Fri Jun  3 11:58:26 2022] Buffer I/O error on dev nvme4n1, logical block 7, async page read
Trolldemorted commented 2 years ago

Is there anything we can do to verify the integrity of our cluster? Can we just remount the PVCs as writable? Should that happen automatically?

tiagolobocastro commented 2 years ago

I can't quite remember what 0.8.1 used to do, but IIRC when the volume target is recreated it will initially contain only 1 of the previously healthy replicas - this is because, in case of a crash, we don't know whether there were any writes in progress. From that 1 replica we then rebuild the other ones (if the volume was configured with replicas > 1).

At the moment, however, we do not handle remounting the volume as writable automatically.
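(As a rough way to watch that rebuild from the outside, the MayastorVolume CRs on a 0.8.x install expose the volume and child state; a minimal sketch, assuming the standard msv short name and the mayastor namespace:)

kubectl -n mayastor get msv                        # volume state and replica count per volume
kubectl -n mayastor get msv <volume-uuid> -o yaml  # nexus children and their state, e.g. while a rebuild is running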

Trolldemorted commented 2 years ago

Can I find out whether the cluster is in a healthy state again? And is there documentation about how to update/restart mayastor nodes without the cluster rebuilding replicas?

lukasmetzner commented 2 years ago

We tried to remount our volumes with:

mount -o remount,rw /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5d8925a7-ca3b-49d5-95b2-973c38c74a8e/globalmount
mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5d8925a7-ca3b-49d5-95b2-973c38c74a8e/globalmount: cannot remount /dev/nvme3n1 read-write, is write-protected.

This did not work. How do we correctly remount our volumes?
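(One quick check, as a sketch: see whether the kernel itself has marked the block device read-only, using the device name from the error above:)

blockdev --getro /dev/nvme3n1   # prints 1 if the kernel holds the device read-only
cat /sys/block/nvme3n1/ro       # the same flag via sysfs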

lukasmetzner commented 2 years ago

[2022-06-07T07:52:50.030479727+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:3523] [nqn.2019-05.io.openebs:d9cb25e4-7985-46a8-9991-2b62868e756a] Submitting Keep Alive failed

Additional information from our mayastor pod on the target node: we get this message multiple times per minute.

tiagolobocastro commented 2 years ago

Can I find out whether the cluster is in a healthy state again? And is there documentation about how to update/restart mayastor nodes without the cluster rebuilding replicas?

This is something we're looking into. For the moment, the workaround to restart mayastor nodes without rebuilding replicas is to bring down the applications first, then restart the mayastor nodes, then bring the applications back up.
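(A rough sketch of that sequence with kubectl; the application name and the mayastor pod label below are illustrative and depend on your deployment:)

kubectl scale deployment my-app --replicas=0             # stop the applications using the volumes
kubectl -n mayastor delete pod -l app=mayastor           # restart the mayastor data-plane pods (the DaemonSet recreates them)
kubectl -n mayastor rollout status daemonset/mayastor    # wait until they are back
kubectl scale deployment my-app --replicas=1             # bring the applications back up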

We tried to remount our volumes with:

mount -o remount,rw /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5d8925a7-ca3b-49d5-95b2-973c38c74a8e/globalmount
mount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5d8925a7-ca3b-49d5-95b2-973c38c74a8e/globalmount: cannot remount /dev/nvme3n1 read-write, is write-protected.

This did not work. How do we correctly remount our volumes?

Hmm, I'm not sure then. Is the device connected properly? Could you run nvme list?

[2022-06-07T07:52:50.030479727+00:00 ERROR mayastor::spdk:nvme_ctrlr.c:3523] [nqn.2019-05.io.openebs:d9cb25e4-7985-46a8-9991-2b62868e756a] Submitting Keep Alive failed

Additional information from our mayastor pod on the target node: we get this message multiple times per minute.

Ah, someone else saw this a few days ago. Unfortunately, when it happened their mayastor was in a bad state and they had to restart it. Is that mayastor instance still responsive?

lukasmetzner commented 2 years ago

sudo nvme list output:

node1:

/dev/nvme2n1: failed to obtain nvme info: Operation not permitted
/dev/nvme3n1: failed to obtain nvme info: Operation not permitted
/dev/nvme4n1: failed to obtain nvme info: Operation not permitted
/dev/nvme5n1: failed to obtain nvme info: Operation not permitted
/dev/nvme6n1: failed to obtain nvme info: Operation not permitted
/dev/nvme7n1: failed to obtain nvme info: Operation not permitted
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     REDACTED       SAMSUNG MZWLJ3T8HBLS-00007               1           3.84  TB /   3.84  TB    512   B +  0 B   REDACTED
/dev/nvme1n1     REDACTED       SAMSUNG MZWLJ3T8HBLS-00007               1           3.84  TB /   3.84  TB    512   B +  0 B   REDACTED

node2:

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     REDACTED       SAMSUNG MZWLJ3T8HBLS-00007               1           3.84  TB /   3.84  TB    512   B +  0 B   REDACTED
/dev/nvme1n1     REDACTED       SAMSUNG MZWLJ3T8HBLS-00007               1           3.84  TB /   3.84  TB    512   B +  0 B   REDACTED

node3:

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     REDACTED       SAMSUNG MZWLJ3T8HBLS-00007               1           3.84  TB /   3.84  TB    512   B +  0 B   REDACTED
/dev/nvme1n1     REDACTED       SAMSUNG MZVLB1T0HBLR-00000               1         315.80  GB /   1.02  TB    512   B +  0 B   REDACTED
/dev/nvme2n1     REDACTED      Mayastor NVMe controller                 1          10.73  GB /  10.73  GB    512   B +  0 B   REDACTED

Currently I am observing that node1 is repeatedly destroying its Nexus and rebuilding all children.

lukasmetzner commented 2 years ago
apiVersion: "openebs.io/v1alpha1"
kind: MayastorPool
metadata:
  name: node1-mayastor-pool
  namespace: mayastor
spec:
  node: node1
  disks: ["/dev/nvme1n1"]

---

apiVersion: "openebs.io/v1alpha1"
kind: MayastorPool
metadata:
  name: node2-mayastor-pool
  namespace: mayastor
spec:
  node: node2
  disks: ["/dev/nvme1n1"]

---

apiVersion: "openebs.io/v1alpha1"
kind: MayastorPool
metadata:
  name: node3-mayastor-pool
  namespace: mayastor
spec:
  node: node3
  disks: ["/dev/nvme0n1"]

Additional Mayastor configuration: our MayastorPool definitions.
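(For reference, pool health on this control plane can usually be inspected through the MayastorPool CRs; a minimal sketch, assuming the standard msp short name:)

kubectl -n mayastor get msp                            # pool state and capacity per node
kubectl -n mayastor describe msp node1-mayastor-pool   # details if a pool reports an error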

lukasmetzner commented 2 years ago

After restarting all mayastor pods, with the exception of the etcd pods, I could get one of my applications running again. I did not restart the etcd pods because we are (sorry for that) still running the in-memory etcd version. What would be the consequences of also restarting those pods? Although my application is running now, I can still see in the mayastor logs on node1 that the nexus is repeatedly destroyed and all children are rebuilt. So far this seems to have no effect on the test application, which is deployed on node1.

Trolldemorted commented 2 years ago

We have thrown away our old installation, and installed a fresh mayastor 1.0.1 - thanks for your help though!

tiagolobocastro commented 2 years ago

@Trolldemorted btw, there's a release-1.0.2 branch which has not been released yet as we're still testing, but it has quite a few bug fixes. We should have a "proper" release soon, but I can't confirm a date yet. It should be usable now if you want to try it; all images are on Docker Hub, etc.

tiagolobocastro commented 2 years ago

After restarting all mayastor pods, with the exception of the etcd pods, I could get one of my applications running again. I did not restart the etcd pods because we are (sorry for that) still running the in-memory etcd version. What would be the consequences of also restarting those pods? Although my application is running now, I can still see in the mayastor logs on node1 that the nexus is repeatedly destroyed and all children are rebuilt. So far this seems to have no effect on the test application, which is deployed on node1.

Ah, if you're using an in-memory etcd and it gets restarted, we'll lose all configuration data, including a very crucial piece of information: where the healthy replicas are. If you want to get around this, you could stop your applications, bring down the control plane, copy the etcd data, redeploy etcd (using the "manual" storageclass for example, which although not ideal is better than RAM), write the config back to etcd, then bring up the control plane and your apps.
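(A hedged sketch of the "copy the etcd data" step using etcdctl snapshots; the pod name and namespace below are assumptions and depend on your install:)

kubectl -n mayastor exec mayastor-etcd-0 -- sh -c 'ETCDCTL_API=3 etcdctl snapshot save /tmp/mayastor-etcd.db'   # snapshot the in-memory etcd while it is still running
kubectl -n mayastor cp mayastor-etcd-0:/tmp/mayastor-etcd.db ./mayastor-etcd.db                                 # copy the snapshot off the pod
# after redeploying etcd on persistent storage, seed its data directory from the snapshot
# with `etcdctl snapshot restore` before starting it, then bring the control plane back up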

Trolldemorted commented 2 years ago

@Trolldemorted btw, there's a release-1.0.2 branch which has not been released yet as we're still testing, but it has quite a few bug fixes. We should have a "proper" release soon, but I can't confirm a date yet. It should be usable now if you want to try it; all images are on Docker Hub, etc.

Thanks for the heads-up. I think we'll wait for the final 1.0.2 release, otherwise our users might get grumpy - but we are eager to roll out the update once it is finished and the documentation tells us how to upgrade a cluster!

Ah, if you're using an in-memory etcd and it gets restarted, we'll lose all configuration data, including a very crucial piece of information: where the healthy replicas are. If you want to get around this, you could stop your applications, bring down the control plane, copy the etcd data, redeploy etcd (using the "manual" storageclass for example, which although not ideal is better than RAM), write the config back to etcd, then bring up the control plane and your apps.

Thanks again for the details. We went the simple "back up the data and refill your PVCs" way, which also worked smoothly. The new etcd is running with a manual storage class just like in your example here, so we are in significantly better shape now than before :)