intel / pmem-csi

Persistent Memory Container Storage Interface Driver

shred of mounted volume corrupts ext4 file system, showing zettabyte sizes #317

Closed by pohly 5 years ago

pohly commented 5 years ago

The new disruptive test in PR #196 causes LVM on real hardware to become confused and report a volume with a zettabyte-scale size (for details see https://github.com/intel/pmem-CSI/pull/196#issuecomment-505846744):

/dev/dm-1                 53Z   53Z     0 100% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-target.fIHglA/target
okartau commented 5 years ago

Change in story: the node reboot never happened. Rebooting is supposed to happen via the _work/CLUSTER/ssh helper, but that helper does not exist on a HW-based system, so the reboot gets skipped. Yet the file system gets corrupted without any reboot at all. With one HW-pmem node in the cluster running e2e, this mount remains after the test:

Filesystem               Size  Used Avail Use% Mounted on
/dev/dm-1                 62Z   62Z     0 100% /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-target.mAKbPI/target

In dmesg:

[Wed Jul  3 14:43:32 2019] attempt to access beyond end of device
[Wed Jul  3 14:43:32 2019] dm-1: rw=2049, want=6053473084835904072, limit=49152
[Wed Jul  3 14:43:37 2019] EXT4-fs error (device dm-1): ext4_map_blocks:592: inode #8: block 11: comm jbd2/dm-1-8: lblock 1 mapped to illegal pblock 11 (length 1)
[Wed Jul  3 14:43:37 2019] jbd2_journal_bmap: journal block not found at offset 1 on dm-1-8
[Wed Jul  3 14:43:37 2019] Aborting journal on device dm-1-8.

lvs shows that device as a leftover:

# lvs
  LV                                   VG                 Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root                                 fedora             -wi-ao---- <930.19g                                                    
  e916406b-9d87-11e9-8078-7a34dbfd3c91 ndbus0region0fsdax -wi-ao----   24.00m                                                    

In the e2e output, there is an error logged about lvremove failing on the same device:

pmem-csi-node-vqwmw/pmem-driver: I0703 11:44:27.361841       1 glog.go:58] Executing: lvremove -fy /dev/ndbus0region0fsdax/e916406b-9d87-11e9-8078-7a34dbfd3c91
pmem-csi-node-vqwmw/pmem-driver: I0703 11:44:27.430008       1 glog.go:58] Output:   Logical volume ndbus0region0fsdax/e916406b-9d87-11e9-8078-7a34dbfd3c91 contains a filesystem in use.
pmem-csi-node-vqwmw/pmem-driver: E0703 11:44:27.430042       1 glog.go:119] GRPC error: rpc error: code = Internal desc = Failed to delete volume: exit status 5
pmem-csi-controller-0/pmem-driver: W0703 11:44:27.430362       1 glog.go:99] Failed to delete volume e916406b-9d87-11e9-8078-7a34dbfd3c91: rpc error: code = Internal desc = Failed to delete volume: exit  status 5
okartau commented 5 years ago

As no reboot is involved, and the size is consistently about 62 zettabytes (seen in multiple instances), could this be ext4 internal value(s) being interpreted differently on the host side (Fedora 30) and inside the container?

okartau commented 5 years ago

Zettabyte mystery explained, no dark magic. This is what happens: the umount call in the container returns, but the device remains in use and busy for some time, at least in the host context. This could be caused by a delay in the LV layer or the container-host layer (docker?), and is similar to the creation-to-device propagation delay we dealt with earlier. Our assumption in the container is different: after an OK status from the umount call, the code immediately continues with shred, then lvremove.

shred works and fills the volume (which remained mounted on the host) with garbage; shred does not fail, as that is not an error for it. The shred causes the zettabyte effect we see. I guess the change of contents to garbage also makes the pending umount on the host fail, as we see the volume remains mounted with an insane size on the host. The zettabyte size is always similar: this (I guess) is because shred does not write truly random data but uses a pattern which always lands the same garbage in the same positions(?). The lvremove following the shred then fails with: Logical volume ndbus0region0fsdax/e916406b-9d87-11e9-8078-7a34dbfd3c91 contains a filesystem in use.

To deal with the issue, I plan to add a check after the umount call that verifies whether the device is still mounted, with a short retry-wait loop (similar to the WaitForDevice loop that was added for similar reasons in the create path). A rough sketch of the idea follows.
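A minimal Go sketch of that retry-wait idea, assuming a check against /proc/self/mounts; the names isMounted and waitForUnmount and the staging path are illustrative, not the actual pmem-csi code:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// isMounted reports whether targetPath still appears as a mount point in
// /proc/self/mounts (mount-point escaping such as \040 is ignored here).
func isMounted(targetPath string) (bool, error) {
	f, err := os.Open("/proc/self/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == targetPath {
			return true, nil
		}
	}
	return false, scanner.Err()
}

// waitForUnmount polls until the mount point is gone or retries are exhausted.
func waitForUnmount(targetPath string, retries int, delay time.Duration) error {
	for i := 0; i < retries; i++ {
		mounted, err := isMounted(targetPath)
		if err != nil {
			return err
		}
		if !mounted {
			return nil
		}
		fmt.Printf("after umount[%d]: still seeing mount point %s, must wait\n", i, targetPath)
		time.Sleep(delay)
	}
	return fmt.Errorf("%s still mounted after %d retries", targetPath, retries)
}

func main() {
	// Illustrative staging path; in the driver this would come from the
	// NodeUnstageVolume request.
	path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.example"
	if err := waitForUnmount(path, 10, 100*time.Millisecond); err != nil {
		fmt.Println(err)
	}
}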

This issue needs to be renamed, as the real cause turned out to be different from what the original title says.

okartau commented 5 years ago

While debugging this further, there is another interesting finding. I added a "wait until really unmounted" loop after umount in Unstage. It reports a few cases where a 100 or 200 ms wait is enough to see the device really unmounted. But it also shows that sometimes there are two parallel threads trying to unstage the exact same device and mount path. This is strange, as that block of code is protected by a mutex keyed on volumeID, which should prevent such parallel threads (see the locking sketch after this comment).
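For context, a minimal Go sketch of what "mutex by volumeID" means here, with illustrative names (VolumeLocks is not the actual pmem-csi type): locking is keyed on the volume ID, so two unstage calls with different volume IDs are not serialized against each other, even if they target the same device and mount path.

package volumelock

import "sync"

// VolumeLocks serializes operations per volume ID only.
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func New() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]*sync.Mutex)}
}

// Lock acquires the mutex for the given volume ID, creating it on first use.
func (v *VolumeLocks) Lock(volumeID string) {
	v.mu.Lock()
	l, ok := v.locks[volumeID]
	if !ok {
		l = &sync.Mutex{}
		v.locks[volumeID] = l
	}
	v.mu.Unlock()
	l.Lock()
}

// Unlock releases the mutex for the given volume ID.
func (v *VolumeLocks) Unlock(volumeID string) {
	v.mu.Lock()
	l := v.locks[volumeID]
	v.mu.Unlock()
	if l != nil {
		l.Unlock()
	}
}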

pohly commented 5 years ago

Good analysis!

To deal with the issue, I plan to add a check after the umount call that verifies whether the device is still mounted, with a short retry-wait loop

Did that really help? If the issue is with mount propagation (umount inside container -> unmount outside of container) and the check runs inside the container, how can it be sure that the unmount was also done outside?

okartau commented 5 years ago

Another (slightly horrifying) side finding is log entries like these:

pmem-csi-node-mgkrd/pmem-driver: NodeUnstageVolume: detected mountedDev: /dev/vda1
pmem-csi-node-mgkrd/pmem-driver: NodeUnstageVolume: detected mountedDev: /dev/vda1
pmem-csi-node-mgkrd/pmem-driver: NodeUnstageVolume[0]: after umount: Still seeing device /dev/vda1 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.fPoNjG, must wait

/dev/vda1 is the root file system of that node. The umount did not succeed, though.

okartau commented 5 years ago

Logging the triplet VolumeID, device, path on one line demonstrates how we can be inside the same mutex section twice: the VolumeID is different. That is problematic, because the remaining fields (device and path) are exactly the same. Note the [number], which is the loop counter, added to catch the double run.

VolumeID:8f546a58-9e2f-11e9-9c20-929faa4aedd2 after umount[0]: Still seeing device /dev/ndbus0region0fsdax/8f546a58-9e2f-11e9-9c20-929faa4aedd2 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.lKdlJB, must wait
VolumeID:8f546a58-9e2f-11e9-9c20-929faa4aedd2 after umount[1]: Still seeing device /dev/ndbus0region0fsdax/8f546a58-9e2f-11e9-9c20-929faa4aedd2 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.lKdlJB, must wait
VolumeID:8f546a58-9e2f-11e9-9c20-929faa4aedd2 after umount[2]: Still seeing device /dev/ndbus0region0fsdax/8f546a58-9e2f-11e9-9c20-929faa4aedd2 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.lKdlJB, must wait
VolumeID:8f6801d7-9e2f-11e9-9c20-929faa4aedd2 after umount[0]: Still seeing device /dev/ndbus0region0fsdax/8f546a58-9e2f-11e9-9c20-929faa4aedd2 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.lKdlJB, must wait
VolumeID:8f546a58-9e2f-11e9-9c20-929faa4aedd2 after umount[3]: Still seeing device /dev/ndbus0region0fsdax/8f546a58-9e2f-11e9-9c20-929faa4aedd2 for mount point /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.lKdlJB, must wait
pohly commented 5 years ago

How can different volumes (as identified by VolumeID) have the same device and path?

okartau commented 5 years ago

The sequence was: stage VolA, unstage VolA (which remains waiting to see the umount complete), then stage VolB on the same path (while VolA is still waiting and has not been unmounted).

So the problem is another stage arriving quickly at the same path before the previous volume has been unmounted. It also turns out I was testing a mixed setup: the driver code was from the head of devel, but the test code was from #323; it seems that test code makes this possible (is there a problem in #323 with reuse of the same path?). I will retry now with the test code from devel as well.

pohly commented 5 years ago

How can different volumes (as identified by VolumeID) have the same device and path?

Because the sanity stress test was broken and used the same paths in different workers. Fixed.

okartau commented 5 years ago

This pattern does not happen any more with the latest test code from #323.

okartau commented 5 years ago

This pattern does not happen any more with the latest test code from #323.

Too early to declare that. I already dropped my "loop after umount" patch as an ugly workaround, but I see that zettabyte volumes still appear on the HW-based system, which means shred is still happening on a mounted volume. To be rechecked.

okartau commented 5 years ago

#340 should avoid this issue.