intel / pmem-csi

Persistent Memory Container Storage Interface Driver
Apache License 2.0

LVM stress test failure #316

Closed avalluri closed 5 years ago

avalluri commented 5 years ago

When running the stress test (using #310) with 100 volumes and 10 workers, it failed with a timeout while creating the 97th volume:

• Failure [2292.297 seconds]
PMEM Volumes
/home/avalluri/work/go/src/github.com/intel/pmem-csi/test/e2e/storage/csi_volumes.go:55
  late binding
  /home/avalluri/work/go/src/github.com/intel/pmem-csi/test/e2e/storage/csi_volumes.go:104
    stress test [It]
    /home/avalluri/work/go/src/github.com/intel/pmem-csi/test/e2e/storage/csi_volumes.go:147

    Unexpected error:
        <*errors.errorString | 0xc000c86840>: {
            s: "Gave up after waiting 15m0s for pod \"pvc-volume-tester-reader-worker-5-volume-97-z2wxh\" to be \"success or failure\"",
        }
        Gave up after waiting 15m0s for pod "pvc-volume-tester-reader-worker-5-volume-97-z2wxh" to be "success or failure"
    occurred

    /home/avalluri/work/go/src/github.com/intel/pmem-csi/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites/provisioning.go:589

The driver logs show that, for some reason, the driver is not able to access the volume device file:

pmem-csi-node-2n8fm/pmem-driver: I0626 15:11:51.767476       1 glog.go:58] Executing: file -bsL /dev/ndbus0region0fsdax/e0ae1591-9822-11e9-a57c-a6a84145e48a
pmem-csi-node-2n8fm/pmem-driver: I0626 15:11:51.770971       1 glog.go:58] Output: cannot open `/dev/ndbus0region0fsdax/e0ae1591-9822-11e9-a57c-a6a84145e48a' (No such file or directory)
pmem-csi-node-2n8fm/pmem-driver: I0626 15:11:51.770992       1 glog.go:58] Executing: blkid -c /dev/null -o full /dev/ndbus0region0fsdax/e0ae1591-9822-11e9-a57c-a6a84145e48a
pmem-csi-node-2n8fm/pmem-driver: I0626 15:11:51.774867       1 glog.go:58] Output: 
pmem-csi-node-2n8fm/pmem-driver: E0626 15:11:51.774899       1 glog.go:119] NodeStageVolume: determineFilesystemType failed: exit status 2
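For reference, below is a minimal sketch of a filesystem-type probe that shells out to the same two commands seen in the log; the function name and error handling are illustrative assumptions, not the driver's actual code:

// Illustrative sketch only: probe a device's filesystem type with the same
// "file -bsL" and "blkid -c /dev/null -o full" commands that appear in the
// driver log above. probeFilesystemType is a made-up name.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

var typeRE = regexp.MustCompile(`TYPE="([^"]*)"`)

func probeFilesystemType(device string) (string, error) {
	// "file -bsL" follows symlinks; with a missing device node it prints
	// "cannot open ... (No such file or directory)" but still exits 0,
	// which is why the log shows that text under "Output:".
	out, _ := exec.Command("file", "-bsL", device).CombinedOutput()
	if strings.Contains(string(out), "cannot open") {
		return "", fmt.Errorf("device %s not accessible: %s", device, strings.TrimSpace(string(out)))
	}
	// "blkid" exits with status 2 when it cannot probe the device, which
	// matches the "determineFilesystemType failed: exit status 2" error.
	out, err := exec.Command("blkid", "-c", "/dev/null", "-o", "full", device).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("blkid %s: %v", device, err)
	}
	if m := typeRE.FindSubmatch(out); m != nil {
		return string(m[1]), nil
	}
	return "", nil // no filesystem signature found
}

func main() {
	fsType, err := probeFilesystemType(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("filesystem type:", fsType)
}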

LVM on the host shows that the volume(s) exist:

$ _work/test-latebinding/ssh.1 sudo lvs --foreign
  LV                                   VG                 Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  2e91fb61-981c-11e9-a57c-a6a84145e48a ndbus0region0fsdax -wi-a----- 4.00m                                                    
  2ecc3383-981c-11e9-a57c-a6a84145e48a ndbus0region0fsdax -wi-a----- 4.00m                                                    
  2f4ff278-981c-11e9-a57c-a6a84145e48a ndbus0region0fsdax -wi-a----- 4.00m                                                    
  e0ae1591-9822-11e9-a57c-a6a84145e48a ndbus0region0fsdax -wi-a----- 4.00m 

But those volumes cannot be accessed:

$ _work/test-latebinding/ssh.1 sudo ls /dev/ndbus0region0fsdax/e0ae1591-9822-11e9-a57c-a6a84145e48a
ls: cannot access '/dev/ndbus0region0fsdax/e0ae1591-9822-11e9-a57c-a6a84145e48a': No such file or directory
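For completeness, the same cross-check can be done in one step. The sketch below (illustrative only, not part of the test suite) lists the logical volumes in the volume group and reports any whose device path is missing; the volume group name is taken from the logs above:

// Illustrative diagnostic only: cross-check "lvs" output against the
// device nodes under /dev and report logical volumes whose path is gone,
// i.e. the lvs-vs-ls discrepancy shown above.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	vg := "ndbus0region0fsdax" // volume group from the logs above
	out, err := exec.Command("lvs", "--noheadings", "-o", "lv_name,lv_path", vg).Output()
	if err != nil {
		fmt.Fprintf(os.Stderr, "lvs %s failed: %v\n", vg, err)
		os.Exit(1)
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		name, path := fields[0], fields[1]
		if _, err := os.Stat(path); err != nil {
			fmt.Printf("LV %s is known to LVM but %s is missing: %v\n", name, path, err)
		}
	}
}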
okartau commented 5 years ago

I am confused about what the root cause is, and whether there are one or two problems mixed here. A few days ago, while testing e2e on the code of #196, I saw a failure with the same end message as here: a pod did not get out of the "pending" state for 15 minutes. But there (@avalluri helped to check the test log) the reason for the hanging pod remained unclear: it was not readable from that log. The related volume was correctly mounted by the pmem-csi node, and later unmounted and deleted after the 15-minute timeout occurred.

pohly commented 5 years ago

The sanity stress test from PR #323 can be used to focus more on the actual driver operations. When running it with one worker, all 100 volumes get created, staged, and removed just fine. When using 10 workers, the test fails quickly.
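Roughly, such a concurrent sanity stress loop can be sketched as below; the helper and counts are hypothetical, not the actual PR #323 code, but the defer GinkgoRecover() is the detail that the Ginkgo message quoted further down is about:

// Illustrative sketch of a concurrent CSI stress loop in Ginkgo style.
// createStageUnstageDelete stands in for the real CreateVolume /
// NodeStageVolume / NodeUnstageVolume / DeleteVolume sequence.
package stress_test

import (
	"fmt"
	"sync"

	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"
)

// Hypothetical helper; the real test drives the CSI gRPC endpoints.
func createStageUnstageDelete(name string) error { return nil }

var _ = Describe("stress", func() {
	It("survives concurrent workers", func() {
		const workers, volumes = 10, 100
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(worker int) {
				// A failed assertion inside a goroutine panics past Ginkgo
				// unless GinkgoRecover is deferred (see the Ginkgo message
				// quoted in the test output below).
				defer GinkgoRecover()
				defer wg.Done()
				for v := 0; v < volumes; v++ {
					name := fmt.Sprintf("worker-%d-volume-%d", worker, v)
					Expect(createStageUnstageDelete(name)).To(Succeed(), "volume %s", name)
				}
			}(w)
		}
		wg.Wait()
	})
})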

The failed operation is related to worker-6-volume-7, because that is the one which needs to be cleaned up after the test failed:

[AfterEach] pmem csi
  /nvme/gopath/src/github.com/intel/pmem-csi/test/e2e/storage/sanity.go:174
cleanup: deleting worker-6-volume-7-EB0C8FFE-D23FEEB4 = 71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.578331       1 glog.go:58] GRPC call: /csi.v1.Node/NodeUnpublishVolume
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.578341       1 glog.go:58] GRPC request: {"target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-target.EkfniG","volume_id":"71ddd139-9d62-11e9-aa47-5e9acd188e55"}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.578625       1 glog.go:58] NodeUnpublishVolume: unmount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-target.EkfniG
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.580996       1 mount_linux.go:166] Cannot run systemd-run, assuming non-systemd OS
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.581005       1 mount_linux.go:167] systemd-run failed with: exit status 1
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.581012       1 mount_linux.go:168] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
pmem-csi-node-x7g8d/pmem-driver: Failed to create bus connection: Host is down
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.581034       1 glog.go:58] GRPC response: {}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.582157       1 glog.go:58] GRPC call: /csi.v1.Node/NodeUnstageVolume
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.582166       1 glog.go:58] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC","volume_id":"71ddd139-9d62-11e9-aa47-5e9acd188e55"}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.582845       1 glog.go:58] NodeUnStageVolume: VolumeID:71ddd139-9d62-11e9-aa47-5e9acd188e55 VolumeName: Staging target path:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.582857       1 glog.go:58] Executing: lvs --noheadings --nosuffix -o lv_name,lv_path,lv_size --units B ndbus0region0fsdax
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.610502       1 glog.go:58] Output:   71dc769d-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71dc769d-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver:   71ddd139-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.613762       1 mount_linux.go:166] Cannot run systemd-run, assuming non-systemd OS
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.613774       1 mount_linux.go:167] systemd-run failed with: exit status 1
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.613781       1 mount_linux.go:168] systemd-run output: System has not been booted with systemd as init system (PID 1). Can't operate.
pmem-csi-node-x7g8d/pmem-driver: Failed to create bus connection: Host is down
pmem-csi-node-x7g8d/pmem-driver: W0703 07:16:18.614009       1 glog.go:99] NodeUnstageVolume: No device name for mount point '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC'
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.614018       1 glog.go:58] GRPC response: {}
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.616101       1 glog.go:58] GRPC call: /csi.v1.Controller/DeleteVolume
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.616114       1 glog.go:58] GRPC request: {"volume_id":"71ddd139-9d62-11e9-aa47-5e9acd188e55"}
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.616442       1 glog.go:58] DeleteVolume: volumeID: 71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.616452       1 glog.go:58] Connecting to 172.17.0.7:10001
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.619501       1 glog.go:58] GRPC call: /csi.v1.Controller/DeleteVolume
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.619509       1 glog.go:58] GRPC request: {"volume_id":"71ddd139-9d62-11e9-aa47-5e9acd188e55"}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.619764       1 glog.go:58] DeleteVolume: volumeID: 71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.619771       1 glog.go:58] Executing: lvs --noheadings --nosuffix -o lv_name,lv_path,lv_size --units B ndbus0region0fsdax
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.641507       1 glog.go:58] Output:   71dc769d-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71dc769d-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver:   71ddd139-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.641530       1 glog.go:58] ClearDevice: path: /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55 flush:true
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.641542       1 glog.go:58] Wiping entire device: /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.641547       1 glog.go:58] Executing: shred -n 1 /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.657926       1 glog.go:58] Output: 
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.657942       1 glog.go:58] Executing: lvremove -fy /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.684773       1 glog.go:58] Output:   Logical volume "71ddd139-9d62-11e9-aa47-5e9acd188e55" successfully removed
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.687928       1 glog.go:58] DeleteVolume: volume 71ddd139-9d62-11e9-aa47-5e9acd188e55 deleted
pmem-csi-controller-0/pmem-driver: I0703 07:16:18.687943       1 glog.go:58] GRPC response: {}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.687161       1 glog.go:58] DeleteVolume: volume 71ddd139-9d62-11e9-aa47-5e9acd188e55 deleted
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:18.687189       1 glog.go:58] GRPC response: {}

Note that the test suite hangs at that point. I don't know why.

Looking for worker-6-volume-7 in the rest of the log shows:

STEP: worker-6-volume-7: cleaning up calling nodeunstage
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.135539       1 glog.go:58] NodeStageVolume: mount /dev/ndbus0region0fsdax/71e03b75-9d62-11e9-aa47-5e9acd188e55 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.135551       1 glog.go:58] NodeStageVolume: namespacemode FSDAX, add dax mount option
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.135559       1 glog.go:58] NodeStageVolume: mount args: [[-c -o dax /dev/ndbus0region0fsdax/71e03b75-9d62-11e9-aa47-5e9acd188e55 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC]]
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.135572       1 glog.go:58] Executing: mount -c -o dax /dev/ndbus0region0fsdax/71e03b75-9d62-11e9-aa47-5e9acd188e55 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.142781       1 glog.go:58] GRPC call: /csi.v1.Node/NodeUnstageVolume
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.142791       1 glog.go:58] GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC","volume_id":"71ddd139-9d62-11e9-aa47-5e9acd188e55"}
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.143171       1 glog.go:58] NodeUnStageVolume: VolumeID:71ddd139-9d62-11e9-aa47-5e9acd188e55 VolumeName: Staging target path:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.143177       1 glog.go:58] Executing: lvs --noheadings --nosuffix -o lv_name,lv_path,lv_size --units B ndbus0region0fsdax
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.148507       1 glog.go:58] Output:   71dc769d-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71dc769d-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver:   71ddd139-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71ddd139-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver:   71e03b75-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71e03b75-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver:   71e06ef3-9d62-11e9-aa47-5e9acd188e55 /dev/ndbus0region0fsdax/71e06ef3-9d62-11e9-aa47-5e9acd188e55 16777216
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.148529       1 glog.go:58] ClearDevice: path: /dev/ndbus0region0fsdax/71e06ef3-9d62-11e9-aa47-5e9acd188e55 flush:true
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.148541       1 glog.go:58] Wiping entire device: /dev/ndbus0region0fsdax/71e06ef3-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.148545       1 glog.go:58] Executing: shred -n 1 /dev/ndbus0region0fsdax/71e06ef3-9d62-11e9-aa47-5e9acd188e55
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.151080       1 glog.go:58] Output: 
...
pmem-csi-node-x7g8d/pmem-driver: I0703 07:16:17.205102       1 mount_linux.go:203] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: E0703 07:16:17.213469       1 glog.go:119] NodeUnstageVolume: Umount failed: Unmount failed: exit status 32
pmem-csi-node-x7g8d/pmem-driver: Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC: not mounted.
pmem-csi-node-x7g8d/pmem-driver: 
Jul  3 09:16:17.219: INFO: Unexpected error occurred: rpc error: code = Unknown desc = Unmount failed: exit status 32
Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC: not mounted.

STEP: worker-7-volume-8: cleaning up calling nodeunstage
Jul  3 09:16:17.220: INFO: ERROR: worker-6 failed: 
Your test failed.
Ginkgo panics to prevent subsequent assertions from running.
Normally Ginkgo rescues this panic so you shouldn't see it.

But, if you make an assertion in a goroutine, Ginkgo can't capture the panic.
To circumvent this, you should call

    defer GinkgoRecover()

at the top of the goroutine that caused this panic.

STEP: worker-6 terminating
pmem-csi-node-x7g8d/pmem-driver: E0703 07:16:17.217256       1 glog.go:119] GRPC error: Unmount failed: exit status 32
pmem-csi-node-x7g8d/pmem-driver: Unmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC
pmem-csi-node-x7g8d/pmem-driver: Output: umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pmem-sanity-staging.DKDcIC: not mounted.

The logs are a bit hard to read because there's no way to associate driver output with the request that is being processed. Context-aware, structured logging would help there.
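As an illustration (none of this is existing pmem-csi code), a gRPC interceptor that tags every incoming call with a request ID in its context, plus a logging helper that prints that ID, would be enough to tie lines like the ones above back to their request:

// Illustrative only: per-request IDs via a gRPC unary interceptor and a
// context-aware logging helper. Names here are made up for the sketch.
package logctx

import (
	"context"
	"fmt"
	"log"
	"sync/atomic"

	"google.golang.org/grpc"
)

type ctxKey struct{}

var reqCounter uint64

// UnaryRequestID stores a unique ID for each incoming gRPC call in its context.
func UnaryRequestID(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler) (interface{}, error) {
	id := fmt.Sprintf("%s#%d", info.FullMethod, atomic.AddUint64(&reqCounter, 1))
	return handler(context.WithValue(ctx, ctxKey{}, id), req)
}

// Infof logs with the request ID taken from the context, if present.
func Infof(ctx context.Context, format string, args ...interface{}) {
	id, _ := ctx.Value(ctxKey{}).(string)
	log.Printf("[%s] %s", id, fmt.Sprintf(format, args...))
}

Registering the interceptor with grpc.NewServer(grpc.UnaryInterceptor(UnaryRequestID)) and logging through Infof(ctx, ...) in the CSI handlers would make each log line attributable to one call.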

But ultimately it is NodeUnstageVolume which fails. The "Your test failed. Ginkgo panics to prevent subsequent assertions from running." output is from https://github.com/intel/pmem-csi/blob/7f73fb48f3dc921b2ea5093dad30e25f02f426fc/test/e2e/storage/sanity.go#L268

I can try to make the output nicer, but for now it can be ignored.
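On the NodeUnstageVolume failure itself: the error is umount being run on a path that is no longer mounted. For illustration only (this is not necessarily what the driver or any later fix does), an idempotent unstage would check first and treat an already-unmounted staging path as success:

// Illustrative sketch of an idempotent unstage: skip the unmount when the
// staging path is not a mount point, instead of failing with
// "umount: ... not mounted" (exit status 32) as in the log above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// isMountPoint reports whether path appears as a mount target in /proc/mounts.
func isMountPoint(path string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[1] == path {
			return true, nil
		}
	}
	return false, scanner.Err()
}

// unstage unmounts the staging path only if it is actually mounted.
func unstage(stagingPath string) error {
	mounted, err := isMountPoint(stagingPath)
	if err != nil {
		return err
	}
	if !mounted {
		return nil // nothing to do, report success
	}
	if out, err := exec.Command("umount", stagingPath).CombinedOutput(); err != nil {
		return fmt.Errorf("umount %s: %v: %s", stagingPath, err, out)
	}
	return nil
}

func main() {
	if err := unstage(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}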

pohly commented 5 years ago

I updated the stress test PR; it no longer hangs and that "worker failed" output is gone.

pohly commented 5 years ago

I once (and only once) also saw CreateVolume fail. Unfortunately I didn't capture the log output immediately.

avalluri commented 5 years ago

With change #336, we shouldn't see this issue anymore. Shall we close it?

okartau commented 5 years ago

Yes, no stress failures with recent code. Closing.