openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

fio fails while sending IO to a nexus while a replica pod is being unscheduled #704

Closed · exalate-issue-sync[bot] closed this issue 3 years ago

exalate-issue-sync[bot] commented 3 years ago

Seen when running e2e tests on a 3-node cluster with a 3-replica volume published via nvmf. When fio is sending IO to a nexus while one of the replica pods is terminated, fio can fail with an error.

Steps to reproduce:

1. Provision a 3-replica volume, published using nvmf.
2. Run fio continuously against the volume.
3. Remove one of the non-nexus mayastor pods, e.g. by removing the mayastor label from its node.
4. Observe that fio fails with a non-zero return code.

The issue is seen only occasionally, so the steps above may need to be repeated multiple times.

Observed that the nexus restarted when this happened. Looking at the logs of the previous instance (using logs --previous), the error occurred in nexus_io.rs::child_retire():

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` 
      value: BdevNotFound
         { name: "172.18.8.103:8420/nqn.2019-05.io.openebs:dcb6b09e-8342-4a2b-a326-100e85096e02n1" }',
         mayastor/src/bdev/nexus/nexus_io.rs:340:46

The bdev is that of the faulted child.
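
For illustration, here is a minimal self-contained Rust sketch of the failing pattern; the `Error` enum and `lookup_bdev` function are hypothetical stand-ins rather than Mayastor's real types, but the unwrap-on-Err shape matches the backtrace above:

```rust
#[derive(Debug)]
enum Error {
    BdevNotFound { name: String },
}

// Stand-in lookup: the bdev has already been destroyed, e.g. by a
// concurrent RemoveChildNexus gRPC request.
fn lookup_bdev(name: &str) -> Result<(), Error> {
    Err(Error::BdevNotFound { name: name.to_string() })
}

fn main() {
    // Equivalent of the failing call: unwrapping the Err panics the 'main'
    // thread, which takes the nexus process (and therefore the pod) down.
    lookup_bdev("172.18.8.103:8420/nqn.2019-05.io.openebs:example-replica-n1").unwrap();
}
```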

The issue seems to be that, in a Kubernetes environment, moac will attempt to remove a child from the nexus (mayastor_grpc.rs::remove_child_nexus()) if the corresponding mayastor instance is gone. This gRPC request can coincide with the nexus's own mechanism for faulting a child (child_retire()) when an errored IO is returned, in which case the bdev may have already been deleted.

However, the gRPC request should also remove the child object from the nexus, so the call to nexus.child_lookup() made by child_retire() should fail. There appears to be a race condition between these two code paths.
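
To make the suspected interleaving concrete, here is a deliberately simplified sketch; the `Nexus` struct and its fields are illustrative only, and the two code paths actually run concurrently rather than sequentially as written here:

```rust
struct Nexus {
    children: Vec<String>, // child URIs still registered with the nexus
    bdevs: Vec<String>,    // bdevs currently present on the node
}

fn main() {
    let mut nexus = Nexus {
        children: vec!["child-1".to_string()],
        bdevs: vec!["child-1".to_string()],
    };

    // gRPC path (remove_child_nexus), step 1: the child's bdev is destroyed.
    nexus.bdevs.retain(|b| b != "child-1");

    // The IO-error path (child_retire) can run in this window: the child
    // lookup still succeeds, but the bdev lookup fails with BdevNotFound,
    // which is exactly the state the unwrap() panicked on.
    let child_still_listed = nexus.children.iter().any(|c| c == "child-1");
    let bdev_present = nexus.bdevs.iter().any(|b| b == "child-1");
    assert!(child_still_listed && !bdev_present);

    // gRPC path, step 2: only now is the child removed from the nexus,
    // closing the window.
    nexus.children.retain(|c| c != "child-1");
}
```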

exalate-issue-sync[bot] commented 3 years ago

Jeffry Molanus commented:

Christopher Denyer

Are we making use of a filesystem or a raw block device? Can you add the kernel logs?

exalate-issue-sync[bot] commented 3 years ago

Christopher Denyer commented:

Only k8s-3 (hosting the faulted replica) and k8s-2 (hosting the other replica) had anything in the kernel logs for today (attached). Nothing for the nexus node (k8s-1). Fio runs against a filesystem in this test.

exalate-issue-sync[bot] commented 3 years ago

Jeffry Molanus commented:

Jonathan Teh You were hitting a similar issue, were you not?

exalate-issue-sync[bot] commented 3 years ago

Jonathan Teh commented:

No, I run Mayastor on its own and haven't tried parallel removal of a child (over gRPC and with a faulted child).

exalate-issue-sync[bot] commented 3 years ago

Origination: MayaData Jira, issue null

exalate-issue-sync[bot] commented 3 years ago

Christopher Denyer commented:

This was turned into a logged error message to avoid the crash and was checked in at https://github.com/openebs/Mayastor/commit/8510fb5cb469c5b46e9ee4ec3933a1e706d5e104. Another Jira ticket was raised to look at improving the synchronization: https://mayadata.atlassian.net/browse/CAS-632
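
For reference, a rough sketch of the shape of such a fix (hypothetical types, not the code in the linked commit): the bdev lookup result is matched rather than unwrapped, and the not-found case is logged so the nexus keeps running:

```rust
#[derive(Debug)]
enum Error {
    BdevNotFound { name: String },
}

// Stand-in lookup: pretend the bdev has already been destroyed.
fn lookup_bdev(name: &str) -> Result<(), Error> {
    Err(Error::BdevNotFound { name: name.to_string() })
}

fn retire_child(bdev_name: &str) {
    match lookup_bdev(bdev_name) {
        Ok(()) => {
            // Proceed with faulting/retiring the child as before.
        }
        Err(Error::BdevNotFound { name }) => {
            // The bdev was already removed (e.g. by a concurrent
            // RemoveChildNexus): log and carry on instead of panicking,
            // so the nexus keeps serving IO from the remaining children.
            eprintln!("bdev {} not found while retiring child, ignoring", name);
        }
    }
}

fn main() {
    retire_child("nqn.2019-05.io.openebs:example-replica-n1");
}
```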

GlennBullingham commented 3 years ago

Cleanup: Closing manually since the automation of jira<->github synchronisation has been suspended and this issue will receive no further updates.