Closed: jkryl closed this issue 3 years ago
I've been able to replicate this unfortunate behaviour. There is a mixture of things at play here, most notably that our default settings are still tuned for development. That is to say, we poll the queues at such a high rate that the error handling hardly makes any progress.
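To make that starvation point concrete, here is a purely illustrative sketch (not mayastor's actual reactor or poller code): a completion poller that never yields monopolises its executor thread, whereas a throttled variant gives other tasks, such as the fault-handling path, a chance to run.

use std::time::Duration;

// Illustrative only: `poll` stands in for draining a completion queue.
async fn poll_hot<F: FnMut() -> usize>(mut poll: F) {
    loop {
        let _completions = poll(); // never awaits, so nothing else on this thread runs
    }
}

// Throttled variant: yielding between polls lets other futures make progress,
// e.g. the error-handling path.
async fn poll_throttled<F: FnMut() -> usize>(mut poll: F, pause: Duration) {
    loop {
        let _completions = poll();
        tokio::time::sleep(pause).await;
    }
}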
In any case, I've written a simple test case that exhibits the problem; the relevant snippet:
create_work(Arc::clone(&queue)).await; // start pumping IO

let mut ticker = tokio::time::interval(Duration::from_secs(1));
for i in 1..20 {
    ticker.tick().await;

    // kill one of the replica containers after 5 ticks
    if i == 5 {
        kill_replica().await;
    }

    // after each tick, display bdev stats
    ms.spawn(async move {
        let bdev = Bdev::bdev_first().unwrap().into_iter();
        for b in bdev {
            let result = b.stats().await.unwrap();
            println!("{}: {:?}", b.name(), result);
        }
    })
    .await;

    // ctrl-c was hit during the test, exit now
    if SIG_RECEIVED.load(Ordering::Relaxed) {
        break;
    }
}
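For completeness, SIG_RECEIVED in the snippet above is just an atomic flag; its wiring is not shown in the test, but a minimal sketch (my assumption, using the signal_hook and once_cell crates) could look like this:

use std::sync::atomic::AtomicBool;
use std::sync::Arc;

use once_cell::sync::Lazy;

// Hypothetical wiring: signal_hook sets the shared AtomicBool when SIGINT or
// SIGTERM arrives, and the test loop above merely loads it each iteration.
static SIG_RECEIVED: Lazy<Arc<AtomicBool>> = Lazy::new(|| {
    let flag = Arc::new(AtomicBool::new(false));
    for sig in [signal_hook::consts::SIGINT, signal_hook::consts::SIGTERM] {
        signal_hook::flag::register(sig, Arc::clone(&flag))
            .expect("failed to register signal handler");
    }
    flag
});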
Several crashes were observed during these tests, but with some initial improvements to the settings and some reordering of the removal code, I get the following output:
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 112, num_write_ops: 2, bytes_read: 57344, bytes_written: 34304 }
10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 131, num_write_ops: 2, bytes_read: 67072, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 262, num_write_ops: 0, bytes_read: 134144, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 112, num_write_ops: 0, bytes_read: 57344, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 7285, num_write_ops: 2, bytes_read: 3729920, bytes_written: 34304 }
10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 7294, num_write_ops: 2, bytes_read: 3734528, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 14633, num_write_ops: 0, bytes_read: 7492096, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 7343, num_write_ops: 0, bytes_read: 3759616, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 14485, num_write_ops: 2, bytes_read: 7416320, bytes_written: 34304 }
10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 14502, num_write_ops: 2, bytes_read: 7425024, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 29033, num_write_ops: 0, bytes_read: 14864896, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 14590, num_write_ops: 0, bytes_read: 7470080, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 21674, num_write_ops: 2, bytes_read: 11097088, bytes_written: 34304 }
10.1.0.2:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 21674, num_write_ops: 2, bytes_read: 11097088, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 43378, num_write_ops: 0, bytes_read: 22209536, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 21813, num_write_ops: 0, bytes_read: 11168256, bytes_written: 0 }
It is a little hard to read, but the key thing here is that we have nexus0, to which we are pushing IO, and two replicas whose names start with their respective container IPs. After 5 ticks, one container is killed and we observe:
2020-11-13T16:46:40.469970079+01:00 ERROR mayastor::logger: CQ error, abort requests after transport retry counter exceeded
2020-11-13T16:46:40.470041705+01:00 ERROR mayastor::logger: ctrlr 10.1.0.2 in failed state.
2020-11-13T16:46:40.470093964+01:00 ERROR mayastor::logger: readv failed: rc = -6
2020-11-13T16:46:40.470189225+01:00 ERROR mayastor::logger: readv failed: rc = -6
2020-11-13T16:46:40.471346149+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x000020000397e0c0
2020-11-13T16:46:40.471392677+01:00 ERROR mayastor::logger: readv failed: rc = -6
<SNIP>
A few more of these lines follow, but once we start processing the completion callback we can see that the system is reconfiguring:
2020-11-13T16:46:40.475426377+01:00 INFO mayastor::bdev::nexus::nexus_bdev: nexus0: Dynamic reconfiguration event: ChildFault started
2020-11-13T16:46:40.479926742+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x000020000395ecc0
2020-11-13T16:46:40.479965104+01:00 INFO mayastor::bdev::nexus::nexus_channel: nexus0(tid:"none"), refreshing IO channels
2020-11-13T16:46:40.480043283+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x000020000397e0c0
2020-11-13T16:46:40.480092867+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x00002000039650c0
2020-11-13T16:46:40.481241124+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x0000200003990040
2020-11-13T16:46:40.481302110+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x0000200003978940
2020-11-13T16:46:40.481355280+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x000020000398a8c0
2020-11-13T16:46:40.481402029+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x00002000039669c0
2020-11-13T16:46:40.481448427+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x000020000398e740
2020-11-13T16:46:40.481494555+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x00002000039682c0
2020-11-13T16:46:40.481554148+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x0000200003981f40
2020-11-13T16:46:40.481612448+01:00 ERROR mayastor::core::io_driver: job_nexus0: 0x00002000039763c0
2020-11-13T16:46:40.482749925+01:00 INFO mayastor::bdev::nexus::nexus_bdev: nexus0: Dynamic reconfiguration event: ChildFault completed Ok(0)
2020-11-13T16:46:40.482809989+01:00 INFO mayastor::bdev::nexus::nexus_child: Closing child nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0
2020-11-13T16:46:40.482908626+01:00 INFO bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}: mayastor::nexus_uri: new
2020-11-13T16:46:40.482967538+01:00 INFO bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}: mayastor::nexus_uri: enter
2020-11-13T16:46:40.483120658+01:00 INFO bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}: mayastor::nexus_uri: exit
2020-11-13T16:46:40.483166115+01:00 INFO bdev_destroy{uri="nvmf://10.1.0.2:8420/nqn.2019-05.io.openebs:disk0"}: mayastor::nexus_uri: close time.busy=1
Once this has all passed, we can see we are pushing IO again, but the failed device has been removed:
nexus0: BdevStats { num_read_ops: 57934, num_write_ops: 0, bytes_read: 29662208, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 29180, num_write_ops: 0, bytes_read: 14940160, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 35934, num_write_ops: 2, bytes_read: 18398208, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 64870, num_write_ops: 0, bytes_read: 33213440, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 36171, num_write_ops: 0, bytes_read: 18519552, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 43126, num_write_ops: 2, bytes_read: 22080512, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 72070, num_write_ops: 0, bytes_read: 36899840, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 43402, num_write_ops: 0, bytes_read: 22221824, bytes_written: 0 }
10.1.0.3:8420/nqn.2019-05.io.openebs:disk0n1: BdevStats { num_read_ops: 50326, num_write_ops: 2, bytes_read: 25766912, bytes_written: 34304 }
nexus0: BdevStats { num_read_ops: 79262, num_write_ops: 0, bytes_read: 40582144, bytes_written: 0 }
disk0: BdevStats { num_read_ops: 50657, num_write_ops: 0, bytes_read: 25936384, bytes_written: 0 }
This test was using the API directly, that is, there is no target involved. We should verify that it works correctly with a target and initiator, but usually that's far more forgiving than the API used directly (as the initiator implicitly has reset and retry logic).
I've done some more testing around this particular issue; here is an excerpt of the things I've done (apart from the test added in the previous comment).
With two nvmf targets pre-configured:
$ ./target/debug/mayastor-client nexus create $UUID 500MiB "nvmf://127.0.0.1:4420/replica0 nvmf://127.0.0.1:4420/replica1"
Nexus e46329bb-7e7a-480b-8b97-acc17fb31ba5 created
$ ./target/debug/mayastor-client bdev share nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5
{
"uri": "nvmf://127.0.0.1:8420/nqn.2019-05.io.openebs:nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5"
}
This, in effect, is the same sequence of calls that the CSI driver makes. Once the target is shared, we can connect to it and run some IO:
sudo nvme connect -t tcp -a 127.0.0.1 -s 8420 -n nqn.2019-05.io.openebs:nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5
The device shows up as nvme3, and we then use fio to drive some IO to it. Once that is running, I disable one of the targets, which yields the following in the logs:
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
WARN mayastor::bdev::nexus::nexus_io:nexus_io.rs:329] core 1 thread Some(Mthread(0x55db26432f40)), faulting child nvmf://127.0.0.1:4420/replica0: Faulted(IoError), blk_cnt: 262144, blk_size: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:454] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5: Dynamic reconfiguration event: ChildFault started
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
ERROR mayastor::bdev::nexus::nexus_io:nexus_io.rs:319] name: 127.0.0.1:4420/replica0n1, driver: nvme, product: NVMe disk, num_blocks: 262144, block_len: 4096
INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:102] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5(thread:"mayastor_nvmf_tcp_pg_core_1"), refreshing IO channels
INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:102] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5(thread:"mayastor_nvmf_tcp_pg_core_2"), refreshing IO channels
INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:102] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5(thread:"mayastor_nvmf_tcp_pg_core_34"), refreshing IO channels
INFO mayastor::bdev::nexus::nexus_channel:nexus_channel.rs:248] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5: Reconfigure completed
INFO mayastor::bdev::nexus::nexus_bdev:nexus_bdev.rs:468] nexus-e46329bb-7e7a-480b-8b97-acc17fb31ba5: Dynamic reconfiguration event: ChildFault completed 0
INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://127.0.0.1:4420/replica0"}:nexus_uri.rs:79] new
INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://127.0.0.1:4420/replica0"}:nexus_uri.rs:79] enter
INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://127.0.0.1:4420/replica0"}:nexus_uri.rs:79] exit
INFO mayastor::nexus_uri:bdev_destroy{uri="nvmf://127.0.0.1:4420/replica0"}:nexus_uri.rs:79] close time.busy=143µs time.idle=99.1µs
INFO mayastor::core::bdev:bdev.rs:168] Received remove event for bdev 127.0.0.1:4420/replica0n1
INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:370] Removing child nvmf://127.0.0.1:4420/replica0
INFO mayastor::bdev::nexus::nexus_child:nexus_child.rs:406] Child nvmf://127.0.0.1:4420/replica0 removed
While this was happening, in a different terminal I observed the NVMe queues as the kernel sees them, using the following bpftrace script:
// completions with a non-zero status (failed IO) for the controller given as $1
tracepoint:nvme:nvme_complete_rq / args->ctrl_id == $1 && args->status != 0 / {
    if (args->qid != 0) {
        printf("<-- device: %s, retries: %d, status: %x, cid=%d, qid=%d\n",
            args->disk, args->retries, args->status, args->cid, args->qid);
    } else {
        printf("<-- controller: %d, retries: %d, status: %x, cid=%d, qid=%d\n",
            args->ctrl_id, args->retries, args->status, args->cid, args->qid);
    }
}

// completions that have been retried at least once
tracepoint:nvme:nvme_complete_rq / args->ctrl_id == $1 && args->retries != 0 / {
    if (args->qid != 0) {
        printf("<-- retried: %s, retries: %d, status: %x, cid=%d, qid=%d\n",
            args->disk, args->retries, args->status, args->cid, args->qid);
    } else {
        printf("<-- controller: %d, retries: %d, status: %x, cid=%d, qid=%d\n",
            args->ctrl_id, args->retries, args->status, args->cid, args->qid);
    }
}
This resulted in the following output:
sudo bpftrace nvme_complete_rq.bt 3
Attaching 3 probes...
<-- device: nvme3n1, retries: 0, status: 6, cid=14, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=54, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=7, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=11, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=6, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=10, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=12, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=13, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=49, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=51, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=52, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=16, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=25, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=33, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=17, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=34, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=36, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=27, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=28, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=30, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=31, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=32, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=55, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=18, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=60, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=35, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=20, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=21, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=57, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=58, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=59, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=61, qid=37
<-- device: nvme3n1, retries: 0, status: 6, cid=62, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=16, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=52, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=51, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=49, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=13, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=12, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=10, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=6, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=32, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=31, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=30, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=28, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=27, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=36, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=34, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=7, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=14, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=54, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=11, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=25, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=33, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=17, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=62, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=61, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=59, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=58, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=57, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=21, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=20, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=35, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=60, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=55, qid=37
<-- retried: nvme3n1, retries: 1, status: 0, cid=18, qid=37
The kernel retries all the IO that failed. This also indicates that if one of the replicas goes away, the error it produces is propagated back to the initiator, which retries the IO. We might want to make the retries implicit, i.e. have the IO retried internally, but for now I think it is fair to close this issue, as it specifically asserts that we should not crash, which AFAIK we don't (and actually never did). This is demonstrated by the test cases and the above example.
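If we did decide to retry internally, the shape of it could be as simple as a bounded retry wrapper around the child IO submission. The sketch below is a generic illustration only; the names and structure are mine, not mayastor's actual IO path, where the resubmission would go to a healthy child after the faulted one has been retired.

use std::future::Future;

// Illustrative only: retry an async operation up to `max_retries` times.
async fn submit_with_retry<F, Fut, T, E>(mut submit: F, max_retries: usize) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    let mut attempt = 0;
    loop {
        match submit().await {
            Ok(v) => return Ok(v),
            // hypothetical policy: simply resubmit until the retry budget is spent
            Err(_) if attempt < max_retries => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}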
Feel free to reopen if it occurs again.
It is a combination of control and data plane issues that results in the unfortunate situation of the volume being unavailable even though just one replica out of N is missing. The test environment is a 3-node cluster with 3 replicas of the same volume. We deliberately terminate one of the nodes that runs the application and one of the replicas (not a nexus; that is a different issue, #508). So what happens then:
The following message appears circa 20x in the mayastor log file (the instance with the nexus):
then
repeated many times. And then a panic at the end:
What does the control plane do in the meantime?
k8s tries to unpublish the volume so that it can move the application to a different node, but the mayastor node fails to reply and moac waits forever (that is what #506 is about). Because this fails, the application cannot be rescheduled and is stuck, and nothing else can use the volume. The control plane side of things will be fixed as part of #506. This ticket exists to track the problem in the data plane (mayastor) and its inadequate reaction to the situation where one of the replicas becomes unreachable (it should not crash).