linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0

nvme 2.0: connect-all --transport=tcp fails with "no available path - failing I/O" #1311

Closed johnmeneghini closed 2 years ago

johnmeneghini commented 2 years ago

I'm unable to get nvme connect-all --transport=tcp to work with a multi-path tcp array.

rhel-storage-08:nvme-cli(openssl-3-support) > uname -a
Linux rhel-storage-08.storage.lab.eng.bos.redhat.com 5.14.0-39_testFa.el9.x86_64+debug #1 SMP PREEMPT Tue Jan 4 15:03:30 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

This is a RHEL 9 beta release that has been patched up to v5.16-rc8.

rhel-storage-08:nvme-cli(openssl-3-support) > sudo .build/nvme discover --transport=tcp --trsvcid=4420 --traddr=172.16.21.240 --host-traddr=172.16.21.8 | egrep "trtype:"
[Fri Jan  7 16:11:09 2022] nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.21.240:4420
[Fri Jan  7 16:11:09 2022] nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
trtype:  fc
trtype:  fc
trtype:  fc
trtype:  fc
trtype:  tcp
trtype:  fc
trtype:  fc
trtype:  fc
trtype:  fc
trtype:  tcp

Note that the array discovery service is returning multiple discovery log page entries for both fc and tcp. We only care about the tcp entries and we might want to teach nvme connect-all how to ignore or filter different trtypes. There's no sense in trying to connect over different transports at the same time.
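
The filtering idea above could be sketched roughly as follows. This is only an illustration: `struct disc_entry` and `filter_by_trtype()` are hypothetical names, not libnvme or nvme-cli API, and the transport type is modelled as the string that `nvme discover` prints.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one discovery log page entry. */
struct disc_entry {
	const char *trtype;	/* "tcp", "fc", ... as printed by nvme discover */
	const char *traddr;
};

/*
 * Move the entries whose trtype equals `wanted` to the front of the
 * array, preserving their order, and return how many were kept.
 */
size_t filter_by_trtype(struct disc_entry *entries, size_t n,
			const char *wanted)
{
	size_t kept = 0;

	for (size_t i = 0; i < n; i++) {
		if (strcmp(entries[i].trtype, wanted) == 0)
			entries[kept++] = entries[i];
	}
	return kept;
}
```

With the ten entries shown above and `wanted` set to "tcp", only the two tcp entries would remain, so connect-all would never attempt the eight fc connections.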

rhel-storage-08:nvme-cli(openssl-3-support) > sudo .build/nvme discover --transport=tcp --trsvcid=4420 --traddr=172.16.21.240 --host-traddr=172.16.21.8 | egrep -A 9 "trtype:..tcp"
[Fri Jan  7 16:16:17 2022] nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.21.240:4420
[Fri Jan  7 16:16:17 2022] nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  2304
trsvcid: 4420
subnqn:  nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94
traddr:  172.16.21.241
eflags:  not specified
sectype: none
--
trtype:  tcp
adrfam:  ipv4
subtype: nvme subsystem
treq:    not specified
portid:  2368
trsvcid: 4420
subnqn:  nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94
traddr:  172.16.21.240
eflags:  not specified
sectype: none

There are two subsystem ports accessible to the host on the network at --host-traddr=172.16.21.8

rhel-storage-08:nvme-cli(openssl-3-support) > sudo .build/nvme connect --transport=tcp --trsvcid=4420 --traddr=172.16.21.240 --host-traddr=172.16.21.8 --nqn nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94
[Fri Jan  7 16:22:12 2022] nvme nvme0: creating 12 I/O queues.
[Fri Jan  7 16:22:12 2022] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[Fri Jan  7 16:22:12 2022] nvme nvme0: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.240:4420
rhel-storage-08:nvme-cli(openssl-3-support) > lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                             8:0    0 465.8G  0 disk 
├─sda1                          8:1    0     1G  0 part /boot
└─sda2                          8:2    0 464.8G  0 part 
  ├─cs_rhel--storage--08-root 253:0    0    70G  0 lvm  /
  ├─cs_rhel--storage--08-swap 253:1    0   5.9G  0 lvm  [SWAP]
  └─cs_rhel--storage--08-home 253:2    0 388.9G  0 lvm  /home
nvme0n1                       259:11   0    20G  0 disk 
nvme0n2                       259:15   0    50G  0 disk

nvme connect works fine.

rhel-storage-08:nvme-cli(openssl-3-support) > sudo .build/nvme disconnect /dev/nvme0 --nqn nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94
[Fri Jan  7 16:28:02 2022] nvme nvme0: Removing ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94"
NQN:nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94 disconnected 1 controller(s)

nvme disconnect works fine.

rhel-storage-08:nvme-cli(openssl-3-support) > sudo .build/nvme connect-all --transport=tcp --trsvcid=4420 --traddr=172.16.21.240 --host-traddr=172.16.21.8
[Fri Jan  7 16:33:31 2022] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.21.240:4420
failed to connect controller, error 22
[Fri Jan  7 16:33:31 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:31 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:31 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:32 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
[Fri Jan  7 16:33:32 2022] nvme nvme1: creating 12 I/O queues.
[Fri Jan  7 16:33:32 2022] nvme nvme1: mapped 12/0/0 default/read/poll queues.
[Fri Jan  7 16:33:32 2022] nvme nvme1: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.241:4420
[Fri Jan  7 16:33:32 2022] nvme nvme1: Removing ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94"
[Fri Jan  7 16:33:32 2022] block nvme1n2: no available path - failing I/O
[Fri Jan  7 16:33:32 2022] block nvme1n2: no available path - failing I/O
[Fri Jan  7 16:33:32 2022] Buffer I/O error on dev nvme1n2, logical block 13107184, async page read
failed to connect controller, error 22
[Fri Jan  7 16:33:32 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:32 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:32 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
failed to connect controller, error 22
[Fri Jan  7 16:33:32 2022] nvme_fabrics: missing parameter 'host_traddr=%s'
[Fri Jan  7 16:33:32 2022] nvme nvme1: creating 12 I/O queues.
[Fri Jan  7 16:33:33 2022] nvme nvme1: mapped 12/0/0 default/read/poll queues.
[Fri Jan  7 16:33:33 2022] nvme nvme1: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.240:4420
[Fri Jan  7 16:33:33 2022] nvme nvme1: Removing ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94"
[Fri Jan  7 16:33:33 2022] block nvme1n1: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] block nvme1n1: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] block nvme1n2: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] block nvme1n1: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] block nvme1n2: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] Buffer I/O error on dev nvme1n2, logical block 13107184, async page read
[Fri Jan  7 16:33:33 2022] block nvme1n1: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] block nvme1n1: no available path - failing I/O
[Fri Jan  7 16:33:33 2022] Buffer I/O error on dev nvme1n1, logical block 5242767, async page read
[Fri Jan  7 16:33:33 2022] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

nvme connect-all gets totally confused.

Note that the legacy nvme connect-all command works fine.

rhel-storage-08:nvme-cli(openssl-3-support) > sudo nvme connect-all --transport=tcp --trsvcid=4420 --traddr=172.16.21.240 --host-traddr=172.16.21.8
[Fri Jan  7 10:38:45 2022] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.21.240:4420
[Fri Jan  7 10:38:45 2022] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[Fri Jan  7 10:38:45 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:45 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:45 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:45 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[Fri Jan  7 10:38:46 2022] nvme nvme0: creating 12 I/O queues.
[Fri Jan  7 10:38:46 2022] nvme nvme0: mapped 12/0/0 default/read/poll queues.
[Fri Jan  7 10:38:46 2022] nvme nvme0: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.241:4420
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:46 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:46 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:46 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
Failed to write to /dev/nvme-fabrics: Invalid argument
[Fri Jan  7 10:38:46 2022] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[Fri Jan  7 10:38:46 2022] nvme nvme1: creating 12 I/O queues.
[Fri Jan  7 10:38:46 2022] nvme nvme1: mapped 12/0/0 default/read/poll queues.
[Fri Jan  7 10:38:46 2022] nvme nvme1: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.240:4420
rhel-storage-08:nvme-cli(openssl-3-support) > nvme --version
nvme version 1.14
rhel-storage-08:nvme-cli(openssl-3-support) > which nvme
/usr/sbin/nvme
rhel-storage-08:nvme-cli(openssl-3-support) > rpm -q --whatprovides /usr/sbin/nvme
nvme-cli-1.14-3.el9.x86_64

keithbusch commented 2 years ago

thanks for the report. i'm taking a look now.

keithbusch commented 2 years ago

the two versions' implementations are quite different. might take a moment for me to untangle it (i don't do fabrics too often).

igaw commented 2 years ago

I ran into the same problem. As far as I understand, the problem is in libnvme's nvmf_connect_disc_entry().

c = nvme_create_ctrl(e->subnqn, transport, traddr, NULL, NULL, trsvcid);

https://github.com/linux-nvme/libnvme/blob/6b951c53cc4c978b0617e392c776aa16400f7d63/src/nvme/fabrics.c#L651

struct nvme_ctrl *nvme_create_ctrl(const char *subsysnqn, const char *transport,
                   const char *traddr, const char *host_traddr,
                   const char *host_iface, const char *trsvcid)

https://github.com/linux-nvme/libnvme/blob/6b951c53cc4c978b0617e392c776aa16400f7d63/src/nvme/tree.c#L952
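
To illustrate why the two NULL arguments matter, here is a small stand-alone sketch. It assumes (this is not taken from libnvme) that a NULL field is simply left out of the comma-separated connect string; `append_opt()` and `build_connect_opts()` are hypothetical helpers written for this example only.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Append "key=value" (comma-separated) to buf when value is non-NULL.
 * Returns the new string length, or -1 if the buffer is too small.
 */
static int append_opt(char *buf, size_t size, size_t len,
		      const char *key, const char *value)
{
	int n;

	if (!value)
		return (int)len;	/* NULL means: leave the option out */
	n = snprintf(buf + len, size - len, "%s%s=%s",
		     len ? "," : "", key, value);
	if (n < 0 || (size_t)n >= size - len)
		return -1;
	return (int)len + n;
}

/* Same argument order as the nvme_create_ctrl() prototype quoted above. */
int build_connect_opts(char *buf, size_t size,
		       const char *subsysnqn, const char *transport,
		       const char *traddr, const char *host_traddr,
		       const char *host_iface, const char *trsvcid)
{
	const char *keys[] = { "nqn", "transport", "traddr",
			       "host_traddr", "host_iface", "trsvcid" };
	const char *vals[] = { subsysnqn, transport, traddr,
			       host_traddr, host_iface, trsvcid };
	int len = 0;

	if (size == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
		len = append_opt(buf, size, (size_t)len, keys[i], vals[i]);
		if (len < 0)
			return -1;
	}
	return len;
}
```

Called with NULL in the fourth and fifth positions, as in the line from fabrics.c, the resulting string carries no host_traddr at all, which would match the address given on the command line being dropped for the discovered entries.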

igaw commented 2 years ago

I'm testing 'connect-all' but it still fails. cfg->host_traddr and cfg->host_iface were still NULL. My workaround is:

--- a/fabrics.c
+++ b/fabrics.c
@@ -396,6 +396,8 @@ static int discover_from_conf_file(nvme_host_t h, const char *desc,
                errno = 0;
                ret = nvmf_add_ctrl(h, c, &cfg, false);
                if (!ret) {
+                 cfg.host_traddr = host_traddr;
+                 cfg.host_iface = host_iface;
                        __discover(c, &cfg, raw, connect,
                                   persistent, flags);
                        if (!persistent)

But that still fails:

dolin:~/nvme-cli/.build/:[255]# ./nvme connect-all
Failed to read /etc/nvme/config.json, json_object_from_file: error opening file /etc/nvme/config.json: No such file or directory

connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef3:pn-0x100000109b579ef3,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=0,cntlid=17088'
nvme0: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
nvme0: discover length 256
nvme0: discover length 5120
nvme0: discover genctr 6573, retry
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201800a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201800a09890f5bf,host_traddr=nn-0x200000109b579ef3:pn-0x100000109b579ef3,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=31424'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201900a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef3:pn-0x100000109b579ef3,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=31488'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201800a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201800a09890f5bf,host_traddr=nn-0x200000109b579ef3:pn-0x100000109b579ef3,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=24000'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201900a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef3:pn-0x100000109b579ef3,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=24064'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
nvme0: disconnected
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=0,cntlid=17152'
nvme0: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
nvme0: discover length 256
nvme0: discover length 5120
nvme0: discover genctr 6573, retry
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201800a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201800a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=31552'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201900a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=31616'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201800a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201800a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=24128'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201900a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=24192'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
nvme0: disconnected
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport==fc --traddr=nn-0x201700a09890f5bf:pn-0x201b00a09890f5bf --host-traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6 '
Failed to write to /dev/nvme-fabrics: Invalid argument

igaw commented 2 years ago

The kernel is complaining with

[765139.069194] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[765139.091684] nvme_fabrics: no handler found for transport =fc --traddr=nn-0x201700a09890f5bf:pn-0x201b00a09890f5bf --host-traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6 

hreinecke commented 2 years ago

Ah. The 'connect' string isn't parsed correctly.
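
The kernel message is what one would expect from a comma-separated key=value parser fed a string with no commas after transport. The sketch below is not the kernel's parser; `get_opt()` is a hypothetical helper that only mimics that general behaviour.

```c
#include <stddef.h>
#include <string.h>

/*
 * Find `key` in a comma-separated "key=value" option string and copy its
 * value into `out`. Returns 0 on success, -1 if the key is missing or the
 * value does not fit.
 */
int get_opt(const char *opts, const char *key, char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = opts;

	while (*p) {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* The token must start with "key=". */
		if (len > klen && strncmp(p, key, klen) == 0 &&
		    p[klen] == '=') {
			size_t vlen = len - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return -1;
}
```

For a well-formed string such as "nqn=...,transport=fc,traddr=..." the transport value is "fc". For the malformed string in the trace, everything after the first '=' up to the end of the string becomes the transport value, starting with "=fc --traddr=", which is exactly the name the kernel reports having no handler for.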

hreinecke commented 2 years ago

Correct fix should be

diff --git a/fabrics.c b/fabrics.c
index 4ad5291..af957ba 100644
--- a/fabrics.c
+++ b/fabrics.c
@@ -81,8 +81,8 @@ static const char *nvmf_config_file = "Use specified JSON configuration file or
        OPT_STRING("transport", 't', "STR", &transport, nvmf_tport), \
        OPT_STRING("traddr", 'a', "STR", &traddr, nvmf_traddr), \
        OPT_STRING("trsvcid", 's', "STR", &trsvcid, nvmf_trsvcid), \
-       OPT_STRING("host-traddr", 'w', "STR", &host_traddr, nvmf_htraddr), \
-       OPT_STRING("host-iface", 'f', "STR", &host_iface, nvmf_hiface), \
+       OPT_STRING("host-traddr", 'w', "STR", &c.host_traddr, nvmf_htraddr), \
+       OPT_STRING("host-iface", 'f', "STR", &c.host_iface, nvmf_hiface), \
        OPT_STRING("hostnqn", 'q', "STR", &hostnqn, nvmf_hostnqn), \
        OPT_STRING("hostid", 'I', "STR", &hostid, nvmf_hostid), \
        OPT_STRING("nqn", 'n', "STR", &subsysnqn, nvmf_nqn), \

Would've done it myself if I knew how I can teach meson to update libnvme.

hreinecke commented 2 years ago

Please check PR #1318 .

igaw commented 2 years ago

> Would've done it myself if I knew how I can teach meson to update libnvme.

When you do the first meson .build (or make), libnvme will be checked out as a normal git tree under subprojects/libnvme. After the initial checkout, meson doesn't touch this git tree unless you do something like meson subprojects update. There is nothing magical going on :)

That means you can do any git operation you like after the initial checkout.

hreinecke commented 2 years ago

Figured it out meanwhile. Please check the above PR.

igaw commented 2 years ago

The parsing error is gone. Still no connection after 'connect-all'.

The kernel says:

[769021.443721] nvme nvme1: NVME-FC{1}: create association : host wwpn 0x100000109b579ef6  rport wwpn 0x201900a09890f5bf: NQN "nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck"
[769022.397219] nvme nvme1: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
[769022.897298] nvme nvme1: NVME-FC{1}: controller connect complete
[769022.897339] nvme nvme1: NVME-FC{1}: new ctrl: NQN "nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck"
[769022.897615] nvme nvme1: Removing ctrl: NQN "nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck"
[769023.206954] block nvme1n1: no available path - failing I/O
[769023.206960] block nvme1n1: no available path - failing I/O
[769023.206963] Buffer I/O error on dev nvme1n1, logical block 8388592, async page read
[769023.277244] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

and nvme-cli:

lookup ctrl (transport: fc, traddr: nn-0x201700a09890f5bf:pn-0x201900a09890f5bf, trsvcid (null))
connect ctrl, 'nqn=nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck,transport=fc,traddr=nn-0x201700a09890f5bf:pn-0x201900a09890f5bf,host_traddr=nn-0x200000109b579ef6:pn-0x100000109b579ef6,hostnqn=nqn.2014-08.org.nvmexpress:uuid:1a9e23dd-466e-45ca-9f43-a29aaf47cb21,hostid=1a9e23dd-466e-45ca-9f43-a29aaf47cb21,ctrl_loss_tmo=600'
connect ctrl, response 'instance=1,cntlid=25984'
nvme1: ctrl connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
nvme1: disconnected
nvme0: disconnected

adding some more tracing...

hreinecke commented 2 years ago

Looks as if nvme-cli disconnects the controller immediately after connecting ...

igaw commented 2 years ago

This is the output from 'connect':

[769451.837742] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x100000109b579ef3  rport wwpn 0x201900a09890f5bf: NQN "nqn.2014-08.org.nvmexpress.discovery"
[769452.532383] nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
[769452.532389] nvme nvme0: NVME-FC{0}: controller connect complete
[769452.532428] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[769454.033336] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
[769460.316838] nvme nvme0: NVME-FC{0}: create association : host wwpn 0x100000109b579ef3  rport wwpn 0x201900a09890f5bf: NQN "nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck"
[769461.237133] nvme nvme0: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
[769461.737435] nvme nvme0: NVME-FC{0}: controller connect complete
[769461.737483] nvme nvme0: NVME-FC{0}: new ctrl: NQN "nqn.1992-08.com.netapp:sn.d646dc63336511e995cb00a0988fb732:subsystem.nvme-svm-dolin-ana_subsystem_mwilck"

It looks like the disconnect should be from the discovery controller not the new controller.

hreinecke commented 2 years ago

        if (child) {
                if (discover)
                        __discover(child, defcfg, raw, persistent, true, flags);
                if (!persistent) {
                        nvme_disconnect_ctrl(child);
                        nvme_free_ctrl(child);
                }

The 'if (discover)' ... 'if (!persistent)' conditions look dodgy; seems like we would disconnect non-discovery controllers here...
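
One way to express the intended condition is shown below. This is a sketch based only on the observation above, not the change that went into the PR; `should_disconnect()` is a hypothetical helper.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a controller created while walking
 * the discovery log should be torn down again.
 */
bool should_disconnect(bool is_discovery_ctrl, bool persistent)
{
	/* I/O controllers must stay connected; that is the point of connect-all. */
	if (!is_discovery_ctrl)
		return false;
	/* A discovery controller is kept only when persistence was requested. */
	return !persistent;
}
```

In the quoted snippet the disconnect depends on `persistent` alone, so a freshly connected I/O controller is removed whenever `persistent` is false. Nesting the `if (!persistent)` block under `if (discover)` would give the behaviour sketched here.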

igaw commented 2 years ago

yep, that's where we disconnect.

hreinecke commented 2 years ago

Fix pushed to PR #1318 . Please test.

igaw commented 2 years ago

Works with the latest version from #1318.

igaw commented 2 years ago

Fixes merged. Closing bug. If it is still failing, please reopen.

johnmeneghini commented 2 years ago

I just want to report that I've tested out these changes and they work great! Thanks for fixing this bug.

[root@rhel-storage-08 nvme-cli]# .build/nvme connect-all --transport=tcp --trsvcid=4420 --traddr=172.16.21.241 --host-traddr=172.16.21.108
[609852.098034] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.16.21.241:4420
failed to connect controller, error 22
[609852.182222] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[609852.226806] nvme_fc: nvme_fc_parse_traddr: bad traddr string
failed to connect controller, error 22
failed to connect controller, error 22
[609852.272140] nvme_fc: nvme_fc_parse_traddr: bad traddr string
failed to connect controller, error 22
[609852.317337] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[609852.391945] nvme nvme3: creating 12 I/O queues.
[609852.425913] nvme nvme3: mapped 12/0/0 default/read/poll queues.
[609852.459345] nvme nvme3: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.241:4420
failed to connect controller, error 22
[609852.518580] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[609852.562760] nvme_fc: nvme_fc_parse_traddr: bad traddr string
failed to connect controller, error 22
failed to connect controller, error 22
[609852.606262] nvme_fc: nvme_fc_parse_traddr: bad traddr string
failed to connect controller, error 22
[609852.651592] nvme_fc: nvme_fc_parse_traddr: bad traddr string
[609852.727124] nvme nvme4: creating 12 I/O queues.
[609852.758512] nvme nvme4: mapped 12/0/0 default/read/poll queues.
[609852.791584] nvme nvme4: new ctrl: NQN "nqn.1988-11.com.dell:powerstore:00:88b402df2d762AA7AF94", addr 172.16.21.240:4420
[609852.816619] nvme nvme0: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

igaw commented 2 years ago

Thanks for reporting and testing!