Open davidrohr opened 8 months ago
Not sure this will make any difference for RDMA. I'm more familiar with TCP.
For TCP, the Service ID used for Discovery is typically 8009. And the Service ID for I/O is 4420. You could try nvme discover -t rdma -a [IP] -s 8009
to see if that makes a difference.
This looks like the RMDA transport is having an issue. libnvme/nvme-cli doesn't differentiate between the transports. Also the nvme subsystem core doesn't differentiate between the transport.
What you could do is to dump the discovery log page for TCP and RDMA (nvme get-log
). This would give us a clue.
Also try to add '-vvv' when doing the nvme discover
command. This also sheds some light on nvme_discovery_log
FWIW, Linux's nvmet implementation (aka soft target) doesn't implement the discovery controller on port 8009.
For the service ID: I am explicity setting this when I set up the nvmet on the server. I have changed it to 8009, but it doesn't make any difference. Also, I am quite sure it connects to the right service ID, since when the ID differs between client and server, nvme discover / connect complain that they cannot connect.
I ran nvme discover -vvv [...]
. The logs are attached, but don't tell anything to me.
logs.tar.gz
Could you be more specific which log page to obtain with nvme get-log
, i.e. which log-id and how many bytes. I could not find any information which one is the discovery log.
The logs show that the initial get header already is corrupted for RDMA. So it's not that the loop is going wrong.
TCP:
nvme3: get header (try 0/10)
nvme3: get 4 records (genctr 9)
nvme3: get header again
RDMA:
nvme3: get header (try 0/10)
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
I just realized we need a block device in order to run the get-log
command. So you need first to connected to the discovery controller and then do a nvme get-log
:
# nvme connect -t rdma -a XXX -s 4420 -n nqn.2014-08.org.nvmexpress.discovery
[ get the nvme decvice from dmesg, it wont show up in nvme list]
# nvme get-log /dev/nvme$X -i 0x70 -l 4096
The logs show that the initial get header already is corrupted for RDMA. So it's not that the loop is going wrong.
Yes, as I wrote in the original post, the numrec field in the header has a bogus value. Thus it computes a very large size to allocate for the log page, and fails the allocation.
# nvme get-log /dev/nvme$X -i 0x70 -l 4096
thx, the 0x70 was what I was missing. Will get that tonight connecting to the server both via TCP and via RDMA, and will post the logs here.
Thx for the quick support!
I was just making sure the loop works. We had some bugs in this loop thus I am a bit paranoid.
FWIW, in the base spec the output is specified in 5.16.1.23 Discovery Log Page (Log Page Identifier 70h)
I cannot get a log for id 70h / 112:
~ # nvme get-log /dev/nvme0n1 --log-id=112 --log-len 4096
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x6002)
~ # nvme get-log /dev/nvme0n1 --log-id=70h --log-len 4096
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x6002)
While e.g. obtaining the log for id 2 (smart) works, so the nvme get-log is working in general, just not for id 70h.
Are you sure you connected to the discovery controller. A discovery controller doesn't have name spaces. So it's important to use the correct NQN: -n nqn.2014-08.org.nvmexpress.discovery
The header is not necessarily corrupted. We never evaluate the CQE status when calling nvme_get_log_page(); it surely will return an errno when the ioctl fails, but unfortunately an ioctl success just means that we got an CQE correctly, not that the CQE does not have an error status ...
Check if https://github.com/linux-nvme/libnvme/pull/810 helps here.
-n nqn.2014-08.org.nvmexpress.discovery
I am not fully sure what to do :(:
~ # nvme list
Node Generic SN Model Namespace Usage Format FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme0n1 /dev/ng0n1 6722de001c9e5e3af9e8 Linux 0x1 8.00 TB / 8.00 TB 512 B + 0 B 6.8.2-ge
~ # nvme get-log /dev/nvme0n1 --log-id=70h --log-len=512 -n 1
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x6002)
~ # nvme get-log /dev/nvme0n1 --log-id=70h --log-len=512 -n nqn.2014-08.org.nvmexpress.discovery
Expected word argument for 'namespace-id' but got 'nqn.2014-08.org.nvmexpress.discovery'!
~ # nvme get-log /dev/nvme0n1 --log-id=70h --log-len=512
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x6002)
I have done nvme connect
, and the device is there. I can read it with dd
, but mounting works only when I connect via tcp
, not rdma
. Where shall I use -n nqn.2014-08.org.nvmexpress.discovery
? As you see above, nvme get-log
doesn't accept that option, it expects a namespace. If I put in the namespace 1 of the device, it still doesn't work.
'discovery' works without a controller connected. You could try with
nvme discover -t rdma -a [IP] -s 4420
'discovery' works without a controller connected. You could try with
nvme discover -t rdma -a [IP] -s 4420
But this is what I did before, see the logs here: https://github.com/linux-nvme/libnvme/issues/807#issuecomment-2033022371 It works if I connect over tcp, but over rdma it fails with
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
Check if #810 helps here.
I tried with this PR on top of master, but I am still getting:
~ # nvme discover -t rdma -a 192.168.110.9 -s 4421
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
I deployed the new version only on the client so far, do I also need to update it on the server?
Can you try with this patch:
diff --git a/src/nvme/fabrics.c b/src/nvme/fabrics.c
index 96ba9d4b..44a8ac62 100644
--- a/src/nvme/fabrics.c
+++ b/src/nvme/fabrics.c
@@ -1106,7 +1106,7 @@ retry_header:
log = __nvme_alloc(sizeof(*log) + entries_size);
if (!log) {
nvme_msg(r, LOG_ERR,
- "could not alloc memory for discovery log page\n");
+ "could not alloc %" PRIu64 " discovery log records\n", numrec);
errno = ENOMEM;
return NULL;
}
Can you try with this patch:
This gives
qon /usr/portage/sys-libs/libnvme # nvme discover -t rdma -a 192.168.110.9 -s 4421
could not alloc 7004333495895549496 discovery log records
failed to get discovery log: Cannot allocate memory
As I wrote in the original post, this numrec entry seems always bogus.
Again, here the exact commands you need to execute:
# nvme connect -t rdma -a XXX -s 4420 -n nqn.2014-08.org.nvmexpress.discovery
[ get the nvme decvice from dmesg, it wont show up in nvme list]
# nvme get-log /dev/nvme$X -i 0x70 -l 4096
Note the device is /dev/nvme1
and not /dev/nvme1n1
. And you wont see this device listed by nvme list
. dmesg
shows when you have connected to the discovery controller.
Again, here the exact commands you need to execute:
@igaw : OK, thx a lot, now I understand, wasn't clear to me that the discovery controler is an nvme device by itself.
Please find the logs here: logs.tar.gz
I set up nvmet for tcp over service 4420 and over rdma over 4421, both exposing 3 subsystems with 1 NVME device each. The log contains the 0x70 page for both of them. (Note that I tried to have 1 subsystem with 3 devices, or 3 subsystems with 1 device, and I also changed the service numbers, i.e. had both of them on 4420 separately, but this doesn't change anything. I am only including the log for one case.)
───────┬─────────────────────────────────────────────────────────────────────────────────────────
│ File: tcp.log
───────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ Device:nvme3 log-id:112 namespace-id:0xffffffff
2 │ 0 1 2 3 4 5 6 7 8 9 a b c d e f
3 │ 0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
───────┴─────────────────────────────────────────────────────────────────────────────────────────
───────┬─────────────────────────────────────────────────────────────────────────────────────────
│ File: rdma.log
───────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ Device:nvme4 log-id:112 namespace-id:0xffffffff
2 │ 0 1 2 3 4 5 6 7 8 9 a b c d e f
3 │ 0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
───────┴─────────────────────────────────────────────────────────────────────────────────────────
Both transport have the same header, genctr
is 9 and numrec
is 4. But the command libnvme issues gets something wrong for rdma. Very strange.
@davidrohr could you retry with the latest version from the git trees (libnvme and nvme-cli). @hreinecke updated the status and error handling in the discovery code. There was a bit of confusion in the error handling.
see the readme of nvme-cli to make sure you get the libnvme correct.
@igaw : Thx, I rebuild libnvme and nvme-cli from the upstream repo, and now the discovery over rdma is failing with Protocol erro
. Also the nvme get-log
command fails with the same error now over RDMA, while everything works over TCP.
See the full log here:
qon /usr/portage/sys-apps/nvme-cli # nvme discover -t tcp -a 192.168.110.9 -s 4420
Discovery Log Number of Records 4, Generation counter 9
=====Discovery Log Entry 0======
trtype: tcp
adrfam: ipv4
subtype: current discovery subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid: 4420
subnqn: nqn.2014-08.org.nvmexpress.discovery
traddr: 192.168.110.9
eflags: none
sectype: none
=====Discovery Log Entry 1======
trtype: tcp
adrfam: ipv4
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid: 4420
subnqn: qon1
traddr: 192.168.110.9
eflags: none
sectype: none
=====Discovery Log Entry 2======
trtype: tcp
adrfam: ipv4
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid: 4420
subnqn: qon2
traddr: 192.168.110.9
eflags: none
sectype: none
=====Discovery Log Entry 3======
trtype: tcp
adrfam: ipv4
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
portid: 1
trsvcid: 4420
subnqn: qon3
traddr: 192.168.110.9
eflags: none
sectype: none
qon /usr/portage/sys-apps/nvme-cli # nvme discover -t rdma -a 192.168.110.9 -s 4421
failed to identify controller, error Protocol error
failed to add controller, error Protocol error
qon /usr/portage/sys-apps/nvme-cli # nvme connect -n nqn.2014-08.org.nvmexpress.discovery -t rdma -a 192.168.110.9 -s 4421
connecting to device: nvme3
qon /usr/portage/sys-apps/nvme-cli # nvme get-log /dev/nvme3 -i 0x70 -l 8192 > rdma.log
log page: Protocol error
qon /usr/portage/sys-apps/nvme-cli # nvme connect -n nqn.2014-08.org.nvmexpress.discovery -t tcp -a 192.168.110.9 -s 4420
connecting to device: nvme4
qon /usr/portage/sys-apps/nvme-cli # nvme get-log /dev/nvme4 -i 0x70 -l 8192 > rdma.log
qon /usr/portage/sys-apps/nvme-cli # cat rdma.log
Device:nvme4 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
[...]
Alright, this means the error handling catches now the failing commands and we don't blindly continue. This is better but obviously we haven't figured out why commands fail.
The first failing command is already the identify controller. I don't really see how TCP can succeed and RDMA fails. I think it's possible to collect the wire conversation between host and target using wireshark/tcpdump. Could you try to record the RDMA attempt?
Hm, I just tried with wireshark, and I can capture the traffic when connecting over TCP (also connecting with TCP over IPoIB on the Infiniband interface). But when discovering with RDMA, wireshark does not catch anything. Also the
qon ~ # nvme connect -n nqn.2014-08.org.nvmexpress.discovery -t rdma -a 192.168.110.9 -s 4421
connecting to device: nvme3
which apparently succeeds does not show anything in wireshark.
I can have a look another day if I manage to record something, or if I can trouble-shoot it by adjusting the libnvme code.
Thanks for the experiment. I haven't played with RDMA yet. So it was just a guess it might work. Maybe post the output from the discover command with -vvv
. Meanwhile I try to figure out how to proceed.
ok, interestingly, trying a few times, I have seen that now sometimes I get the protocol error
, but sometimes the memory allocation error I had before. Not sure why yesterday I was getting always the protocol error. Anyhow, logs with -vvv
for both cases are below.
qon ~ # nvme discover -t rdma -a 192.168.110.9 -s 4421 -vvv
scan controller nvme0
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
scan controller nvme0 path nvme0c0n1
scan controller nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
scan controller nvme1 path nvme1c1n1
scan controller nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme2
scan controller nvme2 path nvme2c2n1
scan subsystem nvme-subsys0
scan subsystem nvme-subsys0 namespace nvme0n1
scan subsystem nvme-subsys1
scan subsystem nvme-subsys1 namespace nvme1n1
scan subsystem nvme-subsys2
scan subsystem nvme-subsys2 namespace nvme2n1
Error opening /etc/nvme/config.json, No such file or directory
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos fast_io_fail_tmo discovery
option "keyring" ignored
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.110.9,trsvcid=4421,hostnqn=nqn.2014-08.org.nvmexpress:uuid:990e5a47-9734-32f6-9ee6-3497f631db95,hostid=990e5a47-9734-32f6-9ee6-3497f631db95,ctrl_loss_tmo=600'
connect ctrl, response 'instance=3,cntlid=4'
nvme3: nqn.2014-08.org.nvmexpress.discovery connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys3/nvme3
opcode : 06
flags : 00
rsvd1 : 0000
nsid : 00000000
cdw2 : 00000000
cdw3 : 00000000
data_len : 00001000
metadata_len : 00000000
addr : 7ffd78a76b90
metadata : 0
cdw10 : 00000001
cdw11 : 00000000
cdw12 : 00000000
cdw13 : 00000000
cdw14 : 00000000
cdw15 : 00000000
timeout_ms : 00000000
result : 050dff80
err : 0
latency : 71 us
failed to identify controller, error Protocol error
nvme3: nqn.2014-08.org.nvmexpress.discovery disconnected
failed to add controller, error Protocol error
qon ~ # nvme discover -t rdma -a 192.168.110.9 -s 4421 -vvv
scan controller nvme0
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
scan controller nvme0 path nvme0c0n1
scan controller nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
scan controller nvme1 path nvme1c1n1
scan controller nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme2
scan controller nvme2 path nvme2c2n1
scan subsystem nvme-subsys0
scan subsystem nvme-subsys0 namespace nvme0n1
scan subsystem nvme-subsys1
scan subsystem nvme-subsys1 namespace nvme1n1
scan subsystem nvme-subsys2
scan subsystem nvme-subsys2 namespace nvme2n1
Error opening /etc/nvme/config.json, No such file or directory
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos fast_io_fail_tmo discovery
option "keyring" ignored
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.110.9,trsvcid=4421,hostnqn=nqn.2014-08.org.nvmexpress:uuid:990e5a47-9734-32f6-9ee6-3497f631db95,hostid=990e5a47-9734-32f6-9ee6-3497f631db95,ctrl_loss_tmo=600'
connect ctrl, response 'instance=3,cntlid=4'
nvme3: nqn.2014-08.org.nvmexpress.discovery connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys3/nvme3
opcode : 06
flags : 00
rsvd1 : 0000
nsid : 00000000
cdw2 : 00000000
cdw3 : 00000000
data_len : 00001000
metadata_len : 00000000
addr : 7ffcb771cad0
metadata : 0
cdw10 : 00000001
cdw11 : 00000000
cdw12 : 00000000
cdw13 : 00000000
cdw14 : 00000000
cdw15 : 00000000
timeout_ms : 00000000
result : 00000000
err : 0
latency : 88 us
nvme3: get header (try 0/10)
opcode : 02
flags : 00
rsvd1 : 0000
nsid : 00000000
cdw2 : 00000000
cdw3 : 00000000
data_len : 00000014
metadata_len : 00000000
addr : 5646f57a7000
metadata : 0
cdw10 : 00040070
cdw11 : 00000000
cdw12 : 00000000
cdw13 : 00000000
cdw14 : 00000000
cdw15 : 00000000
timeout_ms : 00000000
result : 00000000
err : 0
latency : 80 us
could not alloc 3977915852761149491 discovery log records
failed to get discovery log: Cannot allocate memory
nvme3: nqn.2014-08.org.nvmexpress.discovery disconnected
ahh maybe it's a buffer alignment problem:
addr : 7ffd78a76b90
That really looks like a buffer alignment issue. RDMA is working on pages (that's the 'dma' bit in RDMA ...), so I would expect it to require the buffer to be page aligned. Hmm.
Can you try again after updating nvme-cli?
I tried with the latest nvme-cli. Attached is a log with several attempts. Most of them fail with the protocol error. But I also still got the memory allocation error once. And once it also failed with a different error which I have never seen before.
The first identify command returns
result : 43ee43ee
This is a really funky value. The rest seems to be okay. I don't have a lot experience with RDMA. Are you sure the host and the target can 'talk' to each other correctly? Is there a way to check this?
Yes, as I wrote in the original post, I can mount one of the 3 disks via rdma, and that disk works normally, and I also get descent 1gb/s transfer rate with that disk.
For the other 2 disks, mounting them fails if I connect via rdma, but I can also read them correctly via dd.
I'm hitting a bug which looks very similar as hit. I get it when running a blktests test case with RDMA transport but not with the others transports. This seems to be a kernel thing. Maybe I can shed some light on it with the reproducer in my test bench.
Until now the funky result values have not correctly propagated. This is way I seem these now as well. Still searching were we do not set the result field correctly.
Are there any updates on this, can I do anything to help debugging?
I've got a kernel patch for rdma but haven't found time to post it yet. If you could test this, it would really help.
From 9f2480e883d12541289fe6187057991c42b4c589 Mon Sep 17 00:00:00 2001
From: Daniel Wagner <dwagner@suse.de>
Date: Fri, 12 Apr 2024 14:10:02 +0200
Subject: [PATCH] nvmet: always initialize cqe.result
The spec doesn't mandate that the first two double words (aka results)
for the command queue entry need to be set to 0 when they are not
used (not specified). Though, the target implemention returns for all
transports but RDMA always 0 when a command is not using it.
Let's make RDMA behave the same by always explicitly initializing it.
This also prevents leaking any stale memory data.
Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
drivers/nvme/target/core.c | 1 +
drivers/nvme/target/fabrics-cmd-auth.c | 3 ---
drivers/nvme/target/fabrics-cmd.c | 6 ------
3 files changed, 1 insertion(+), 9 deletions(-)
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 20b138ed571f..3c43d9612947 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -946,6 +946,7 @@ bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
req->metadata_sg_cnt = 0;
req->transfer_len = 0;
req->metadata_len = 0;
+ req->cqe->result.u64 = 0;
req->cqe->status = 0;
req->cqe->sq_head = 0;
req->ns = NULL;
diff --git a/drivers/nvme/target/fabrics-cmd-auth.c b/drivers/nvme/target/fabrics-cmd-auth.c
index eb7785be0ca7..ee76491e8b12 100644
--- a/drivers/nvme/target/fabrics-cmd-auth.c
+++ b/drivers/nvme/target/fabrics-cmd-auth.c
@@ -332,7 +332,6 @@ void nvmet_execute_auth_send(struct nvmet_req *req)
pr_debug("%s: ctrl %d qid %d nvme status %x error loc %d\n",
__func__, ctrl->cntlid, req->sq->qid,
status, req->error_loc);
- req->cqe->result.u64 = 0;
if (req->sq->dhchap_step != NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2 &&
req->sq->dhchap_step != NVME_AUTH_DHCHAP_MESSAGE_FAILURE2) {
unsigned long auth_expire_secs = ctrl->kato ? ctrl->kato : 120;
@@ -515,8 +514,6 @@ void nvmet_execute_auth_receive(struct nvmet_req *req)
status = nvmet_copy_to_sgl(req, 0, d, al);
kfree(d);
done:
- req->cqe->result.u64 = 0;
-
if (req->sq->dhchap_step == NVME_AUTH_DHCHAP_MESSAGE_SUCCESS2)
nvmet_auth_sq_free(req->sq);
else if (req->sq->dhchap_step == NVME_AUTH_DHCHAP_MESSAGE_FAILURE1) {
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index b23f4cf840bd..f6714453b8bb 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -226,9 +226,6 @@ static void nvmet_execute_admin_connect(struct nvmet_req *req)
if (status)
goto out;
- /* zero out initial completion result, assign values as needed */
- req->cqe->result.u32 = 0;
-
if (c->recfmt != 0) {
pr_warn("invalid connect version (%d).\n",
le16_to_cpu(c->recfmt));
@@ -304,9 +301,6 @@ static void nvmet_execute_io_connect(struct nvmet_req *req)
if (status)
goto out;
- /* zero out initial completion result, assign values as needed */
- req->cqe->result.u32 = 0;
-
if (c->recfmt != 0) {
pr_warn("invalid connect version (%d).\n",
le16_to_cpu(c->recfmt));
--
2.44.0
I just tested with your patch (applied to 6.8 kernel / gentoo linux kernel 6.8.7), and used libvnme and nvme-cli from git with master of today. It still doesn't work and fails in the same way before. Here are the logs:
qon ~ # nvme discover -t rdma -a 192.168.110.9 -s 4421 -vvv
scan controller nvme0
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme0
scan controller nvme0 path nvme0c0n1
scan controller nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme1
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme1
scan controller nvme1 path nvme1c1n1
scan controller nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme2
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme2
scan controller nvme2 path nvme2c2n1
scan subsystem nvme-subsys0
scan subsystem nvme-subsys0 namespace nvme0n1
scan subsystem nvme-subsys1
scan subsystem nvme-subsys1 namespace nvme1n1
scan subsystem nvme-subsys2
scan subsystem nvme-subsys2 namespace nvme2n1
Error opening /etc/nvme/config.json, No such file or directory
kernel supports: instance cntlid transport traddr trsvcid nqn queue_size nr_io_queues reconnect_delay ctrl_loss_tmo keep_alive_tmo hostnqn host_traddr host_iface hostid duplicate_connect disable_sqflow hdr_digest data_digest nr_write_queues nr_poll_queues tos fast_io_fail_tmo discovery
option "keyring" ignored
connect ctrl, 'nqn=nqn.2014-08.org.nvmexpress.discovery,transport=rdma,traddr=192.168.110.9,trsvcid=4421,hostnqn=nqn.2014-08.org.nvmexpress:uuid:990e5a47-9734-32f6-9ee6-3497f631db95,hostid=990e5a47-9734-32f6-9ee6-3497f631db95,ctrl_loss_tmo=600'
connect ctrl, response 'instance=3,cntlid=4'
nvme3: nqn.2014-08.org.nvmexpress.discovery connected
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys0/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys1/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys2/nvme3
lookup subsystem /sys/class/nvme-subsystem/nvme-subsys3/nvme3
opcode : 06
flags : 00
rsvd1 : 0000
nsid : 00000000
cdw2 : 00000000
cdw3 : 00000000
data_len : 00001000
metadata_len : 00000000
addr : 55d46e2aa000
metadata : 0
cdw10 : 00000001
cdw11 : 00000000
cdw12 : 00000000
cdw13 : 00000000
cdw14 : 00000000
cdw15 : 00000000
timeout_ms : 00000000
result : 00000000
err : 0
latency : 70 us
nvme3: get header (try 0/10)
opcode : 02
flags : 00
rsvd1 : 0000
nsid : 00000000
cdw2 : 00000000
cdw3 : 00000000
data_len : 00000014
metadata_len : 00000000
addr : 55d46e2aa000
metadata : 0
cdw10 : 00040070
cdw11 : 00000000
cdw12 : 00000000
cdw13 : 00000000
cdw14 : 00000000
cdw15 : 00000000
timeout_ms : 00000000
result : 00000000
err : 0
latency : 76 us
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
nvme3: nqn.2014-08.org.nvmexpress.discovery disconnected
qon ~ # nvme connect -n nqn.2014-08.org.nvmexpress.discovery -t rdma -a 192.168.110.9 -s 4421
connecting to device: nvme3
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 8192
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
00f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
01f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0250: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0260: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0270: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0290: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
02f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0300: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0310: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0320: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0330: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0340: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0350: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0360: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0370: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0380: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0390: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
03f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0400: 01 01 03 04 02 00 ff ff 20 00 00 00 00 00 00 00 "................"
0410: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0420: 34 34 32 31 00 00 00 00 00 00 00 00 00 00 00 00 "4421............"
0430: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0440: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0450: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0460: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0470: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0490: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
04f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0500: 6e 71 6e 2e 32 30 31 34 2d 30 38 2e 6f 72 67 2e "nqn.2014-08.org."
0510: 6e 76 6d 65 78 70 72 65 73 73 2e 64 69 73 63 6f "nvmexpress.disco"
0520: 76 65 72 79 00 00 00 00 00 00 00 00 00 00 00 00 "very............"
0530: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0540: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0550: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0560: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0570: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0590: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
05f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0600: 31 39 32 2e 31 36 38 2e 31 31 30 2e 39 00 00 00 "192.168.110.9..."
0610: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0620: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0630: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0640: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0650: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0660: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0670: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
06f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0700: 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0710: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0720: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0730: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0740: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0750: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0770: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0790: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
07f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0800: 01 01 02 04 02 00 ff ff 20 00 00 00 00 00 00 00 "................"
0810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0820: 34 34 32 31 00 00 00 00 00 00 00 00 00 00 00 00 "4421............"
0830: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0840: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0850: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0860: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0870: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0890: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
08f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0900: 71 6f 6e 31 00 00 00 00 00 00 00 00 00 00 00 00 "qon1............"
0910: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0920: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0930: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0940: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0950: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0960: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0970: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0990: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
09f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a00: 31 39 32 2e 31 36 38 2e 31 31 30 2e 39 00 00 00 "192.168.110.9..."
0a10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0a90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0aa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ab0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ac0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b00: 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0b90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ba0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0bb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0bc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0bd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0be0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0bf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c00: 01 01 02 04 02 00 ff ff 20 00 00 00 00 00 00 00 "................"
0c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c20: 34 34 32 31 00 00 00 00 00 00 00 00 00 00 00 00 "4421............"
0c30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0c90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0cb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0cc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0cd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ce0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0cf0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d00: 71 6f 6e 32 00 00 00 00 00 00 00 00 00 00 00 00 "qon2............"
0d10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0d90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0da0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0db0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0dc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0dd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0de0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0df0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e00: 31 39 32 2e 31 36 38 2e 31 31 30 2e 39 00 00 00 "192.168.110.9..."
0e10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0e90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ef0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f00: 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0f90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1000: 01 01 02 04 02 00 ff ff 20 00 00 00 00 00 00 00 "................"
1010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1020: 34 34 32 31 00 00 00 00 00 00 00 00 00 00 00 00 "4421............"
1030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
10f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1100: 71 6f 6e 33 00 00 00 00 00 00 00 00 00 00 00 00 "qon3............"
1110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1190: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
11f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1200: 31 39 32 2e 31 36 38 2e 31 31 30 2e 39 00 00 00 "192.168.110.9..."
1210: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1220: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1230: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
1240: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
Alright, so the results
garbage is not really making the host confused. Slowly running out of idea. The only difference between what libnvme is doing and the shell command is the transfer length. Could you limit it to 20 bytes and see if this changes anything?
# nvme get-log /dev/nvme3 -i 0x70 -l 20
I executed the command a couple of times. Most times the result seems correct, but once apparently I got garbage:
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 20
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 "...."
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 20
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 00 00 00 00 35 66 36 65 38 30 33 62 33 39 32 64 "....5f6e803b392d"
0010: 39 37 36 37 "9767"
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 20
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 "...."
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 20
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 "...."
qon ~ # nvme get-log /dev/nvme3 -i 0x70 -l 20
Device:nvme3 log-id:112 namespace-id:0xffffffff
0 1 2 3 4 5 6 7 8 9 a b c d e f
0000: 09 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 "................"
0010: 00 00 00 00 "...."
Either the transport is having troubles or the target implementation is at fault. I think it's more likely that the target is having some sort of race and libnvme is able to hit it very reliable.
libnvme is sending two commands back to back and the second gets corrupted data. I suspect from your last test if we would wait a bit between the commands the second command could succeed.
Maybe we see something on the ftrace (on the target):
# cd /sys/kernel/debug/tracing/
# echo 1 > events/nvmet/enable
# echo 1 > tracing_on
[wait for the failure]
# echo 0 > tracing_on
# cat trace > ~/nvmet-trace.txt
and also could you enable the debug logs on the target side?
# echo 8 > /proc/sys/kernel/printk
# echo 'file drivers/nvme/target/*.c +p' > /sys/kernel/debug/dynamic_debug/control
Sure, will do so. But I'll be traveling the next days, so only have access to the system in ~1 week.
Still interested to get this working? In the meantime the kernel patch has been accepted. So the random results
values should be gone. But unfortunately, this was the cause of the problem. Still need the logs from above.
Sure, I'd like to have this fixed. I tried the trace command, but didn't work. I assume since something was not enabled in the kernel. Need to check. But then I was again traveling the last 4 weeks and couldn't work on this. Sorry for the slow process. I'll report back when I have something, and will try with latest kernel / sw when I am back.
Okay, no worries, I am busy too.
I've double checked the commands and on my system they work. So it really sounds like your missing the kernel config options. Are you running a distro kernel? In any case the CONFIG_FTRACE and CONFIG_DYNAMIC_DEBUG options would be necessary.
I finally found some time to get the traces with the CONFIG_DYNAMIC_DEBUG
trace file from the target is attached, taken during the nvme discover ...
operations, which fails.
For reference, here is the dmesg during this time:
[Mon Aug 12 20:55:16 2024] nvmet_rdma: connect request (4): status 0 id 00000000ff1fce59
[Mon Aug 12 20:55:16 2024] nvmet_rdma: added mlx4_0.
[Mon Aug 12 20:55:16 2024] nvmet_rdma: nvmet_rdma_create_queue_ib: max_cqe= 8191 max_sge= 32 sq_size = 509 cm_id= 00000000ff1fce59
[Mon Aug 12 20:55:16 2024] nvmet_rdma: established (9): status 0 id 00000000ff1fce59
[Mon Aug 12 20:55:16 2024] nvmet: ctrl 1 start keep-alive timer for 120 secs
[Mon Aug 12 20:55:16 2024] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:990e5a47-9734-32f6-9ee6-3497f631db95.
[Mon Aug 12 20:55:17 2024] nvmet_rdma: disconnected (10): status 0 id 00000000ff1fce59
[Mon Aug 12 20:55:17 2024] nvmet_rdma: cm_id= 00000000ff1fce59 queue->state= 1
[Mon Aug 12 20:55:17 2024] nvmet_rdma: freeing queue 0
[Mon Aug 12 20:55:17 2024] nvmet: ctrl 1 stop keep-alive
So the disconnected (10): status 0 id 00000000ff1fce59
is from nvmet_rdma_cm_handler
function. This really looks like that the RDMA driver is unhappy with something. I can't really tell, I don't have access to the spec. I suggest you bring this up on the nvme/rdma mailing list. Sorry about that.
Just for reference, I had the chance to test with ConnectX6 NICs instead of ConnectX3, and I got the very same problem. Otherwise, didn't have time to investigate further / report.
Some more info for reference: As I said, I can do nvme connect
despite the nvme discover
failure, and then I see the disk as block device. I tried reading from the block device with dd
with various block sizes, and every ready returns the correct result (I was thinking perhaps it depends on the read size, and small reads as for the discover might fail).
I have also ran ib_write_bw
, ib_read_bw
and ib_send_bw
benchmarks with all transfer sizes with both systems both as client and server, all worked nicely.
What I am wondering as a side note: Didn't you say the error propagation should be fixed now?
Since for me, when it fails the nvme discover
still reports
could not alloc memory for discovery log page
failed to get discovery log: Cannot allocate memory
Shouldn't it produce a proper error message meanwhile? Just tried with nvme-cli and libnvme master of today.
I had to revert those error propagation changes. I understood it wrongly how the error path works. Those changes broke a bunch of stuff.
I have to look into this once again, maybe this time I see something. There must be something be very weird about the get log page command if connect just works.
Maybe some other admin commands would fail too, e.g. nvme id-ctrl
or nvme id-ns
.
One thing which could be wrong with the get-log
command is that it doesn't set the namespace id to 0. To really match
with what the discover code in libnvme does, would be
nvme get-log /dev/nvme3 -i 0x70 -l 20 -n 0 -vv
But again, we know that the discover command setup by nvme-cli also fails and sees garbage in the return buffer.
I wonder if any of the other admin commands also fail? This could point to the admin handling code in nvmet. Could you try to trigger an error with nvme id-ctrl /dev/nvme3
, e.g. while nvme id-ctrl /dev/nvme3 ; do true ; done
Also can you do a test if nvme get-log /dev/nvme3 -i 0x70 -l 4096 -n 0 -vv
ever returns garbage? Trying to figure out if the SGL inlining doesn't always work.
I am posting here as libnvme issue, but I am not really sure if the problem is in libnvme, nvme-cli, the kernel or somewhere else.
I have 2 computers, pc1 and pc2, connected via ConnectX3 Infiniband adapters, with pc1 mounting 3 remote nvme SSDs from pc2: a samsung 990 Pro 2TB and 2 Corsair MP600 Pro 8TB.
I am having 2 problems:
nvme discover -t rdma -a [IP] -s 4420
, I get the error:nvme connect -n [SUBSYS] -t rdma -a [IP] -s 4420, and then the Samsung 2TB SDD seem to work normally, but I cannot mount a file system from the Corsair SSDs (even though I see them normally in
/proc/partitions`).I tried the following things already:
tcp
type instead ofrdma
in all nvme commands, everything works.nvme discover
.nvme discovery
error: I debugged what happens in libnvme when I get thecould not alloc memory for discovery log page
, and it seems thenumrec
value here https://github.com/linux-nvme/libnvme/blob/691f809a4d6864e56c92326ce7d047f3f479aca3/src/nvme/fabrics.c#L1089 is bogus, thus libnvme tries to allocate a very large amount of memory and fails.FAT-fs (nvme1n1p1): bogus number of reserved sectors
and from the mount command:wrong fs type, bad option, bad superblock on ..., missing codepage or helper program, or other error
. Mounting the very same file system when connecting viatcp
instead ofrdma
works correctly. I wanted to check if I get corrupted data from this disk viardma
, so I dumped the whole filesystem to an external disk withdd
, and compared what I read when connecting withrdma
and withtcp
, and it is fully identical. So the data I am getting from the disk is correct, mounting fails for a different reason. Sector size and disk size I get when opening the disk withgdisk
are also identical.So in summary, everything works when connecting via
tcp
, but when connecting viardma
thenvme discovery
fails since thenumrec
entry is bogus, and withrdma
I can mount filesystems from the 2TB samsung SSD correctly, but filesystems from the 8TB corsair SSDs fail to mount, despite the data I am reading from the disk withdd
is fully correct.I would be thankful for any advise, or recommendation where to ask. I you'd need me to conduct any tests, please let me know.