osandov / blktests

Linux kernel block layer testing framework

random nbd/002 failure #142

Closed yizhanglinux closed 2 months ago

yizhanglinux commented 3 months ago

CKI recently reported an nbd/002 failure on 6.10.0-rc4; it can also be reproduced on 6.9.

=================================15
nbd/002 (tests on partition handling for an nbd device)      [failed]
    runtime  1.507s  ...  4.495s
    --- tests/nbd/002.out	2024-06-18 21:37:53.351495157 -0400
    +++ /root/blktests/results/nodev/nbd/002.out.bad	2024-06-18 22:08:20.656824752 -0400
    @@ -1,4 +1,3 @@
     Running nbd/002
     Testing IOCTL path
    -Testing the netlink path
    -Test complete
    +Didn't have partition on ioctl path

[root@dell-r640-053 blktests]# cat results/nodev/nbd/002.full
Error: Socket failed: Connection refused
stat: cannot statx '/dev/nbd0p1': No such file or directory
Negotiation: ..size = 10240MB bs=512, sz=10737418240 bytes
stat: cannot statx '/dev/nbd0p1': No such file or directory
stat: cannot statx '/dev/nbd0p1': No such file or directory
stat: cannot statx '/dev/nbd0p1': No such file or directory
disconnect, sock, done
[root@dell-r640-053 blktests]# cat results/nodev/nbd/002.out.bad
Running nbd/002
Testing IOCTL path
Didn't have partition on ioctl path

dmesg:
[ 741.052034] run blktests nbd/002 at 2024-06-18 22:08:16
[ 741.585612] nbd0: detected capacity change from 0 to 20971520
[ 745.603328] block nbd0: NBD_DISCONNECT
[ 745.603363] block nbd0: Disconnected due to user request.
[ 745.603366] block nbd0: shutting down sockets

yizhanglinux commented 3 months ago

BTW, I tried another server and could not reproduce it there, but on the original server it reproduces within 100 runs.
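For an intermittent failure like this, a small loop that reruns the test and stops at the first failure makes the reproduction rate easier to measure. The following is only a sketch: `repeat_until_fail` is a hypothetical helper, not part of blktests, and the usage line assumes that `./check` exits with a non-zero status when a test fails.

```shell
# Hypothetical helper: run a command up to N times and stop at the first failure.
# $1: maximum number of iterations; remaining arguments: the command to run.
# Returns 0 if every run succeeded, 1 as soon as one run fails.
repeat_until_fail() {
	local max=$1 i=1
	shift
	while [ "$i" -le "$max" ]; do
		echo "=== iteration $i"
		if ! "$@"; then
			echo "failed at iteration $i"
			return 1
		fi
		i=$((i + 1))
	done
	return 0
}

# Usage sketch, run from the blktests top directory
# (assumption: ./check returns non-zero on a test failure):
# repeat_until_fail 100 ./check nbd/002
```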

kawasaki commented 3 months ago

@yizhanglinux Thanks for the reports. I will take a look at the nbd/001 and nbd/002 failures later. Right now I'm tied up with other failures and reviews...

kawasaki commented 3 months ago

I took a look at this issue and #141. Both failures show the error message "Socket failed: Connection refused", which I think means ECONNREFUSED. I grepped the kernel code for it: it does not appear in the nbd driver, only in the networking subsystems. So I guess the error means that, for some reason, the nbd-server socket is not yet ready when nbd-client connects.

I guess the fix belongs in the test case: wait for the nbd-server socket to become ready before nbd-client connects. I have created a trial fix patch on the nbd branch of my blktests repo.
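To illustrate the wait-for-readiness idea, here is a minimal sketch. It is not the trial patch itself: `wait_until_ready` and `tcp_port_open` are hypothetical helper names, and the host and port in the usage line are assumptions (10809 is the IANA-registered NBD port). The probe relies on bash's `/dev/tcp` redirection, so it needs bash.

```shell
# Hypothetical helper: retry a probe command until it succeeds.
# $1: maximum number of attempts
# $2: delay between attempts, in seconds
# remaining arguments: the probe command and its arguments
# Returns 0 as soon as the probe succeeds, 1 if all attempts fail.
wait_until_ready() {
	local attempts=$1 delay=$2 i=0
	shift 2
	while [ "$i" -lt "$attempts" ]; do
		if "$@" > /dev/null 2>&1; then
			return 0
		fi
		sleep "$delay"
		i=$((i + 1))
	done
	return 1
}

# Hypothetical probe: succeeds once something accepts TCP connections
# on host $1, port $2. Uses bash's /dev/tcp redirection (bash only);
# the subshell closes the socket again when it exits.
tcp_port_open() {
	(exec 3<> "/dev/tcp/$1/$2") 2> /dev/null
}

# Usage sketch (host and port are assumptions):
# start nbd-server first, then
# wait_until_ready 10 0.5 tcp_port_open localhost 10809 ||
#	echo "nbd-server socket did not become ready"
```

Bounding the number of attempts keeps a test from hanging forever if the server never starts, while the retry absorbs the short window in which the server process exists but is not yet listening.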

@yizhanglinux Can the failures still be recreated on your test machines? If so, could you apply my trial fix patch and see if it avoids them?

yizhanglinux commented 2 months ago

Yes, I tried the patch on the server where the failures reproduce, and I can no longer reproduce the nbd/001 and nbd/002 failures after more than 2000 runs.

======================================2322
nbd/001 (resize a connected nbd device)                      [passed]
    runtime  1.300s  ...  1.330s
nbd/002 (tests on partition handling for an nbd device)      [passed]
    runtime  1.666s  ...  1.967s

kawasaki commented 2 months ago

@yizhanglinux Thank you for the confirmation! I posted the fix to the relevant lists for review.

kawasaki commented 2 months ago

The modified fix 0c3dfdd is now in the master branch. Let me close this case.