osandov / blktests

Linux kernel block layer testing framework
117 stars 75 forks source link

block/009 failed #17

Closed gurugio closed 6 years ago

gurugio commented 6 years ago

Hi,

The block/009 test case failed as below. I tested it on two machines and both resulted the same failure.

gohkim@ws00837:~/work/tools/blktests$ sudo ./check block/009
block/009 (check page-cache coherency after BLKDISCARD)      [failed]
    runtime  ...  0,977s
    --- tests/block/009.out 2018-04-23 17:52:00.222822301 +0200
    +++ results/nodev/block/009.out.bad 2018-04-23 17:54:34.571254091 +0200
    @@ -1,8 +1,17 @@
     Running block/009
     0000000 0000 0000 0000 0000 0000 0000 0000 0000
     *
    +0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
    +*
    +0c00000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/block/009.out results/nodev/block/009.out.bad' to see the entire diff)
gohkim@ws00837:~/work/tools/blktests$ cat results/nodev/block/009.out.bad 
Running block/009
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
*
0c00000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
*
0c00000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
modprobe: FATAL: Module scsi_debug is in use.
Test complete

If I ran blkdiscard after the test, the sdg device was cleared.

root@ib1:/home/gkim/pb-ltp/blktests# ./check block/009
block/009 (check page-cache coherency after BLKDISCARD)      [failed]
    runtime  2.334s  ...  2.380s
    --- tests/block/009.out 2018-04-23 14:21:33.011890789 +0200
    +++ results/nodev/block/009.out.bad 2018-04-23 18:07:12.451914680 +0200
    @@ -1,8 +1,17 @@
     Running block/009
     0000000 0000 0000 0000 0000 0000 0000 0000 0000
     *
    +0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
    +*
    +0c00000 0000 0000 0000 0000 0000 0000 0000 0000
    +*
    ...
    (Run 'diff -u tests/block/009.out results/nodev/block/009.out.bad' to see the entire diff)
root@ib1:/home/gkim/pb-ltp/blktests# blkdiscard /dev/sdg
root@ib1:/home/gkim/pb-ltp/blktests# dd if=/dev/sdg bs=4M iflag=direct | hexdump
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
8+0 records in
8+0 records out
33554432 bytes (34 MB) copied, 0.122978 s, 273 MB/s
2000000

I hope you could guess the reason. Please inform me if you need more information.

sitsofe commented 6 years ago

@gurugio Might be linked to your message:

modprobe: FATAL: Module scsi_debug is in use.

So maybe you made stacked device or a filesystem atop the device that scsi_debug made thus locking it open...

gurugio commented 6 years ago

I checked a reference counter of the scsi_debug module. The reference counter was 0 just after it was loaded. So I think there is nothing else to use the scsi_debug module. The reference counter was 1 when modprobe tried to unload the scsi_debug module. And it became 0 at the end of the test.

After two xfs_io executed, the reference counter became 2. Is there a possibility that xfs_io was not finished?

gohkim@ws00837:~/work/tools/blktests$ sudo modprobe -r scsi_debug
gohkim@ws00837:~/work/tools/blktests$ git diff tests/block/009
diff --git a/tests/block/009 b/tests/block/009
index 69d0be7..9b64500 100755
--- a/tests/block/009
+++ b/tests/block/009
@@ -38,19 +38,27 @@ test() {
                return 1
        fi

+       lsmod | grep scsi
+       
        local dev="/dev/${SCSI_DEBUG_DEVICES[0]}"
        # Initialize data with known pattern
        xfs_io -c "pwrite -S 0xaa -b 2M 0 32M -w" -d "$dev" >>"$FULL"

+       lsmod | grep scsi
        # Fill page cache with dirty data
        xfs_io -c "pwrite -S 0xbb -b 2M 2M 10M" -f "$dev" >>"$FULL"
+       lsmod | grep scsi

        blkdiscard "$dev"
+
        # After device was fully discarded, read sould return all zeroes
        dd if="$dev" bs=4k  2>>"$FULL" | hexdump
        dd if="$dev" bs=4M iflag=direct 2>>"$FULL" | hexdump
+       lsmod | grep scsi

        _exit_scsi_debug
+       lsmod | grep scsi

        echo "Test complete"
+       lsmod | grep scsi
 }
gohkim@ws00837:~/work/tools/blktests$ cat results/nodev/block/009.out.bad 
Running block/009
scsi_debug             77824  0
scsi_debug             77824  1
scsi_debug             77824  2
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
scsi_debug             77824  1
modprobe: FATAL: Module scsi_debug is in use.
scsi_debug             77824  1
Test complete
scsi_debug             77824  0

I added a "-d" option to pwrite for direct IO and had no data conflict.

gohkim@ws00837:~/work/tools/blktests$ git diff
diff --git a/tests/block/009 b/tests/block/009
index 69d0be7..c72627f 100755
--- a/tests/block/009
+++ b/tests/block/009
@@ -40,12 +40,13 @@ test() {

        local dev="/dev/${SCSI_DEBUG_DEVICES[0]}"
        # Initialize data with known pattern
-       xfs_io -c "pwrite -S 0xaa -b 2M 0 32M -w" -d "$dev" >>"$FULL"
+       xfs_io -c "pwrite -S 0xaa -b 2M 0 32M -d -w" -d "$dev" >>"$FULL"

        # Fill page cache with dirty data
-       xfs_io -c "pwrite -S 0xbb -b 2M 2M 10M" -f "$dev" >>"$FULL"
+       xfs_io -c "pwrite -S 0xbb -b 2M 2M 10M -d" -f "$dev" >>"$FULL"

        blkdiscard "$dev"
+
        # After device was fully discarded, read sould return all zeroes
        dd if="$dev" bs=4k  2>>"$FULL" | hexdump
        dd if="$dev" bs=4M iflag=direct 2>>"$FULL" | hexdump
gohkim@ws00837:~/work/tools/blktests$ sudo ./check block/009
block/009 (check page-cache coherency after BLKDISCARD)      [failed]
    runtime  1,080s  ...  1,074s
    --- tests/block/009.out 2018-04-23 17:52:00.222822301 +0200
    +++ results/nodev/block/009.out.bad 2018-04-24 10:56:02.220782318 +0200
    @@ -5,4 +5,5 @@
     0000000 0000 0000 0000 0000 0000 0000 0000 0000
     *
     2000000
    +modprobe: FATAL: Module scsi_debug is in use.
     Test complete

I checked /sys/block/sdc/stat and found that data was written to the disk after blkdiscard.

gohkim@ws00837:~/work/tools/blktests$ git diff
diff --git a/tests/block/009 b/tests/block/009
index 69d0be7..fbf91f4 100755
--- a/tests/block/009
+++ b/tests/block/009
@@ -44,13 +44,17 @@ test() {

        # Fill page cache with dirty data
        xfs_io -c "pwrite -S 0xbb -b 2M 2M 10M" -f "$dev" >>"$FULL"
-
+       cat /sys/block/sdc/stat
        blkdiscard "$dev"
+       cat /sys/block/sdc/stat
        # After device was fully discarded, read sould return all zeroes
        dd if="$dev" bs=4k  2>>"$FULL" | hexdump
        dd if="$dev" bs=4M iflag=direct 2>>"$FULL" | hexdump
+       cat /sys/block/sdc/stat

        _exit_scsi_debug

        echo "Test complete"
+       cat /sys/block/sdc/stat
+
 }
gohkim@ws00837:~/work/tools/blktests$ cat results/nodev/block/009.out.bad 
Running block/009
      33        0      704      132       65        0    65536      264        0      188      392
      34        0      712      136       67        0   131072      280        1      196      408  ---> after blkdiscard
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
*
0c00000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0200000 bbbb bbbb bbbb bbbb bbbb bbbb bbbb bbbb
*
0c00000 0000 0000 0000 0000 0000 0000 0000 0000
*
2000000
     308        0   111296     1244       87     2540   151552      360        0      640     1596 ---> after dd
modprobe: FATAL: Module scsi_debug is in use.
Test complete
     308        0   111296     1244       87     2540   151552      360        0      640     1596
osandov commented 6 years ago

@gurugio what kernel version is this on? It looks like the fix that this test is testing went in for 4.15.

I've seen the modprobe issue but haven't figured out who is still holding the device open. But let's focus on the test failure first.

gurugio commented 6 years ago

I am using Ubuntu and the standard kernel as following.

gohkim@ws00837:~/work$ uname -a
Linux ws00837 4.13.0-16-generic #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
gohkim@ws00837:~/work$ cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=17.10
DISTRIB_CODENAME=artful
DISTRIB_DESCRIPTION="Ubuntu 17.10"
gurugio commented 6 years ago

My colleague found a patch related to discard block device in kernel log. Should I apply that patch?

351499a1 block: Invalidate cache on discard v2
osandov commented 6 years ago

Yes, that is the patch I'm talking about. All of the regression tests in blktests mention what patch they depend on, see https://github.com/osandov/blktests/blob/master/tests/block/009.

osandov commented 6 years ago

You will also need 0bd1ed4860d0 ("block: pass inclusive 'lend' parameter to truncate_inode_pages_range").

gurugio commented 6 years ago

OK, thank you. Then could you inform me what to do for "modprobe: FATAL: Module scsi_debug is in use." error?

osandov commented 6 years ago

@gurugio, I just pushed a commit which should fix this.

gurugio commented 6 years ago

Thanks!!