Linux 5.3.9 crash with ocfs2

vvidic commented 4 years ago

While testing ocfs2-tools I noticed that kernel 5.3.9 enters a crash loop after some ocfs2 errors so I wonder if this is a know bug somewhere?

[  101.538203] ocfs2: Mounting device (7,0) on (node 0, slot 0) with ordered data mode.
[  101.616011] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  101.619168] floppy: error 10 while reading block 0
[  101.720020] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  101.724018] floppy: error 10 while reading block 0
[  101.847482] (cp,1633,0):ocfs2_xa_set:2242 ERROR: status = -22
[  101.848302] (cp,1633,0):ocfs2_xattr_ibody_set:2786 ERROR: status = -22
[  105.999960] o2dlm: Leaving domain F2DFFDA43E844B1CB2F230D9E08CF18E
[  106.051772] blk_update_request: I/O error, dev fd0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  106.055753] floppy: error 10 while reading block 0
[  107.536390] ocfs2: Unmounting device (7,0) on (node 0)
[  108.116585] ocfs2: Unregistered cluster interface o2cb
[  121.227597] general protection fault: 0000 [#1] SMP PTI
[  121.228131] CPU: 0 PID: 2172 Comm: invoke-rc.d Not tainted 5.3.0-2-amd64 #1 Debian 5.3.9-2
[  121.228906] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  121.229692] RIP: 0010:__kmalloc+0xa1/0x260
[  121.230083] Code: 65 49 8b 50 08 65 4c 03 05 44 a4 bb 6d 4d 8b 30 4d 85 f6 0f 84 90 01 00 00 41 8b 5f 20 49 8b 3f 48 8d 4a 01 4c 89 f0 4c 01 f3 <48> 33 1b 49 33 9f 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 bd
[  121.232383] RSP: 0018:ffffa75940343c08 EFLAGS: 00010286
[  121.233441] RAX: 817591233dcf4f99 RBX: 817591233dcf4f99 RCX: 00000000000570f0
[  121.234679] RDX: 00000000000570ef RSI: 0000000000000dc0 RDI: 000000000002d8c0
[  121.235929] RBP: 0000000000000dc0 R08: ffff96117ea2d8c0 R09: 0000000001010101
[  121.237165] R10: 000000000101012e R11: 00000000d118c126 R12: 0000000000000032
[  121.238398] R13: ffff96117e401700 R14: 817591233dcf4f99 R15: ffff96117e401700
[  121.239624] FS:  00007f86494f1740(0000) GS:ffff96117ea00000(0000) knlGS:0000000000000000
[  121.240934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  121.242042] CR2: 000055ca6dba9a98 CR3: 0000000039578000 CR4: 00000000000006f0
[  121.243272] Call Trace:
[  121.244122]  ? ext4_htree_store_dirent+0x36/0x100 [ext4]
[  121.245204]  ext4_htree_store_dirent+0x36/0x100 [ext4]
[  121.246289]  htree_dirblock_to_tree+0xf5/0x290 [ext4]
[  121.247363]  ? terminate_walk+0x8c/0x100
[  121.248317]  ext4_htree_fill_tree+0xa2/0x2e0 [ext4]
[  121.249371]  ? mem_cgroup_charge_statistics+0x4c/0xd0
[  121.250440]  ? _cond_resched+0x15/0x30
[  121.251400]  ? kmem_cache_alloc_trace+0x158/0x210
[  121.252449]  ext4_readdir+0x6fc/0xa90 [ext4]
[  121.253439]  iterate_dir+0x143/0x1a0
[  121.254373]  ksys_getdents64+0x9c/0x130
[  121.255340]  ? filldir+0x170/0x170
[  121.256268]  __x64_sys_getdents64+0x16/0x20
[  121.257284]  do_syscall_64+0x53/0x140
[  121.258227]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

josephhz commented 4 years ago

Hi, could you please try the following patch: https://oss.oracle.com/pipermail/ocfs2-devel/2019-November/014373.html

rafaeldtinoco commented 4 years ago

Alright, so I have just tested with:

commit 5d70d1a82dcb (HEAD -> master, tag: Ubuntu-5.4.0-9.12, origin/master, origin/HEAD) Author: Seth Forshee seth.forshee@canonical.com Date: Mon Dec 16 20:54:20 2019

UBUNTU: Ubuntu-5.4.0-9.12

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>

Which contains the commits you mentioned:

commit 94b07b6f9e2e Author: Joseph Qi joseph.qi@linux.alibaba.com Date: Fri Nov 22 01:53:52 2019

Revert "fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()"

commit 56e94ea132bb Author: Jia-Ju Bai baijiaju1990@gmail.com Date: Mon Oct 7 00:57:50 2019

fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()

After executing the reproducer... you just need to wait and kernel bug will happen:

[    7.320804] ocfs2: Registered cluster interface o2cb
[    7.345918] OCFS2 User DLM kernel interface loaded
[    7.363120] o2hb: Heartbeat mode set to local
[   91.876740] ocfs2: Unregistered cluster interface o2cb
[   92.040182] ocfs2: Registered cluster interface o2cb
[   92.050382] OCFS2 User DLM kernel interface loaded
[   92.057580] o2hb: Heartbeat mode set to local
[  100.101844] random: crng init done
[  100.101849] random: 7 urandom warning(s) missed due to ratelimiting
[  100.929650] o2dlm: Joining domain D99E3898F9364C40803F59DCC47A76B8
[  100.929651] (
[  100.929653] 0
[  100.929654] ) 1 nodes
[  100.930279] JBD2: Ignoring recovery information on journal
[  100.932340] ocfs2: Mounting device (7,0) on (node 0, slot 0) with ordered data mode.
[  105.036050] o2dlm: Leaving domain D99E3898F9364C40803F59DCC47A76B8
[  106.956472] ocfs2: Unmounting device (7,0) on (node 0)
[  126.664934] BUG: unable to handle page fault for address: ffff9be08882fcf6
[  126.666177] #PF: supervisor read access in kernel mode
[  126.667042] #PF: error_code(0x0000) - not-present page
[  126.667920] PGD 9e201067 P4D 9e201067 PUD 0
[  126.668655] Oops: 0000 [#1] SMP NOPTI
[  126.669308] CPU: 3 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.4.3 #1
[  126.670431] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  126.671793] RIP: 0010:__kmalloc_track_caller+0xa5/0x270
[  126.672672] Code: 65 49 8b 50 08 65 4c 03 05 d0 b2 97 6a 4d 8b 38 4d 85 ff 0f 84 8b 01 00 00 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 f8>
[  126.675555] RSP: 0018:ffffbc84c0013b90 EFLAGS: 00010286
[  126.676453] RAX: ffff9be08882fcf6 RBX: ffff9be08882fcf6 RCX: 0000000000000d25
[  126.677627] RDX: 0000000000000d24 RSI: 0000000000000cc0 RDI: 000000000002f080
[  126.678787] RBP: ffffbc84c0013bc8 R08: ffff9be0fbbaf080 R09: ffff9be0fb003880
[  126.679962] R10: ffffbc84c0013d56 R11: 0000000000000000 R12: 0000000000000cc0
[  126.681206] R13: 0000000000000016 R14: ffff9be0fb003880 R15: ffff9be08882fcf6
[  126.682498] FS:  00007f00a3535980(0000) GS:ffff9be0fbb80000(0000) knlGS:0000000000000000
[  126.683951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  126.685041] CR2: ffff9be08882fcf6 CR3: 0000000139ebc000 CR4: 00000000000406e0
[  126.686340] Call Trace:
[  126.686849]  ? kstrdup_const+0x28/0x30
[  126.687576]  kstrdup+0x36/0x70
[  126.688153]  kstrdup_const+0x28/0x30
[  126.688790]  __kernfs_new_node+0x4c/0x1b0
[  126.689492]  ? mutex_lock+0x17/0x40
[  126.690119]  ? __kernfs_iattrs.isra.0+0x39/0xc0
[  126.690896]  ? kernfs_xattr_get+0x27/0x50
[  126.691595]  kernfs_new_node+0x3b/0x60
[  126.692280]  __kernfs_create_file+0x2d/0xc0
[  126.693019]  cgroup_addrm_files+0x13e/0x330
[  126.693754]  ? __kernfs_new_node+0x159/0x1b0
[  126.694494]  ? __radix_tree_replace+0x71/0x120
[  126.695264]  ? kernfs_link_sibling+0x9e/0xf0
[  126.696045]  css_populate_dir+0x11a/0x140
[  126.696750]  cgroup_mkdir+0x35a/0x520
[  126.697409]  kernfs_iop_mkdir+0x5f/0x90
[  126.698086]  vfs_mkdir+0x10d/0x1c0
[  126.698699]  do_mkdirat+0xed/0x120
[  126.699310]  __x64_sys_mkdir+0x1f/0x30
[  126.699991]  do_syscall_64+0x55/0x180
[  126.700639]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  126.701500] RIP: 0033:0x7f00a3880f0b
[  126.702296] Code: ff ff c3 0f 1f 40 00 48 8b 05 81 8f 0d 00 64 c7 00 5f 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 f3 0f 1e fa b8 53 00 00>
[  126.705950] RSP: 002b:00007ffe29861958 EFLAGS: 00000206 ORIG_RAX: 0000000000000053
[  126.707490] RAX: ffffffffffffffda RBX: 000055d4d4ac6a10 RCX: 00007f00a3880f0b
[  126.708810] RDX: 00007ffe29861820 RSI: 00000000000001ed RDI: 000055d4d4b1d0c0
[  126.710131] RBP: 00007f00a3704199 R08: 0000000000000001 R09: 2f73662f7379732f
[  126.711429] R10: 702f70756f726763 R11: 0000000000000206 R12: 0000000000000000
[  126.712758] R13: 000055d4d4ac6a10 R14: 0000000000000073 R15: 0000000000000007
[  126.714106] Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue nls_iso8859_1>

Note: I have commented lots of things from the original test, its a really basic test with mkfs and mount only.

rafaeldtinoco commented 4 years ago

$ cat ocfs2_reproducer.sh
#!/bin/sh

set -e

DISK=/disk

cleanup () {
  umount /mnt || true
  [ "$LOOP" ] && losetup -d $LOOP
  rm -f $DISK $DISK.image || true
  # service o2cb stop
  sed -i -e 's/^O2CB_ENABLED=.*/O2CB_ENABLED=false/' /etc/default/o2cb
  echo cleanup
}

trap "cleanup" 0 2 3 15

# configure cluster
HOSTNAME=$(hostname)

cat >/etc/ocfs2/cluster.conf <<EOF
cluster:
        node_count = 1
        name = ocfs2

node:
        ip_port = 7777
        ip_address = 127.0.0.1
        number = 0
        name = $HOSTNAME
        cluster = ocfs2
EOF

# start cluster
sed -i -e 's/^O2CB_ENABLED=.*/O2CB_ENABLED=true/' /etc/default/o2cb
service o2cb restart

# check cluster
echo "=== dlmfs ==="
grep '^ocfs2_dlmfs /dlm' /proc/mounts

echo "=== lsmod ==="
lsmod | grep ocfs2_stack_o2cb

echo "=== o2hbmonitor ==="
pgrep -a o2hbmonitor

# print info
echo "=== o2cluster ==="
o2cluster -r

echo "=== o2cb_ctl ==="
o2cb_ctl -I -n $HOSTNAME

# create test disk
echo "=== losetup ==="
dd if=/dev/zero of=$DISK bs=1M count=200 2>&1
LOOP=$(losetup --find --show $DISK)

# test tools
echo "=== mkfs ==="
mkfs.ocfs2 --cluster-stack=o2cb --cluster-name=ocfs2 $LOOP 2>&1

# echo "=== o2image ==="
# o2image $LOOP $DISK.image
# ls -l $DISK.image

# echo "=== fsck ==="
# fsck.ocfs2 -f -y $LOOP 2>&1

# echo "=== o2cluster ==="
# o2cluster -o $LOOP

# echo "=== tunefs ==="
# tunefs.ocfs2 -L $DISK -N 3 -Q 'Label = %V\nNumSlots = %N\n' $LOOP

# echo "=== debugfs ==="
# debugfs.ocfs2 -R stats $LOOP

# echo "=== o2info ==="
# o2info --volinfo $LOOP

# echo "=== grow ==="
# dd if=/dev/zero of=$DISK bs=1M count=50 seek=200 2>&1
# losetup --set-capacity $LOOP
# tunefs.ocfs2 -S $LOOP

echo "=== mount ==="
mount $LOOP /mnt
df /mnt

# echo "=== mounted ==="
# mounted.ocfs2 -d
# mounted.ocfs2 -f

# echo "=== defragfs ==="
# cp -a /bin /mnt
# defragfs.ocfs2 -v /mnt

rafaeldtinoco commented 4 years ago

Last good test:

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-eoan/eoan/amd64/o/ocfs2-tools/20190910_143817_4312f@/log.gz

Kernel: 5.2.0-15-generic

First failure:

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-eoan/eoan/amd64/o/ocfs2-tools/20190918_060126_bd321@/log.gz

Kernel: 5.3.0-10-generic

Let me know if you want me to try to bisect it.. not sure if its straightforward as 5.2 and 5.3 trees aren't the same for our kernel (I might try with vanilla upstream if needed).

josephhz commented 4 years ago

Could you please check if your kernel has the following commit? e581595ea29c ocfs: no need to check return value of debugfs_create functions If so, it has been fixed by: b73eba2a867e ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less Please have a try.

vvidic commented 4 years ago

I confirm this solves the problem for me on Debian side. Tested with v5.5-rc3-519-g6327edceb62b

rafaeldtinoco commented 4 years ago

Well, I'm using kernel 5.4.0-9 (which contains that commit) and still got:

[ 5160.614832] BUG: unable to handle page fault for address: ffff97e78969bad6
[ 5160.616217] #PF: supervisor read access in kernel mode
[ 5160.617268] #PF: error_code(0x0000) - not-present page
[ 5160.618289] PGD b4401067 P4D b4401067 PUD 0
[ 5160.619169] Oops: 0000 [#1] SMP NOPTI
[ 5160.619931] CPU: 1 PID: 1416 Comm: modprobe Kdump: loaded Not tainted 5.4.0-9-generic #12-Ubuntu
[ 5160.621632] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 5160.623228] RIP: 0010:__kmalloc_track_caller+0xa1/0x270
[ 5160.624264] Code: 65 49 8b 50 08 65 4c 03 05 e4 15 57 4e 4d 8b 38 4d 85 ff 0f 84 8e 01 00 00 41 8b 59 20 49 8b 39 48 8d 4a 01 4c 89 f8
4c 01 fb <48> 33 1b 49 33 99 70 01 00 00 65 48 0f c7 0f 0f 94 c0 84 c0 74 bd
[ 5160.627747] RSP: 0018:ffffb71d4048ba88 EFLAGS: 00010282
[ 5160.628792] RAX: ffff97e78969bad6 RBX: ffff97e78969bad6 RCX: 00000000000043be
[ 5160.630183] RDX: 00000000000043bd RSI: 0000000000000cc0 RDI: 000000000002f080
[ 5160.631564] RBP: ffffb71d4048bac0 R08: ffff97e7fbaaf080 R09: ffff97e7fb003880
[ 5160.632956] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000cc0
[ 5160.634337] R13: 0000000000000013 R14: ffff97e7fb003880 R15: ffff97e78969bad6
[ 5160.635718] FS:  00007f6c97fea500(0000) GS:ffff97e7fba80000(0000) knlGS:0000000000000000
[ 5160.637287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5160.638425] CR2: ffff97e78969bad6 CR3: 000000013254a000 CR4: 00000000000406e0
[ 5160.639849] Call Trace:
[ 5160.640414]  ? kstrdup_const+0x24/0x30
[ 5160.641223]  kstrdup+0x32/0x60
[ 5160.641892]  kstrdup_const+0x24/0x30
[ 5160.642660]  __kernfs_new_node+0x4c/0x1b0
[ 5160.643512]  ? __kernfs_new_node+0x159/0x1b0
[ 5160.644397]  kernfs_new_node+0x37/0x60
[ 5160.645197]  __kernfs_create_file+0x29/0xc0
[ 5160.646076]  sysfs_add_file_mode_ns+0x9f/0x180
[ 5160.647005]  internal_create_group+0x115/0x370
[ 5160.647927]  sysfs_create_group+0x13/0x20
[ 5160.648781]  mod_sysfs_setup+0x3f5/0x620
[ 5160.649630]  load_module+0xfcc/0x1210
[ 5160.650410]  __do_sys_finit_module+0xbe/0x120
[ 5160.651324]  ? __do_sys_finit_module+0xbe/0x120
[ 5160.652270]  __x64_sys_finit_module+0x1a/0x20
[ 5160.653193]  do_syscall_64+0x57/0x190
[ 5160.653972]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 5160.655018] RIP: 0033:0x7f6c97bd395d
[ 5160.655780] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 03 e5 0c 00 f7 d8 64 89 01 48
[ 5160.659365] RSP: 002b:00007ffdf2ee1878 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5160.660874] RAX: ffffffffffffffda RBX: 000055e5c686cb50 RCX: 00007f6c97bd395d
[ 5160.662299] RDX: 0000000000000000 RSI: 000055e5c4d9a348 RDI: 0000000000000005
[ 5160.663719] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[ 5160.665144] R10: 0000000000000005 R11: 0000000000000246 R12: 000055e5c4d9a348
[ 5160.666564] R13: 0000000000000000 R14: 000055e5c686ccd0 R15: 000055e5c686cb50

I'll check with latest upstream, like vvidic did, and check. If it is good then I'll go ahead and bisect as I seem to have a good reproducer.

vvidic commented 4 years ago

On Mon, Jan 13, 2020 at 09:04:48AM -0800, Rafael David Tinoco wrote:

Well, I'm using kernel 5.4.0-9 (which contains that commit) and still got:

AFAICT, the fix is in a newer version than 5.4:

$ git describe e581595ea29c v5.2-5650-ge581595ea29c

$ git describe b73eba2a867e v5.5-rc3-224-gb73eba2a867e

-- Valentin

rafaeldtinoco commented 4 years ago

Oh, thanks vvidic! I gave a really fast look and saw e581595ea29c only, thinking it was the referred fix. Now it all makes more sense. I'll ask for a kernel stable fix on our side.

Thanks a lot!

vvidic commented 4 years ago

On Mon, Jan 13, 2020 at 11:00:13AM -0800, Rafael David Tinoco wrote:

Oh, thanks vvidic! I gave a really fast look and saw e581595ea29c only, thinking it was the referred fix. Now it all makes more sense. I'll ask for a kernel stable fix on our side.

I think 5.4 is LTS on the kernel side, please include it there too.

-- Valentin

rafaeldtinoco commented 4 years ago

Will do. For future reference if anyone finds this:

https://bugs.launchpad.net/ubuntu/+source/ocfs2-tools/+bug/1852122

markfasheh / ocfs2-tools

Linux 5.3.9 crash with ocfs2 #45