openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.69k stars 1.75k forks source link

Lustre IO causes hang on metadata operation in zap_lookup #10536

Closed omgold closed 3 years ago

omgold commented 4 years ago

System information

Distribution Name | Centos Distribution Version | 7.8 Linux Kernel | 3.10.0-1127.8.2.el7_lustre.x86_64 Architecture | x86_64 ZFS Version | 0.8.4 SPL Version | 0.8.4

Describe the problem you're observing

When doing heavy IO on a Lustre filesystem which uses ZFS for OSTs sometimes kernel threads get stuck in zfs io code. See kernel backtrace below.

Describe how to reproduce the problem

Probably difficult, as it apparently is racy. Running IOR against the Lustre from many nodes regularly causes the issue after an hour or so.

Include any warning/errors/backtraces from the system logs

020-06-15T14:41:33.528344+02:00 ossw02 <kern.warning> kernel: LNet: Service thread pid 84931 was inactive for 1203.27s. The thread might be hung, or it might only be slow and will resume

 later. Dumping the stack trace for debugging purposes:
2020-06-15T14:41:33.528364+02:00 ossw02 <kern.info> kernel: Pid: 84931, comm: ll_ost01_238 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
2020-06-15T14:41:33.531732+02:00 ossw02 <kern.info> kernel: Call Trace:
2020-06-15T14:41:33.540841+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc0c8c54d>] __cv_timedwait_common+0x13d/0x190 [spl]
2020-06-15T14:41:33.547998+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc0c8c5d9>] __cv_timedwait_io+0x19/0x20 [spl]
2020-06-15T14:41:33.548012+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc12cf6e3>] zio_wait+0x133/0x280 [zfs]
2020-06-15T14:41:33.561171+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11d8e10>] dbuf_read+0x1f0/0x5e0 [zfs]
2020-06-15T14:41:33.568583+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11daf0e>] dbuf_hold_impl_arg+0x5be/0x640 [zfs]
2020-06-15T14:41:33.568607+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11daed6>] dbuf_hold_impl_arg+0x586/0x640 [zfs]
2020-06-15T14:41:33.582788+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db02b>] dbuf_hold_impl+0x9b/0xc0 [zfs]
2020-06-15T14:41:33.582803+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db085>] dbuf_hold_level+0x35/0x60 [zfs]
2020-06-15T14:41:33.595976+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db0c6>] dbuf_hold+0x16/0x20 [zfs]
2020-06-15T14:41:33.595989+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11e6169>] dmu_buf_hold_noread+0x89/0x120 [zfs]
2020-06-15T14:41:33.609924+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11e623c>] dmu_buf_hold+0x3c/0x90 [zfs]
2020-06-15T14:41:33.616500+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc128863e>] zap_lockdir+0x4e/0xd0 [zfs]
2020-06-15T14:41:33.623444+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1288b40>] zap_lookup_norm+0x60/0xd0 [zfs]
2020-06-15T14:41:33.623459+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1288be3>] zap_lookup+0x33/0x40 [zfs]
2020-06-15T14:41:33.637124+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1b4f231>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
2020-06-15T14:41:33.637137+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1b46fe9>] osd_object_init+0x389/0xcc0 [osd_zfs]
2020-06-15T14:41:33.652562+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc161558b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
2020-06-15T14:41:33.652575+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1619461>] lu_object_find_at+0x1e1/0xa60 [obdclass]
2020-06-15T14:41:33.667391+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1619cf6>] lu_object_find+0x16/0x20 [obdclass]
2020-06-15T14:41:33.667402+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c62185>] ofd_object_find+0x35/0x100 [ofd]
2020-06-15T14:41:33.681493+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c5e382>] ofd_destroy_by_fid+0xa2/0x4a0 [ofd]
2020-06-15T14:41:33.681506+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c56cc6>] ofd_create_hdl+0x1ac6/0x20e0 [ofd]
2020-06-15T14:41:33.696192+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc190a2ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-06-15T14:41:33.704679+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc18af29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-06-15T14:41:33.704694+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc18b2bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
2020-06-15T14:41:33.717262+02:00 ossw02 <kern.warning> kernel:  [<ffffffff978c50d1>] kthread+0xd1/0xe0
2020-06-15T14:41:33.717276+02:00 ossw02 <kern.warning> kernel:  [<ffffffff97f8cd1d>] ret_from_fork_nospec_begin+0x7/0x21
2020-06-15T14:41:33.736056+02:00 ossw02 <kern.warning> kernel:  [<ffffffffffffffff>] 0xffffffffffffff
ff
2020-06-15T14:41:33.736071+02:00 ossw02 <kern.alert> kernel: LustreError: dumping log to /tmp/lustre-log.1592224893.84931
2020-06-15T14:41:33.768359+02:00 ossw02 <kern.info> kernel: Pid: 84946, comm: ll_ost01_253 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Mon Oct 21 20:19:09 UTC 2019
2020-06-15T14:41:33.768380+02:00 ossw02 <kern.info> kernel: Call Trace:
2020-06-15T14:41:33.777750+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc0c8c54d>] __cv_timedwait_common+0x13d/0x190 [spl]
2020-06-15T14:41:33.777764+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc0c8c5d9>] __cv_timedwait_io+0x19/0x20 [spl]
2020-06-15T14:41:33.792352+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc12cf6e3>] zio_wait+0x133/0x280 [zfs]
2020-06-15T14:41:33.799182+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11d8e10>] dbuf_read+0x1f0/0x5e0 [zfs]
2020-06-15T14:41:33.799195+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11daf0e>] dbuf_hold_impl_arg+0x5be/0x640 [zfs]
2020-06-15T14:41:33.813738+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db02b>] dbuf_hold_impl+0x9b/0xc0 [zfs]
2020-06-15T14:41:33.820814+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db085>] dbuf_hold_level+0x35/0x60 [zfs]
2020-06-15T14:41:33.827906+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11db0c6>] dbuf_hold+0x16/0x20 [zfs]
2020-06-15T14:41:33.827921+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11e605c>] dmu_buf_hold_noread_by_dnode+0x3c/0xc0 [zfs]
2020-06-15T14:41:33.843735+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc11e62cc>] dmu_buf_hold_by_dnode+0x3c/0x90 [zfs]
2020-06-15T14:41:33.843752+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1282446>] zap_idx_to_blk+0xc6/0x1a0 [zfs]
2020-06-15T14:41:33.857785+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1282839>] zap_deref_leaf+0xc9/0x130 [zfs]
2020-06-15T14:41:33.857798+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1283882>] fzap_lookup+0x62/0x180 [zfs]
2020-06-15T14:41:33.871724+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc12881d1>] zap_lookup_impl+0xd1/0x200 [zfs]
2020-06-15T14:41:33.871739+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1288b6e>] zap_lookup_norm+0x8e/0xd0 [zfs]
2020-06-15T14:41:33.884974+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1288be3>] zap_lookup+0x33/0x40 [zfs]
2020-06-15T14:41:33.884991+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1b4f231>] osd_fid_lookup+0x3e1/0x4c0 [osd_zfs]
2020-06-15T14:41:33.899535+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1b46fe9>] osd_object_init+0x389/0xcc0 [osd_zfs]
2020-06-15T14:41:33.899550+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc161558b>] lu_object_start.isra.35+0x8b/0x120 [obdclass]
2020-06-15T14:41:33.915152+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1619461>] lu_object_find_at+0x1e1/0xa60 [obdclass]
2020-06-15T14:41:33.915166+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1619cf6>] lu_object_find+0x16/0x20 [obdclass]
2020-06-15T14:41:33.929298+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c62185>] ofd_object_find+0x35/0x100 [ofd]
2020-06-15T14:41:33.929310+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c5e382>] ofd_destroy_by_fid+0xa2/0x4a0 [ofd]
2020-06-15T14:41:33.943470+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc1c56cc6>] ofd_create_hdl+0x1ac6/0x20e0 [ofd]
2020-06-15T14:41:33.943483+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc190a2ea>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
2020-06-15T14:41:33.959397+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc18af29b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
2020-06-15T14:41:33.959410+02:00 ossw02 <kern.warning> kernel:  [<ffffffffc18b2bfc>] ptlrpc_main+0xb2c/0x1460 [ptlrpc]
2020-06-15T14:41:33.971945+02:00 ossw02 <kern.warning> kernel:  [<ffffffff978c50d1>] kthread+0xd1/0xe0
2020-06-15T14:41:33.971959+02:00 ossw02 <kern.warning> kernel:  [<ffffffff97f8cd1d>] ret_from_fork_nospec_begin+0x7/0x21
2020-06-15T14:41:33.983647+02:00 ossw02 <kern.warning> kernel:  [<ffffffffffffffff>] 0xffffffffffffffff
knweiss commented 4 years ago

I'm wondering: The log shows kernel version "3.10.0-1062.1.1.el7_lustre.x86_64" (Lustre 2.13.0 or 2.12.3 on CentOS 7.7?) but you wrote "CentOS 7.8" and "3.10.0-1127.8.2.el7_lustre.x86_64" in the System Information section which is probably Lustre 2.12.5. (All these Lustre version were released with ZFS 0.7.13 as far as I remember.)

omgold commented 4 years ago

@knweiss Ah, sorry, the log has been from a short time before the kernel was updated. The modules are all built by myself, so versions does not match the ones in the Lustre repos. The actual versions are then 3.10.0-1062.1.1.el7_lustre.x86_64 for the kernel (as in the log) and Lustre 2.12.4. ZFS version 0.8.4 is correct, though.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.