fwaggle commented 6 years ago

System information

Type	Version/Name
Distribution Name	Ubuntu
Distribution Version	18.04 LTS (Bionic Beaver)
Linux Kernel	4.15.0-23-generic #25-Ubuntu SMP
Architecture	x86_64
ZFS Version	v0.7.5-1ubuntu15
SPL Version	0.7.5-1ubuntu1

When using ZFS-backed LXC/LXD, hung tasks on concurrent container creation

We run into this very occasionally in production, and I managed to distill a test scenario down to guarantee I run into it.

With multiple concurrent LXD operations ongoing, ZFS will end up in a stuck state and block forever.

We can reproduce this on JonF's 0.7.9 builds as well. I can't seem to work out a scenario not involving LXC.

I spoke to some folks on Freenode about this issue, and the general consensus seems to be "turn dedup off", which we're investigating, but as it's so trivially reproducible for me I figured I'd file a report with all the data anyway.

Describe how to reproduce the problem

Test Environment:

VirtualBox VM: 8GB RAM, 4 vCPUs. Host is i7-7700k, 16GB non-ECC, on Windows 10. Both vdisks are on a 7200rpm SATA.

10GB vDisk for Root Partition - sda 100GB vDisk for ZFS pool - sdb

Install Ubuntu, install LXD 3.1 from snap, initialize lxd with a zfs storage pool on sdb, then turn dedup on for the pool.

No other configuration changes are made.

Run three or four concurrent jobs repeatedly running lxc launch, and quite quickly all disk activity will stop and two minutes later you'll get hung task messages on the console. I run four of these and it locks ZFS up very quickly:

for i in {1..20}; do lxc launch ubuntu:lts/amd64; done

After only a couple of successful container creations, everything will stop and the system will go completely idle. Any further attempts to interact with ZFS will block forever and a reboot is required.

The system never recovers, I left it overnight and it stayed stuck.

Include any warning/errors/backtraces from the system logs

[ 1450.975346] INFO: task txg_sync:888 blocked for more than 120 seconds.
[ 1450.975365]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.975377] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.975390] txg_sync        D    0   888      2 0x80000000
[ 1450.975391] Call Trace:
[ 1450.975396]  __schedule+0x297/0x8b0
[ 1450.975398]  schedule+0x2c/0x80
[ 1450.975402]  cv_wait_common+0x11e/0x140 [spl]
[ 1450.975404]  ? wait_woken+0x80/0x80
[ 1450.975407]  __cv_wait+0x15/0x20 [spl]
[ 1450.975431]  rrw_enter_write+0x3c/0xa0 [zfs]
[ 1450.975446]  rrw_enter+0x13/0x20 [zfs]
[ 1450.975461]  spa_sync+0x7c9/0xd80 [zfs]
[ 1450.975477]  txg_sync_thread+0x2cd/0x4a0 [zfs]
[ 1450.975492]  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
[ 1450.975495]  thread_generic_wrapper+0x74/0x90 [spl]
[ 1450.975497]  kthread+0x121/0x140
[ 1450.975498]  ? __thread_exit+0x20/0x20 [spl]
[ 1450.975500]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 1450.975501]  ret_from_fork+0x35/0x40
[ 1450.975509] INFO: task lxd:15361 blocked for more than 120 seconds.
[ 1450.975520]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.975531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.975544] lxd             D    0 15361   1415 0x00000000
[ 1450.975545] Call Trace:
[ 1450.975547]  __schedule+0x297/0x8b0
[ 1450.975548]  schedule+0x2c/0x80
[ 1450.975549]  rwsem_down_write_failed+0x162/0x360
[ 1450.975560]  ? dbuf_rele_and_unlock+0x1b2/0x4b0 [zfs]
[ 1450.975562]  call_rwsem_down_write_failed+0x17/0x30
[ 1450.975562]  ? call_rwsem_down_write_failed+0x17/0x30
[ 1450.975563]  down_write+0x2d/0x40
[ 1450.975565]  grab_super+0x30/0x90
[ 1450.975580]  ? zpl_create+0x160/0x160 [zfs]
[ 1450.975581]  sget_userns+0x91/0x490
[ 1450.975582]  ? get_anon_bdev+0x100/0x100
[ 1450.975597]  ? zpl_create+0x160/0x160 [zfs]
[ 1450.975598]  sget+0x7d/0xa0
[ 1450.975599]  ? get_anon_bdev+0x100/0x100
[ 1450.975613]  zpl_mount+0xa8/0x160 [zfs]
[ 1450.975614]  mount_fs+0x37/0x150
[ 1450.975616]  vfs_kern_mount.part.23+0x5d/0x110
[ 1450.975617]  do_mount+0x5ed/0xce0
[ 1450.975618]  SyS_mount+0x98/0xe0
[ 1450.975620]  do_syscall_64+0x73/0x130
[ 1450.975621]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1450.975622] RIP: 0033:0x4dbfaa
[ 1450.975623] RSP: 002b:000000c434d88890 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
[ 1450.975624] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004dbfaa
[ 1450.975625] RDX: 000000c42b6a5e6c RSI: 000000c42f3274a0 RDI: 000000c4258d18c0
[ 1450.975625] RBP: 000000c434d88930 R08: 000000c434bf9aa0 R09: 0000000000000000
[ 1450.975626] R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
[ 1450.975626] R13: 0000000000000043 R14: 0000000000000042 R15: 0000000000000066
[ 1450.975865] INFO: task lxd:11424 blocked for more than 120 seconds.
[ 1450.975877]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.975888] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.975900] lxd             D    0 11424  11371 0x00000002
[ 1450.975901] Call Trace:
[ 1450.975903]  __schedule+0x297/0x8b0
[ 1450.975905]  schedule+0x2c/0x80
[ 1450.975908]  cv_wait_common+0x11e/0x140 [spl]
[ 1450.975910]  ? wait_woken+0x80/0x80
[ 1450.975912]  __cv_wait+0x15/0x20 [spl]
[ 1450.975934]  txg_wait_synced+0xdd/0x130 [zfs]
[ 1450.975950]  zil_create+0x249/0x300 [zfs]
[ 1450.975965]  zil_commit_writer+0x6d1/0x7c0 [zfs]
[ 1450.975979]  zil_commit.part.14+0x93/0x100 [zfs]
[ 1450.975992]  zil_commit+0x17/0x20 [zfs]
[ 1450.976006]  zfs_sync+0x6e/0xb0 [zfs]
[ 1450.976021]  zpl_sync_fs+0x50/0x80 [zfs]
[ 1450.976022]  __sync_filesystem+0x33/0x60
[ 1450.976023]  sync_filesystem+0x39/0x40
[ 1450.976024]  generic_shutdown_super+0x27/0x120
[ 1450.976025]  kill_anon_super+0x12/0x20
[ 1450.976039]  zpl_kill_sb+0x1a/0x20 [zfs]
[ 1450.976040]  deactivate_locked_super+0x48/0x80
[ 1450.976041]  deactivate_super+0x40/0x60
[ 1450.976042]  cleanup_mnt+0x3f/0x80
[ 1450.976043]  __cleanup_mnt+0x12/0x20
[ 1450.976045]  task_work_run+0x9d/0xc0
[ 1450.976046]  exit_to_usermode_loop+0xc0/0xd0
[ 1450.976047]  do_syscall_64+0x115/0x130
[ 1450.976049]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1450.976049] RIP: 0033:0x7fa49974b57d
[ 1450.976050] RSP: 002b:00007ffc7f7095c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 1450.976051] RAX: 0000000000000000 RBX: 00007ffc7f70a6f0 RCX: 00007fa49974b57d
[ 1450.976052] RDX: 00007fa499bebadf RSI: 0000000000000002 RDI: 000000000000000c
[ 1450.976052] RBP: 000000000000000f R08: 0000000000000000 R09: 0000000000000001
[ 1450.976053] R10: 0000000000000751 R11: 0000000000000293 R12: 000000000000000c
[ 1450.976053] R13: 0000000002647040 R14: 0000000002647380 R15: 000000000264aaa0
[ 1450.976055] INFO: task sync:11745 blocked for more than 120 seconds.
[ 1450.976066]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.976077] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.976089] sync            D    0 11745  26377 0x00000104
[ 1450.976091] Call Trace:
[ 1450.976092]  __schedule+0x297/0x8b0
[ 1450.976094]  schedule+0x2c/0x80
[ 1450.976095]  io_schedule+0x16/0x40
[ 1450.976097]  wait_on_page_bit_common+0xd8/0x160
[ 1450.976098]  ? page_cache_tree_insert+0xe0/0xe0
[ 1450.976099]  __filemap_fdatawait_range+0xfa/0x160
[ 1450.976100]  ? __schedule+0x230/0x8b0
[ 1450.976101]  ? _cond_resched+0x19/0x40
[ 1450.976102]  ? bdi_split_work_to_wbs+0x45/0x2c0
[ 1450.976104]  ? finish_wait+0x56/0x70
[ 1450.976105]  ? _cond_resched+0x19/0x40
[ 1450.976106]  filemap_fdatawait_keep_errors+0x1e/0x40
[ 1450.976107]  sync_inodes_sb+0x20d/0x2b0
[ 1450.976108]  sync_inodes_one_sb+0x16/0x20
[ 1450.976109]  iterate_supers+0xa2/0x110
[ 1450.976110]  ? SyS_tee+0x3d0/0x3d0
[ 1450.976111]  sys_sync+0x42/0xb0
[ 1450.976112]  do_syscall_64+0x73/0x130
[ 1450.976113]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1450.976114] RIP: 0033:0x7f16d89207c7
[ 1450.976114] RSP: 002b:00007ffcf1abef68 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[ 1450.976115] RAX: ffffffffffffffda RBX: 00007ffcf1abf098 RCX: 00007f16d89207c7
[ 1450.976116] RDX: 00007f16d8be9ea0 RSI: 0000000000404740 RDI: 00007f16d89b1421
[ 1450.976116] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 1450.976117] R10: 000000000000081d R11: 0000000000000202 R12: 0000000000000000
[ 1450.976117] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
[ 1450.976119] INFO: task zfs:11808 blocked for more than 120 seconds.
[ 1450.976129]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.976140] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.976153] zfs             D    0 11808   1912 0x00000000
[ 1450.976154] Call Trace:
[ 1450.976155]  __schedule+0x297/0x8b0
[ 1450.976157]  schedule+0x2c/0x80
[ 1450.976160]  cv_wait_common+0x11e/0x140 [spl]
[ 1450.976161]  ? wait_woken+0x80/0x80
[ 1450.976163]  __cv_wait+0x15/0x20 [spl]
[ 1450.976179]  rrw_enter_read_impl+0x4e/0x160 [zfs]
[ 1450.976193]  rrw_enter+0x1c/0x20 [zfs]
[ 1450.976206]  dsl_pool_hold+0x5a/0x80 [zfs]
[ 1450.976217]  dmu_objset_hold+0x33/0xa0 [zfs]
[ 1450.976233]  zfs_ioc_objset_stats+0x32/0xa0 [zfs]
[ 1450.976248]  zfsdev_ioctl+0x1e0/0x610 [zfs]
[ 1450.976250]  do_vfs_ioctl+0xa8/0x630
[ 1450.976252]  ? handle_mm_fault+0xb1/0x1f0
[ 1450.976253]  ? __do_page_fault+0x270/0x4d0
[ 1450.976254]  ? do_sys_open+0x13e/0x2c0
[ 1450.976256]  SyS_ioctl+0x79/0x90
[ 1450.976257]  do_syscall_64+0x73/0x130
[ 1450.976258]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1450.976259] RIP: 0033:0x7f8293a95f47
[ 1450.976259] RSP: 002b:00007ffda41fdc68 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 1450.976260] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8293a95f47
[ 1450.976261] RDX: 00007ffda41fdcc0 RSI: 0000000000005a12 RDI: 0000000000000003
[ 1450.976261] RBP: 00007ffda41fdc90 R08: 000000000000ffc0 R09: 0000000000000034
[ 1450.976262] R10: 000000000000053f R11: 0000000000000202 R12: 0000000000404e10
[ 1450.976262] R13: 00007ffda4201c40 R14: 0000000000000000 R15: 0000000000000000
[ 1450.976270] INFO: task zfs:13733 blocked for more than 120 seconds.
[ 1450.976289]       Tainted: P         C OE    4.15.0-23-generic #25-Ubuntu
[ 1450.976300] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.976313] zfs             D    0 13733  13702 0x00000000
[ 1450.976314] Call Trace:
[ 1450.976316]  __schedule+0x297/0x8b0
[ 1450.976317]  schedule+0x2c/0x80
[ 1450.976320]  cv_wait_common+0x11e/0x140 [spl]
[ 1450.976321]  ? wait_woken+0x80/0x80
[ 1450.976323]  __cv_wait+0x15/0x20 [spl]
[ 1450.976339]  rrw_enter_read_impl+0x4e/0x160 [zfs]
[ 1450.976352]  rrw_enter+0x1c/0x20 [zfs]
[ 1450.976366]  dsl_pool_hold+0x5a/0x80 [zfs]
[ 1450.976376]  dmu_objset_hold+0x33/0xa0 [zfs]
[ 1450.976392]  zfs_ioc_objset_stats+0x32/0xa0 [zfs]
[ 1450.976407]  zfsdev_ioctl+0x1e0/0x610 [zfs]
[ 1450.976409]  do_vfs_ioctl+0xa8/0x630
[ 1450.976409]  ? handle_mm_fault+0xb1/0x1f0
[ 1450.976411]  ? __do_page_fault+0x270/0x4d0
[ 1450.976412]  ? do_sys_open+0x13e/0x2c0
[ 1450.976413]  SyS_ioctl+0x79/0x90
[ 1450.976414]  do_syscall_64+0x73/0x130
[ 1450.976415]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 1450.976416] RIP: 0033:0x7f8b8ae11f47
[ 1450.976416] RSP: 002b:00007ffd3ef30f68 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
[ 1450.976417] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8b8ae11f47
[ 1450.976418] RDX: 00007ffd3ef30fc0 RSI: 0000000000005a12 RDI: 0000000000000003
[ 1450.976418] RBP: 00007ffd3ef30f90 R08: 000000000000fffc R09: 0000000000000030
[ 1450.976419] R10: 000000000000053f R11: 0000000000000202 R12: 0000000000404e10
[ 1450.976419] R13: 00007ffd3ef34f40 R14: 0000000000000000 R15: 0000000000000000

perf top output:

Samples: 44K of event 'cpu-clock', Event count (approx.): 285361296
Overhead  Shared Object             Symbol                                                                                                        ◆
  14.32%  [kernel]                  [k] finish_task_switch                                                                                        ▒
   8.07%  [kernel]                  [k] _raw_spin_unlock_irqrestore                                                                               ▒
   5.00%  lxd                       [.] runtime.gentraceback                                                                                      ▒
   3.04%  [kernel]                  [k] memcpy                                                                                                    ▒
   2.23%  [kernel]                  [k] do_syscall_64                                                                                             ▒
   2.07%  libc-2.27.so              [.] _IO_vfscanf                                                                                               ▒
   1.63%  [kernel]                  [k] do_idle                                                                                                   ▒
   1.54%  [kernel]                  [k] tick_nohz_idle_enter                                                                                      ▒
   1.33%  [kernel]                  [k] format_decode                                                                                             ▒
   1.15%  [kernel]                  [k] rcu_idle_exit                                                                                             ▒
   1.14%  [kernel]                  [k] __softirqentry_text_start                                                                                 ▒
   1.08%  [kernel]                  [k] vsnprintf                                                                                                 ▒
   1.01%  [kernel]                  [k] __d_lookup                                                                                                ▒
   0.87%  [kernel]                  [k] next_tgid                                                                                                 ▒
   0.76%  lxd                       [.] runtime.procyield                                                                                         ▒
   0.73%  [kernel]                  [k] __d_lookup_rcu                                                                                            ▒
   0.68%  libc-2.23.so              [.] poll                                                                                                      ▒
   0.68%  [kernel]                  [k] do_task_stat                                                                                              ▒
   0.68%  [kernel]                  [k] e1000_alloc_rx_buffers                                                                                    ▒
   0.61%  [kernel]                  [k] number                                                                                                    ▒
   0.58%  [kernel]                  [k] tick_nohz_idle_exit                                                                                       ▒
   0.53%  [kernel]                  [k] get_pid_task                                                                                              ▒
   0.53%  [kernel]                  [k] e1000_watchdog                                                                                            ▒
   0.50%  [kernel]                  [k] seq_put_decimal_ull                                                                                       ▒
   0.50%  [kernel]                  [k] sock_poll                                                                                                 ▒
   0.47%  [kernel]                  [k] _raw_spin_lock                                                                                            ▒
   0.46%  [kernel]                  [k] get_wchan                                                                                                 ▒
   0.46%  libc-2.27.so              [.] vfprintf                                                                                                  ▒
   0.45%  [kernel]                  [k] __fget_light                                                                                              ▒
   0.43%  [kernel]                  [k] copy_user_generic_string                                                                                  ▒
   0.39%  libc-2.27.so              [.] open64

Stuck processes after the event:

  PID TTY      STAT   TIME COMMAND
 2145 ?        D      0:07 [txg_sync]
20527 ?        D      0:00 [lxc monitor] /var/snap/lxd/common/lxd/containers glad-tahr
20546 ?        D      0:00 /bin/umount -t zfs /var/snap/lxd/common/lxd/storage-pools/default/containers/novel-oarfish
21165 ?        D      0:00 zfs get -H -o name type default/containers/infinite-akita
21173 ?        D      0:00 zfs get -H -o name type default/containers/capital-cod
21404 ?        D      0:00 /sbin/apparmor_parser --write-cache --replace --cache-loc=/etc/apparmor.d/cache -- /etc/apparmor.d/lxc-containers
21427 ?        D      0:00 /sbin/apparmor_parser --write-cache --replace --cache-loc=/etc/apparmor.d/cache -- /etc/apparmor.d/lxc-containers
21764 pts/6    S+     0:00 grep --color=auto D

zpool status -D:

  pool: default
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    default     ONLINE       0     0     0
      sdb       ONLINE       0     0     0

errors: No known data errors

 dedup: DDT entries 2862, size 322B on disk, 191B in core

bucket              allocated                       referenced          
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.68K   13.4M   3.09M   3.09M    2.68K   13.4M   3.09M   3.09M
     2       15    900K    193K    193K       35   2.13M    482K    482K
     4       43    149K   46.5K   46.5K      172    596K    186K    186K
     8        3    384K     98K     98K       39   4.88M   1.24M   1.24M
    16        3    384K     98K     98K       61   7.62M   1.94M   1.94M
    32       49   2.94M   1.17M   1.17M    1.86K    111M   45.3M   45.3M
    64        5      3K   2.50K   2.50K      394    236K    197K    197K
   128        2      1K      1K      1K      372    186K    186K    186K
 Total    2.79K   18.1M   4.70M   4.70M    5.59K    140M   52.6M   52.6M

iostat -x for the zpool device:

Linux 4.15.0-23-generic (lxdtest)   06/26/2018  _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.68    0.00    1.99    2.91    0.00   93.42

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sdb              3.37   78.41     24.86   1702.01     0.02     2.14   0.45   2.66   50.16    0.76   0.23     7.37    21.71   1.39  11.34

claudiusraphaelpaeth commented 6 years ago

Now, we are two. Means it is in fact, exactly as you described. Doing the exact same with xenial+hwe in contrast, leads to extreme slowdown and from there to unusable anymore. SSD vs. HDD makes no difference, also. In bionic it is a nearly immediate please-kill-yourself. The situation on a customized-proxmox- and a vsido- -based setup is comparable.

When i eventually will be able to narrow down the specific surroundings, i will add a detailed comment, too.

@fwaggle, did you by chance experiment with different swap-settings/-types(zram/zcache/zswap) and/or limiting memory-reservation/caching-metadata-only and similar tunables?

rincebrain commented 6 years ago

7598 and #7648 are also other fun with LXD and ZoL, so you're probably more than two.

fwaggle commented 6 years ago

@claudiusraphaelpaeth I'm afraid I didn't do any checking on that, I didn't think to look at swap as once I got it to die rather quickly, the servers aren't even touching swap. I'm not sure where to start with tweaking ZFS tunables either... I didn't save the top output, but as I understand when I got it to reliably hang instantly, the CPU + memory load is extremely low - nothing stood out as a potential problem.

I'd definitely welcome any suggestions for things I can try though, since I can reproduce this outside of a production environment I can test just about anything. Unfortunately so far the only useful guidance has been "turn off dedup" which is certainly good advice (I can't reproduce the problem at all unless dedup is on, I'm convinced it goes away entirely) but not applicable for this exact use case. :(

simos commented 6 years ago

I got the same issue with Ubuntu 18.04. It is not clear whether it is an issue with the ZFS/SPL versions in the Ubuntu Linux kernel or whether it is an upstream issue.

simos commented 6 years ago

It suggests here to set spl_kmem_cache_slab_limit to 0.

Update: this workaround works.

rincebrain commented 6 years ago

You may find #7693 interesting.

behlendorf commented 6 years ago

I've opened #7693 with a proposed fix for this issue. @simos @claudiusraphaelpaeth @fwaggle if it all possible it would be very helpful if you could cherry-pick the fix and verify it resolves the deadlock your seeing. @fwaggle thanks for the detailed bug report, it had everything needed to get to the root cause.

fwaggle commented 6 years ago

@behlendorf Thanks!

I built ZFS 0.7.9 on Ubuntu Bionic (4.15.0-24-generic), verified I can reproduce the issue on the installed module (it hung after 1 container doing 5 in parallel), then applied the patch in #7693, reinstalled the module, and created 5x20 and 6x20 alpine containers in parallel without any issues a couple of times. Not super-rigorous testing, but I'm convinced.

DeadlyMercury commented 6 years ago

I can confirm this bug is fixed by #7693 in my situation too: previously I used old kernel (4.13) to avoid this bug. But I want to perform clean install with new version of zfsutils (will you backport this fix to 0.7.5 as bionic stable?) and check it again before trying this on my servers :)

behlendorf commented 6 years ago

@DeadlyMercury we'll get it in our next point release. Canonical will have to decide if they want to apply it to the version they're shipping.

simos commented 6 years ago

I have installed zfs (master: 2e5dc44) and lxd (master: 140c4a6). LXD has some recent performance fixes.

I use lxd-benchmark and created 384 containers in batches of 24. I repeated this several times.

I did not manage to get ZFS to crash/deadlock.

ubuntu@myserver:~$ time lxd-benchmark launch --count 384 --parallel 24
Test environment:
  Server backend: lxd
  Server version: 3.2
  Kernel: Linux
  Kernel architecture: x86_64
  Kernel version: 4.15.0-20-generic
  Storage backend: zfs
  Storage version: 0.7.0-1472_g2e5dc449c
  Container backend: lxc
  Container version: 3.0.1

Test variables:
  Container count: 384
  Container mode: unprivileged
  Startup mode: normal startup
  Image: ubuntu:
  Batches: 16
  Batch size: 24
  Remainder: 0

[Jul 12 19:44:41.326] Found image in local store: f2228450779fee27020d6024af587379b8f51062c32a335327f2b028c924bfa1
[Jul 12 19:44:41.327] Batch processing start
[Jul 12 19:44:54.448] Processed 24 containers in 13.121s (1.829/s)
[Jul 12 19:45:11.758] Processed 48 containers in 30.432s (1.577/s)
[Jul 12 19:45:57.044] Processed 96 containers in 75.718s (1.268/s)
[Jul 12 19:48:14.077] Processed 192 containers in 212.750s (0.902/s)
[Jul 12 19:55:58.700] Processed 384 containers in 677.373s (0.567/s)
[Jul 12 19:55:58.700] Batch processing completed in 677.373s

real    11m17,866s
user    0m6,908s
sys 0m1,208s
ubuntu@myserver:~$

simos commented 6 years ago

I continued testing and I managed to get a crash with ZFS (master: commit 2e5dc44). zfs-0.7.0-1472_g2e5dc449c

[10271.564074] INFO: task systemd:1 blocked for more than 120 seconds.
[10271.570397]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.577212] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.585071] systemd         D    0     1      0 0x00000000
[10271.585073] Call Trace:
[10271.585080]  __schedule+0x297/0x8b0
[10271.585086]  schedule+0x2c/0x80
[10271.585090]  schedule_preempt_disabled+0xe/0x10
[10271.585095]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.585100]  __mutex_lock_slowpath+0x13/0x20
[10271.585101]  ? __mutex_lock_slowpath+0x13/0x20
[10271.585102]  mutex_lock+0x2f/0x40
[10271.585106]  proc_cgroup_show+0x4c/0x2a0
[10271.585108]  proc_single_show+0x56/0x80
[10271.585111]  seq_read+0xe5/0x430
[10271.585114]  __vfs_read+0x1b/0x40
[10271.585115]  vfs_read+0x8e/0x130
[10271.585117]  SyS_read+0x55/0xc0
[10271.585120]  do_syscall_64+0x73/0x130
[10271.585121]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[10271.585123] RIP: 0033:0x7fd47634d0b4
[10271.585124] RSP: 002b:00007ffe9a186700 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[10271.585125] RAX: ffffffffffffffda RBX: 0000000000000021 RCX: 00007fd47634d0b4
[10271.585126] RDX: 0000000000000400 RSI: 000055f70a08c0c0 RDI: 0000000000000021
[10271.585127] RBP: 000055f70a08c0c0 R08: 0000000000000000 R09: 0000000000000000
[10271.585128] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[10271.585129] R13: 00007fd4766252a0 R14: 000055f70a0298e0 R15: 00000000000007ff
[10271.585221] INFO: task lxcfs:84510 blocked for more than 120 seconds.
[10271.591687]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.598531] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.606401] lxcfs           D    0 84510      1 0x00000000
[10271.606407] Call Trace:
[10271.606416]  __schedule+0x297/0x8b0
[10271.606418]  schedule+0x2c/0x80
[10271.606420]  schedule_preempt_disabled+0xe/0x10
[10271.606421]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.606423]  __mutex_lock_slowpath+0x13/0x20
[10271.606424]  ? __mutex_lock_slowpath+0x13/0x20
[10271.606425]  mutex_lock+0x2f/0x40
[10271.606427]  proc_cgroup_show+0x4c/0x2a0
[10271.606429]  proc_single_show+0x56/0x80
[10271.606432]  seq_read+0xe5/0x430
[10271.606434]  __vfs_read+0x1b/0x40
[10271.606436]  vfs_read+0x8e/0x130
[10271.606437]  SyS_read+0x55/0xc0
[10271.606440]  do_syscall_64+0x73/0x130
[10271.606441]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[10271.606443] RIP: 0033:0x7f1a50e700b4
[10271.606443] RSP: 002b:00007f1a2b7fd870 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[10271.606445] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f1a50e700b4
[10271.606446] RDX: 0000000000000400 RSI: 00007f1a04000f90 RDI: 0000000000000013
[10271.606446] RBP: 00007f1a04000f90 R08: 0000000000000001 R09: 0000000000000000
[10271.606447] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[10271.606448] R13: 00007f1a511482a0 R14: 0000000000000000 R15: 00007f1a38003a40
[10271.606450] INFO: task lxcfs:84676 blocked for more than 120 seconds.
[10271.612911]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.619722] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.627577] lxcfs           D    0 84676      1 0x00000000
[10271.627579] Call Trace:
[10271.627583]  __schedule+0x297/0x8b0
[10271.627589]  schedule+0x2c/0x80
[10271.627594]  schedule_preempt_disabled+0xe/0x10
[10271.627598]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.627600]  __mutex_lock_slowpath+0x13/0x20
[10271.627601]  ? __mutex_lock_slowpath+0x13/0x20
[10271.627602]  mutex_lock+0x2f/0x40
[10271.627604]  proc_cgroup_show+0x4c/0x2a0
[10271.627606]  proc_single_show+0x56/0x80
[10271.627608]  seq_read+0xe5/0x430
[10271.627610]  __vfs_read+0x1b/0x40
[10271.627611]  vfs_read+0x8e/0x130
[10271.627613]  SyS_read+0x55/0xc0
[10271.627615]  do_syscall_64+0x73/0x130
[10271.627617]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[10271.627618] RIP: 0033:0x7f1a50e700b4
[10271.627618] RSP: 002b:00007f1a297f9870 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[10271.627620] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f1a50e700b4
[10271.627620] RDX: 0000000000000400 RSI: 00007f1a2c001e80 RDI: 0000000000000014
[10271.627621] RBP: 00007f1a2c001e80 R08: 0000000000000001 R09: 0000000000000000
[10271.627622] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[10271.627622] R13: 00007f1a511482a0 R14: 0000000000000000 R15: 00007f1a2c002b10
[10271.627625] INFO: task lxcfs:115507 blocked for more than 120 seconds.
[10271.634180]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.641007] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.648870] lxcfs           D    0 115507      1 0x00000000
[10271.648872] Call Trace:
[10271.648876]  __schedule+0x297/0x8b0
[10271.648883]  schedule+0x2c/0x80
[10271.648887]  schedule_preempt_disabled+0xe/0x10
[10271.648891]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.648895]  __mutex_lock_slowpath+0x13/0x20
[10271.648896]  ? __mutex_lock_slowpath+0x13/0x20
[10271.648897]  mutex_lock+0x2f/0x40
[10271.648899]  proc_cgroup_show+0x4c/0x2a0
[10271.648902]  proc_single_show+0x56/0x80
[10271.648904]  seq_read+0xe5/0x430
[10271.648906]  __vfs_read+0x1b/0x40
[10271.648907]  vfs_read+0x8e/0x130
[10271.648909]  SyS_read+0x55/0xc0
[10271.648911]  do_syscall_64+0x73/0x130
[10271.648913]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[10271.648914] RIP: 0033:0x7f1a50e700b4
[10271.648915] RSP: 002b:00007f1a29ffa870 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[10271.648916] RAX: ffffffffffffffda RBX: 0000000000000015 RCX: 00007f1a50e700b4
[10271.648917] RDX: 0000000000000400 RSI: 00007f19f0001820 RDI: 0000000000000015
[10271.648918] RBP: 00007f19f0001820 R08: 0000000000000001 R09: 0000000000000000
[10271.648925] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[10271.648928] R13: 00007f1a511482a0 R14: 0000000000000000 R15: 00007f19f00024d0
[10271.648964] INFO: task kworker/14:1:111280 blocked for more than 120 seconds.
[10271.656131]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.662944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.670822] kworker/14:1    D    0 111280      2 0x80000000
[10271.670827] Workqueue: cgroup_destroy css_killed_work_fn
[10271.670828] Call Trace:
[10271.670832]  __schedule+0x297/0x8b0
[10271.670834]  schedule+0x2c/0x80
[10271.670836]  schedule_preempt_disabled+0xe/0x10
[10271.670837]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.670841]  ? wake_up_q+0x44/0x80
[10271.670842]  __mutex_lock_slowpath+0x13/0x20
[10271.670843]  ? __mutex_lock_slowpath+0x13/0x20
[10271.670844]  mutex_lock+0x2f/0x40
[10271.670845]  css_killed_work_fn+0x1f/0xe0
[10271.670848]  process_one_work+0x1de/0x410
[10271.670849]  worker_thread+0x32/0x410
[10271.670851]  kthread+0x121/0x140
[10271.670852]  ? process_one_work+0x410/0x410
[10271.670854]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.670856]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.670857]  ret_from_fork+0x22/0x40
[10271.670859] INFO: task kworker/29:0:111571 blocked for more than 120 seconds.
[10271.678018]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.684827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.692681] kworker/29:0    D    0 111571      2 0x80000000
[10271.692686] Workqueue: cgroup_destroy css_release_work_fn
[10271.692691] Call Trace:
[10271.692694]  __schedule+0x297/0x8b0
[10271.692697]  ? ttwu_do_wakeup+0x1e/0x150
[10271.692699]  schedule+0x2c/0x80
[10271.692700]  schedule_preempt_disabled+0xe/0x10
[10271.692702]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.692709]  ? wake_up_q+0x44/0x80
[10271.692714]  __mutex_lock_slowpath+0x13/0x20
[10271.692719]  ? __mutex_lock_slowpath+0x13/0x20
[10271.692723]  mutex_lock+0x2f/0x40
[10271.692728]  css_release_work_fn+0x2b/0x180
[10271.692733]  process_one_work+0x1de/0x410
[10271.692735]  worker_thread+0x32/0x410
[10271.692737]  kthread+0x121/0x140
[10271.692738]  ? process_one_work+0x410/0x410
[10271.692740]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.692742]  ? do_syscall_64+0x115/0x130
[10271.692743]  ret_from_fork+0x22/0x40
[10271.692752] INFO: task kworker/2:1:128568 blocked for more than 120 seconds.
[10271.699827]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.706653] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.714524] kworker/2:1     D    0 128568      2 0x80000000
[10271.714529] Workqueue: cgroup_destroy css_release_work_fn
[10271.714530] Call Trace:
[10271.714534]  __schedule+0x297/0x8b0
[10271.714536]  schedule+0x2c/0x80
[10271.714537]  schedule_preempt_disabled+0xe/0x10
[10271.714539]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.714540]  __mutex_lock_slowpath+0x13/0x20
[10271.714541]  ? __mutex_lock_slowpath+0x13/0x20
[10271.714543]  mutex_lock+0x2f/0x40
[10271.714544]  css_release_work_fn+0x2b/0x180
[10271.714545]  process_one_work+0x1de/0x410
[10271.714547]  worker_thread+0x32/0x410
[10271.714548]  kthread+0x121/0x140
[10271.714550]  ? process_one_work+0x410/0x410
[10271.714551]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.714552]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.714554]  ret_from_fork+0x22/0x40
[10271.714564] INFO: task kworker/43:4:46368 blocked for more than 120 seconds.
[10271.721637]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.728456] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.736309] kworker/43:4    D    0 46368      2 0x80000000
[10271.736314] Workqueue: cgroup_destroy css_killed_work_fn
[10271.736321] Call Trace:
[10271.736328]  __schedule+0x297/0x8b0
[10271.736330]  schedule+0x2c/0x80
[10271.736332]  schedule_preempt_disabled+0xe/0x10
[10271.736333]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.736336]  ? wake_up_q+0x44/0x80
[10271.736337]  __mutex_lock_slowpath+0x13/0x20
[10271.736338]  ? __mutex_lock_slowpath+0x13/0x20
[10271.736339]  mutex_lock+0x2f/0x40
[10271.736341]  css_killed_work_fn+0x1f/0xe0
[10271.736342]  process_one_work+0x1de/0x410
[10271.736344]  worker_thread+0x32/0x410
[10271.736346]  kthread+0x121/0x140
[10271.736352]  ? process_one_work+0x410/0x410
[10271.736357]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.736362]  ret_from_fork+0x22/0x40
[10271.736367] INFO: task kworker/13:3:49251 blocked for more than 120 seconds.
[10271.743439]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.750250] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.758106] kworker/13:3    D    0 49251      2 0x80000000
[10271.758110] Workqueue: cgroup_destroy css_killed_work_fn
[10271.758111] Call Trace:
[10271.758119]  __schedule+0x297/0x8b0
[10271.758122]  schedule+0x2c/0x80
[10271.758123]  schedule_preempt_disabled+0xe/0x10
[10271.758124]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.758127]  ? wake_up_q+0x44/0x80
[10271.758128]  __mutex_lock_slowpath+0x13/0x20
[10271.758129]  ? __mutex_lock_slowpath+0x13/0x20
[10271.758130]  mutex_lock+0x2f/0x40
[10271.758131]  css_killed_work_fn+0x1f/0xe0
[10271.758133]  process_one_work+0x1de/0x410
[10271.758134]  worker_thread+0x32/0x410
[10271.758136]  kthread+0x121/0x140
[10271.758143]  ? process_one_work+0x410/0x410
[10271.758148]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.758153]  ret_from_fork+0x22/0x40
[10271.758161] INFO: task kworker/38:1:21823 blocked for more than 120 seconds.
[10271.765236]       Tainted: P           OE    4.15.0-20-generic #21-Ubuntu
[10271.772047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[10271.779902] kworker/38:1    D    0 21823      2 0x80000000
[10271.779907] Workqueue: cgroup_destroy css_release_work_fn
[10271.779911] Call Trace:
[10271.779918]  __schedule+0x297/0x8b0
[10271.779920]  schedule+0x2c/0x80
[10271.779922]  schedule_preempt_disabled+0xe/0x10
[10271.779923]  __mutex_lock.isra.2+0x18c/0x4d0
[10271.779925]  ? wake_up_q+0x44/0x80
[10271.779926]  __mutex_lock_slowpath+0x13/0x20
[10271.779928]  ? __mutex_lock_slowpath+0x13/0x20
[10271.779929]  mutex_lock+0x2f/0x40
[10271.779930]  css_release_work_fn+0x2b/0x180
[10271.779935]  process_one_work+0x1de/0x410
[10271.779941]  worker_thread+0x32/0x410
[10271.779946]  kthread+0x121/0x140
[10271.779952]  ? process_one_work+0x410/0x410
[10271.779955]  ? kthread_create_worker_on_cpu+0x70/0x70
[10271.779959]  ret_from_fork+0x22/0x40
root@myserver:~#

behlendorf commented 6 years ago

@simos thanks for the additional testing and reporting your results. The hung tasks your now seeing don't appear to be related to ZFS based on the stack traces you posted. So the original issue does appear to be fixed and you've uncovered a new one, which looks to be related to cgroups.

simos commented 6 years ago

@behlendorf Thanks. I did a preliminary report at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1781601 and keep an eye on reproducing during more LXD stress-tests.

mailinglists35 commented 6 years ago

this is to remind myself to check if my issue is solved with the mentioned commit

fling- commented 6 years ago

System information

Type	Version/Name
Distribution Name	gentoo
Distribution Version	17.1
Linux Kernel	4.18.8
Architecture	amd64
ZFS Version	0.7.11
SPL Version	0.7.11

Issue is reproducible each time when starting an lxc container while /var/lib/lxc/rootfs is on a dedup enabled dataset. I had no issues with this setup on 0.7.6 and earlier versions

This could be related: lxc-3.0.2 is needed for lxd to run on 4.18

behlendorf commented 6 years ago

@fling- can you open a new issue with the offending stack traces so we can determine if this is a similar problem. It's not clear to me why dedup would matter here aside from the performance implications.

drescherjm commented 4 years ago

I got this recently: [852786.703564] INFO: task smbd:27186 blocked for more than 120 seconds. [852786.703567] Tainted: P O 4.14.157-gentoo-20191204-1553-datastore4 #1 [852786.703568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [852786.703570] smbd D 0 27186 21509 0x00000100 [852786.703573] Call Trace: [852786.703582] ? schedule+0x247/0x760 [852786.703584] schedule+0x23/0x70 [852786.703586] io_schedule+0xd/0x40 [852786.703590] wait_on_page_bit+0xd2/0x100 [852786.703593] ? add_to_page_cache_lru+0xb0/0xb0 [852786.703596] filemap_fdatawait_range+0xc5/0x100 [852786.703599] ? __filemap_fdatawrite_range+0x80/0xa0 [852786.703602] filemap_write_and_wait_range+0x36/0x60 [852786.703633] zpl_putpage+0x682/0xf10 [zfs] [852786.703637] do_fsync+0x33/0x70 [852786.703640] SyS_fdatasync+0xa/0x10 [852786.703644] do_syscall_64+0x5b/0xf0 [852786.703648] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [852786.703650] RIP: 0033:0x7f93776fd348 [852786.703651] RSP: 002b:00007fffc707e7d0 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [852786.703654] RAX: ffffffffffffffda RBX: 0000000000003000 RCX: 00007f93776fd348 [852786.703655] RDX: 000000000000805c RSI: 0000000000003000 RDI: 0000000000000019 [852786.703656] RBP: 000000000000805c R08: 0000000000002184 R09: 0000000000000000 [852786.703658] R10: 000000000000030f R11: 0000000000000246 R12: 00005631496fba50 [852786.703659] R13: 00007f93769b2d40 R14: 000056314975db78 R15: 00005631496fba50 [854752.783302] INFO: task smbd:8672 blocked for more than 120 seconds. [854752.783305] Tainted: P O 4.14.157-gentoo-20191204-1553-datastore4 #1 [854752.783305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [854752.783307] smbd D 0 8672 21509 0x00000100 [854752.783311] Call Trace: [854752.783321] ? schedule+0x247/0x760 [854752.783324] ? bpf_prog_run64+0x30/0x30 [854752.783326] schedule+0x23/0x70 [854752.783330] rwsem_down_write_failed+0x131/0x210 [854752.783335] call_rwsem_down_write_failed+0x13/0x20 [854752.783337] down_write+0x24/0x40 [854752.783342] do_unlinkat+0xff/0x2e0 [854752.783347] do_syscall_64+0x5b/0xf0 [854752.783350] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [854752.783353] RIP: 0033:0x7f93776f7747 [854752.783354] RSP: 002b:00007fffc707f7f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 [854752.783357] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007f93776f7747 [854752.783358] RDX: 0000000000000014 RSI: 00007fffc707f9d0 RDI: 00007fffc707fa12 [854752.783359] RBP: 00007fffc707fa10 R08: 0000000000000001 R09: 0000000000000014 [854752.783360] R10: 00007fffc707f9d0 R11: 0000000000000246 R12: 00007fffc707fa80 [854752.783361] R13: 00005631496cce50 R14: 00005631496dd950 R15: 0000000000000006 [854752.783364] INFO: task smbd:8817 blocked for more than 120 seconds.

This is on datastore4 log # zfs --version zfs-0.8.0-434_g5a0897737 zfs-kmod-0.8.0-434_g5a0897737

datastore4 log # uname -a Linux datastore4 4.14.157-gentoo-20191204-1553-datastore4 #1 SMP Wed Dec 4 15:57:28 EST 2019 x86_64 Intel(R) Xeon(R) CPU E31230 @ 3.20GHz GenuineIntel GNU/Linux

openzfs / zfs

Dedup + LXD leads to permanently hung tasks #7659

System information

When using ZFS-backed LXC/LXD, hung tasks on concurrent container creation

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

7598 and #7648 are also other fun with LXD and ZoL, so you're probably more than two.

System information