Pool Import/Export extremely slow

mattventura commented 3 years ago

System information

I have tried this with both Debian 11 and a FreeBSD 13 LiveCD. Debian is installed on the system, while the FreeBSD LiveCD is an attempt at fixing it.

Type	Version/Name
Distribution Name	Debian
Distribution Version	11
Kernel Version	5.14
Architecture	AMD64
OpenZFS Version	zfs-2.0.6-1
---	---
Distribution Name	FreeBSD
Distribution Version	13-RELEASE
Kernel Version	1300139
Architecture	AMD64
OpenZFS Version	2.0.0-FreeBSD_gf11b09dec

Describe the problem you're observing

When I zpool import tank3 (no -F or other recovery options specified), it takes about an hour. At first, there is a reasonable amount of disk activity (~100MB/s per drive). After a while, it drops to very low numbers (75-200kb/s), though there is still some activity: iostat

Judging by the suspiciously low queue depths and low b%, perhaps it's trying to do one operation at a time, synchronously?

Sometimes the speed goes up (note the normal-looking queue lengths and high b%), but falls back down:

iostat2

Finally, after about an hour, I see write activity for the drive, and the import finishes:

iostat3

Then, I try to zpool export tank3, thinking that's the end of it, and we see the same thing - qlen of 1 on a single drive, 0 on the others, and <200KB/s:

iostat4

Eventually gets past that point and goes back to normal IO rates, but then tells me that it can't export because the pool is busy.

The pool is 2x4TB mirror + 2x8TB mirror. There were also some cache drives, but I removed those to try to isolate the problem. As mentioned, I tried it both with the Debian that is actually installed on the server, as well as a FreeBSD live CD to isolate software issues. I also tried switching from the LSI SAS card I was using to onboard SATA, same results. I tried importing with only one drive from each mirror present, but I got the same results, thus ruling out a single faulty drive. Also, if I dd from the drives while the import is still going, it transfers at a decent rate (~40MB/s), so it doesn't seem to be a drive issue.

If I do zpool import readonly=on tank3, it imports the pool after just a few minutes. Still longer than a normal pool would take, but at least the data is recoverable.

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

Unsure of how I got to this point - I had an unclean shutdown due to a power outage, however, even before that, I noticed the pool getting a little slow.

On FreeBSD, there is nothing suspect in dmesg nor /var/log/messages. On Linux, the behavior is mostly the same, though I also get "hung task" warnings in dmesg.

rincebrain commented 3 years ago

Knowing where and which task is hung could be informative.

It only taking a few minutes on readonly import is informative, but not exhaustively.

What does cat /proc/spl/kstat/zfs/dbgmsg (on Linux) or sysctl kstat.zfs.misc.dbgmsg (on FBSD) say while it's hanging?

mattventura commented 3 years ago

Took this when it got stuck. sdn3, sdl3, and sdm3 are the missing cache devices.

kstat.zfs.misc.dbgmsg:
timestamp    message
1635283058   spa.c:6146:spa_tryimport(): spa_tryimport: importing tank3
1635283058   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): LOADING
1635283058   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/gptid/06c3d510-881d-0d49-a20c-6ffd870190e6': best uberblock found for spa $import. txg 9831233
1635283058   spa_misc.c:411:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=9831233
1635283059   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdn3': vdev_geom_open: failed to open [error=2]
1635283059   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdl3': vdev_geom_open: failed to open [error=2]
1635283059   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdm3': vdev_geom_open: failed to open [error=2]
1635283060   spa.c:8246:spa_async_request(): spa=$import async request task=2048
1635283060   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): LOADED
1635283060   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): UNLOADING
1635283315   spa.c:6146:spa_tryimport(): spa_tryimport: importing tank3
1635283315   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): LOADING
1635283315   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/gptid/06c3d510-881d-0d49-a20c-6ffd870190e6': best uberblock found for spa $import. txg 9831233
1635283315   spa_misc.c:411:spa_load_note(): spa_load($import, config untrusted): using uberblock with txg=9831233
1635283315   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdn3': vdev_geom_open: failed to open [error=2]
1635283316   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdl3': vdev_geom_open: failed to open [error=2]
1635283316   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdm3': vdev_geom_open: failed to open [error=2]
1635283317   spa.c:8246:spa_async_request(): spa=$import async request task=2048
1635283317   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): LOADED
1635283317   spa_misc.c:411:spa_load_note(): spa_load($import, config trusted): UNLOADING
1635283317   spa.c:5998:spa_import(): spa_import: importing tank3
1635283317   spa_misc.c:411:spa_load_note(): spa_load(tank3, config trusted): LOADING
1635283317   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/gptid/06c3d510-881d-0d49-a20c-6ffd870190e6': best uberblock found for spa tank3. txg 9831233
1635283317   spa_misc.c:411:spa_load_note(): spa_load(tank3, config untrusted): using uberblock with txg=9831233
1635283318   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdn3': vdev_geom_open: failed to open [error=2]
1635283318   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdl3': vdev_geom_open: failed to open [error=2]
1635283318   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdm3': vdev_geom_open: failed to open [error=2]
1635283324   spa_misc.c:411:spa_load_note(): spa_load(tank3, config trusted): read 19 log space maps (752 total blocks - blksz = 131072 bytes) in 4379 ms
1635283383   mmp.c:241:mmp_thread_start(): MMP thread started pool 'tank3' gethrtime 382466674898
1635283383   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 72, smp_length 180680, unflushed_allocs 148967424, unflushed_frees 185036800, freed 0, defer 0 + 0, unloaded time 382466 ms, loading_time 58 ms, ms_max_size 130080768, max size error 126672896, old_weight 680000000000009, new_weight 680000000000009
1635283383   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 107, smp_length 2094784, unflushed_allocs 1340043264, unflushed_frees 462532608, freed 0, defer 0 + 0, unloaded time 382525 ms, loading_time 213 ms, ms_max_size 1454874624, max size error 1449185280, old_weight 780000000000001, new_weight 780000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 128, smp_length 1665536, unflushed_allocs 1850593280, unflushed_frees 1198854144, freed 0, defer 0 + 0, unloaded time 382739 ms, loading_time 282 ms, ms_max_size 1129746432, max size error 1129664512, old_weight 780000000000001, new_weight 780000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 208, smp_length 480648, unflushed_allocs 1892352, unflushed_frees 235479040, freed 0, defer 0 + 0, unloaded time 383022 ms, loading_time 127 ms, ms_max_size 115335168, max size error 112971776, old_weight 680000000000004, new_weight 680000000000004
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 130, smp_length 668360, unflushed_allocs 866045952, unflushed_frees 336076800, freed 0, defer 0 + 0, unloaded time 383150 ms, loading_time 92 ms, ms_max_size 1216274432, max size error 1214455808, old_weight 780000000000001, new_weight 780000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 123, smp_length 1886648, unflushed_allocs 5853184, unflushed_frees 124325888, freed 0, defer 0 + 0, unloaded time 383243 ms, loading_time 158 ms, ms_max_size 626511872, max size error 626356224, old_weight 740000000000001, new_weight 740000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 46, smp_length 971416, unflushed_allocs 19730432, unflushed_frees 304754688, freed 0, defer 0 + 0, unloaded time 383401 ms, loading_time 145 ms, ms_max_size 90030080, max size error 89354240, old_weight 680000000000001, new_weight 680000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 375, smp_length 386632, unflushed_allocs 12730368, unflushed_frees 20074496, freed 0, defer 0 + 0, unloaded time 383547 ms, loading_time 37 ms, ms_max_size 561119232, max size error 559644672, old_weight 740000000000001, new_weight 740000000000001
1635283384   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 93, smp_length 1420736, unflushed_allocs 2346090496, unflushed_frees 564330496, freed 0, defer 0 + 0, unloaded time 383585 ms, loading_time 233 ms, ms_max_size 72355840, max size error 72183808, old_weight 680000000000001, new_weight 680000000000001
1635283384   metaslab.c:3606:metaslab_condense(): condensing: txg 9831234, msp[123] 0xfffff80cc1fa7000x, vdev id 0, spa tank3, smp size 1886648, segments 109518, forcing condense=FALSE
1635283385   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 186, smp_length 808360, unflushed_allocs 1076891648, unflushed_frees 436248576, freed 20480, defer 0 + 0, unloaded time 383824 ms, loading_time 201 ms, ms_max_size 106373120, max size error 105832448, old_weight 680000000000001, new_weight 680000000000001
1635283385   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 0, ms_id 82, smp_length 359280, unflushed_allocs 153624576, unflushed_frees 152027136, freed 4096, defer 0 + 0, unloaded time 384025 ms, loading_time 56 ms, ms_max_size 399994880, max size error 350654464, old_weight 700000000000003, new_weight 700000000000003
1635283385   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 219, smp_length 637960, unflushed_allocs 296792064, unflushed_frees 1563930624, freed 0, defer 0 + 0, unloaded time 384082 ms, loading_time 83 ms, ms_max_size 91815936, max size error 56557568, old_weight 680000000000001, new_weight 680000000000001
1635283385   metaslab.c:2434:metaslab_load_impl(): metaslab_load: txg 9831234, spa tank3, vdev_id 1, ms_id 231, smp_length 2241192, unflushed_allocs 151257088, unflushed_frees 891867136, freed 0, defer 0 + 0, unloaded time 384165 ms, loading_time 358 ms, ms_max_size 101302272, max size error 98713600, old_weight 680000000000001, new_weight 680000000000001
1635283385   spa.c:8246:spa_async_request(): spa=tank3 async request task=1
1635283385   spa.c:8246:spa_async_request(): spa=tank3 async request task=2048
1635283385   spa_misc.c:411:spa_load_note(): spa_load(tank3, config trusted): LOADED
1635283385   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdn3': vdev_geom_open: failed to open [error=2]
1635283385   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdl3': vdev_geom_open: failed to open [error=2]
1635283385   vdev.c:131:vdev_dbgmsg(): disk vdev '/dev/sdm3': vdev_geom_open: failed to open [error=2]
1635283385   spa_history.c:309:spa_history_log_sync(): txg 9831235 open pool version 5000; software version zfs-0.7.0-3175-g184df27ee; uts  13.0-RELEASE 1300139 amd64
1635283385   bptree.c:232:bptree_iterate(): bptree index 0: traversing from min_txg=1 bookmark -1/1/0/5382599
1635283411   dsl_scan.c:3423:dsl_process_async_destroys(): freed 19020 blocks in 26169ms from free_bpobj/bptree txg 9831235; err=4294967295

mattventura commented 3 years ago

Interesting, after it successfully imported, a simple zpool status also hanged, doing the same ~200KB/s reading. The 200KB/s seems to persist even after it finishes. Even simple commands like zpool iostat -r tank3 take a while.

Here is the complete log:

in2.txt

Dmesg showing the stack traces for the hung tasks:

dmesg_stuff.txt

One more thing: I tried kicking off a manual zpool scrub, but it's stuck at 0 bytes - does it not scrub if there's stuff to free?

rincebrain commented 3 years ago

it may take a bit, but does zpool events -v say anything interesting?

It seems like it's trying to process a whole lot of async_destroy work that was queued up, and your pool isn't handling the load well. You could make it stop making any progress by twiddling the zfs_free_min_time_ms parameter smaller or to zero, but at 0, it's not guaranteed to make progress over time...

mattventura commented 3 years ago

I've attached my zpool events.

It does look like it might be 'freeing'-related, since zpool get freeing showed over 200GB which has now worked its way down to 44GB. zp_events.txt

rincebrain commented 3 years ago

I mean, yeah, it was logging things about freeing things in dbgmsg.

mattventura commented 3 years ago

Okay, the pool seems to have recovered after chewing through the frees.

The only thing I can think of that would have rapidly generated a bunch of space that needed to be freed would have been when I was experimenting with some VM scripts and thus was quickly creating and destroying lots of non-sparse 100GB zvols. However, only ~5GB would have actually been used by them.

mattventura commented 3 years ago

Ouch, the system that this one replicates to also seems to be having the issue now. Same symptoms, though queue size seems to at least be 10 this time rather than 1 (it is a single disk).

I decided to try the suggestion mentioned in #12693 of setting zfs_free_min_time_ms to zero - this doesn't seem to have done anything.

rincebrain commented 3 years ago

That's curious - it definitely helped pretty immediately for me, but I also already had the pool imported, and was blocking on commands when trying to initially issue the destroys.

That said, there are a number of things a pool can think it has to block and do synchronously on import - a block device flamegraph might be informative?

What does dbgmsg say on the affected system?

mattventura commented 3 years ago

Sadly, I already rebooted it so I don't have the dbgmsg (same for zpool events), but I did see the same "hung task" warnings in dmesg.

rincebrain commented 3 years ago

Well, if it does the same thing on import each time, it's simple enough to recreate, I'd imagine, when you're okay with doing so.

If it doesn't, that just raises further questions.

I'd be curious to know what zfs get -r -t filesystem,volume dedup,compression [pool] | grep -v default, zpool get freeing [pool] and zpool status -D [pool] said on both the old and new pool, though obviously the new pool isn't going to fly unless you get it imported again...

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

behlendorf commented 2 years ago

The zfs-2.1.6 release includes some significant work to improve pool import times (PR #12789). This may resolve the problem you were seeing.

openzfs / zfs