shuther opened this issue 1 year ago
When the pool is exported, try running zdb -bcc -e voluz
and tell us what it outputs. The command can take a significant amount of time to run, since it is basically doing a scrub in userspace.
In particular, I am interested in knowing if it detects any spacemap/metaslab corruption on your pool.
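For reference, a minimal sequence would look like this (a sketch, assuming the pool can be taken offline; voluz is the pool name from this issue):

# export the pool first so zdb can open it consistently on its own
zpool export voluz
# -b traverses all blocks to check that nothing leaked; doubling -c (-cc)
# verifies the checksums of data blocks as well as metadata; -e operates
# on an exported pool
zdb -bcc -e voluz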
I will launch it today. In the meantime, here are other kernel panics I identified over the last few weeks:
Apr 21 17:45:41 freenas processor eflags = interrupt enabled, resume, IOPL = 0
Apr 21 17:45:41 freenas current process = 0 (z_null_int)
Apr 21 17:45:41 freenas trap number = 9
Apr 21 17:45:41 freenas panic: general protection fault
Apr 21 17:45:41 freenas cpuid = 1
Apr 21 17:45:41 freenas time = 1682091688
Apr 21 17:45:41 freenas KDB: stack backtrace:
Apr 21 17:45:41 freenas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0124e7cb40
Apr 21 17:45:41 freenas vpanic() at vpanic+0x17f/frame 0xfffffe0124e7cb90
Apr 21 17:45:41 freenas panic() at panic+0x43/frame 0xfffffe0124e7cbf0
Apr 21 17:45:41 freenas trap_fatal() at trap_fatal+0x385/frame 0xfffffe0124e7cc50
Apr 21 17:45:41 freenas calltrap() at calltrap+0x8/frame 0xfffffe0124e7cc50
Apr 21 17:45:41 freenas --- trap 0x9, rip = 0xffffffff828b1a98, rsp = 0xfffffe0124e7cd20, rbp = 0xfffffe0124e7cd20 ---
Apr 21 17:45:41 freenas list_prev() at list_prev+0x8/frame 0xfffffe0124e7cd20
Apr 21 17:45:41 freenas l2arc_write_done() at l2arc_write_done+0x9b/frame 0xfffffe0124e7cd90
Apr 21 17:45:41 freenas zio_done() at zio_done+0x82a/frame 0xfffffe0124e7ce00
Apr 21 17:45:41 freenas zio_execute() at zio_execute+0x9f/frame 0xfffffe0124e7ce40
Apr 21 17:45:41 freenas taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe0124e7cec0
Apr 21 17:45:41 freenas taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe0124e7cef0
Apr 21 17:45:41 freenas fork_exit() at fork_exit+0x7e/frame 0xfffffe0124e7cf30
Apr 21 17:45:41 freenas fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0124e7cf30
Apr 21 17:45:41 freenas --- trap 0x80af7284, rip = 0, rsp = 0, rbp = 0 ---
Apr 21 17:45:41 freenas KDB: enter: panic
Apr 21 17:45:41 freenas ---<<BOOT>>---
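The trap in list_prev() is reached from l2arc_write_done(), so the L2ARC (cache) device may be involved. A possible experiment, only a suggestion and not something confirmed in this thread: drop the cache device for a while and see whether the panics stop. <cache-disk> below is a placeholder for whatever zpool status lists under "cache":

# identify the cache (L2ARC) device, if any
zpool status voluz
# temporarily remove it; it can be re-added later with: zpool add voluz cache <cache-disk>
zpool remove voluz <cache-disk>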
The first kernel panics started on April 10 with the VERIFY3 issue.
I am surprised by how slow it is; a full scrub used to take about 30 hours to complete. Is that expected?
root@freenas:~ # zdb -bcc -e voluz
Traversing all blocks to verify checksums and verify nothing leaked ...
loading concrete vdev 1, metaslab 30 of 31 .....
8.28G completed ( 12MB/s) estimated time remaining: 728hr 37min 10sec
EDITED: It seems just the start was slow and it is catching up. Also, is it expected that it reports 30/31 and not 31/31?
587G completed ( 99MB/s) estimated time remaining: 86hr 43min 05sec
Also, the boot process seems much slower than before (more than 5 minutes). I have a lot of messages in console.log; I am not sure if they are relevant. The timestamps in this log are confusing (not sure if that is a bug or not) because everything appears to have happened at 09:05:43:
Apr 25 09:05:43 freenas Beginning pools import
Apr 25 09:05:43 freenas Importing voluz
Apr 25 09:05:43 freenas spa_misc.c:419:spa_load_note(): spa_load(voluz, config trusted): read 9 log space maps (9 total blocks - blksz = 131072 bytes) in 39 ms
Apr 25 09:05:43 freenas mmp.c:241:mmp_thread_start(): MMP thread started pool 'voluz' gethrtime 74516342851
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 119, smp_length 3974440, unflushed_allocs 270336, unflushed_frees 258048, freed 0, defer 0 + 0, unloaded time 74516 ms, loading_time 152 ms, ms_max_size 35223552
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 155, smp_length 1193528, unflushed_allocs 147456, unflushed_frees 135168, freed 0, defer 0 + 0, unloaded time 74668 ms, loading_time 80 ms, ms_max_size 214401024
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 84, smp_length 2053088, unflushed_allocs 196608, unflushed_frees 172032, freed 0, defer 0 + 0, unloaded time 74749 ms, loading_time 98 ms, ms_max_size 180756480,
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 125, smp_length 2015368, unflushed_allocs 73728, unflushed_frees 73728, freed 0, defer 0 + 0, unloaded time 74848 ms, loading_time 89 ms, ms_max_size 252051456,
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 118, smp_length 2120864, unflushed_allocs 184320, unflushed_frees 184320, freed 0, defer 0 + 0, unloaded time 74938 ms, loading_time 89 ms, ms_max_size 193118208
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 167, smp_length 5121112, unflushed_allocs 577536, unflushed_frees 577536, freed 0, defer 0 + 0, unloaded time 75027 ms, loading_time 164 ms, ms_max_size 19095552
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 130, smp_length 4272288, unflushed_allocs 49152, unflushed_frees 73728, freed 0, defer 0 + 0, unloaded time 75192 ms, loading_time 154 ms, ms_max_size 153452544,
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 164, smp_length 1924536, unflushed_allocs 49152, unflushed_frees 86016, freed 0, defer 0 + 0, unloaded time 75346 ms, loading_time 81 ms, ms_max_size 174194688,
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341719, spa voluz, vdev_id 0, ms_id 82, smp_length 1642416, unflushed_allocs 946176, unflushed_frees 946176, freed 0, defer 0 + 0, unloaded time 75427 ms, loading_time 66 ms, ms_max_size 135413760,
Apr 25 09:05:43 freenas spa.c:8369:spa_async_request(): spa=voluz async request task=1
Apr 25 09:05:43 freenas spa.c:8369:spa_async_request(): spa=voluz async request task=2048
Apr 25 09:05:43 freenas spa_misc.c:419:spa_load_note(): spa_load(voluz, config trusted): LOADED
Apr 25 09:05:43 freenas spa.c:8369:spa_async_request(): spa=voluz async request task=4096
Apr 25 09:05:43 freenas spa_history.c:311:spa_history_log_sync(): txg 30341720 open pool version 5000; software version zfs-9ef0b67f8; uts 13.1-RELEASE-p7 1301000 amd64
Apr 25 09:05:43 freenas spa.c:8369:spa_async_request(): spa=voluz async request task=32
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341722, spa voluz, vdev_id 0, ms_id 174, smp_length 39000, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 75705 ms, loading_time 157 ms, ms_max_size 24310640640, max siz
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341722, spa voluz, vdev_id 0, ms_id 97, smp_length 2343400, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 75705 ms, loading_time 170 ms, ms_max_size 144162816, max size
Apr 25 09:05:43 freenas spa_history.c:311:spa_history_log_sync(): txg 30341723 import pool version 5000; software version zfs-9ef0b67f8; uts 13.1-RELEASE-p7 1301000 amd64
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341723, spa voluz, vdev_id 0, ms_id 112, smp_length 2847736, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 75876 ms, loading_time 147 ms, ms_max_size 145428480, max siz
Apr 25 09:05:43 freenas metaslab.c:2457:metaslab_load_impl(): metaslab_load: txg 30341723, spa voluz, vdev_id 0, ms_id 122, smp_length 4298648, unflushed_allocs 0, unflushed_frees 0, freed 0, defer 0 + 0, unloaded time 76024 ms, loading_time 198 ms, ms_max_size 151326720, max siz
Apr 25 09:05:43 freenas dsl_scan.c:3705:dsl_scan_sync(): doing scan sync txg 30341724; bm=44946/8489/0/0
...
Apr 25 09:05:43 freenas dsl_scan.c:2464:dsl_scan_visitds(): scanned dataset 44946 (voluz/media/movies@auto-2023-04-23-00-00) with min=1 max=30333134; suspending=1
Apr 25 09:05:43 freenas dsl_scan.c:3739:dsl_scan_sync(): scan visited 21 blocks in 6869ms (1 os's, 6 holes, 0 < mintxg, 0 in ddt, 0 > maxtxg)
Apr 25 09:05:43 freenas dsl_scan.c:3705:dsl_scan_sync(): doing scan sync txg 30341725; bm=44946/8489/0/1
...
Apr 25 09:05:43 freenas dsl_scan.c:3739:dsl_scan_sync(): scan visited 1373063 blocks in 6273ms (1 os's, 9709 holes, 0 < mintxg, 0 in ddt, 0 > maxtxg)
Apr 25 09:05:43 freenas dsl_scan.c:3663:dsl_scan_sync(): begin scan clearing
Apr 25 09:05:43 freenas dsl_scan.c:3772:dsl_scan_sync(): scan issued 9733 blocks (1 segs) in 1015ms (avg_block_size = 122311, avg_seg_size = 1189883904)
Apr 25 09:05:43 freenas dsl_scan.c:3772:dsl_scan_sync(): scan issued 9163 blocks (1 segs) in 1019ms (avg_block_size = 122295, avg_seg_size = 1121193984)
...
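An aside on the confusing timestamps: these spa_load/metaslab_load/dsl_scan lines are ZFS's internal debug messages, which live in a kernel ring buffer and were presumably dumped into console.log in one batch, so the syslog prefix shows when they were flushed, not when they happened. Note also the dsl_scan_sync/dsl_scan_visitds lines: a scan appears to be resuming at import, which could account for part of the slow boot. On FreeBSD/TrueNAS CORE the buffer can be read live with:

# dump the ZFS internal debug message ring buffer (FreeBSD / TrueNAS CORE)
sysctl -n kstat.zfs.misc.dbgmsg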
@shuther Just to be sure, do you have ECC RAM? Or, if it is desktop hardware, have you at least run a good, long memory test recently? Random unrelated panics can point to memory corruption, which can come from the hardware.
No, I don't. I ran memtest recently and no issues were found, but maybe it was not running long enough (just 40 min)? While it sounds random, it happens nearly once a day, always with the same error.
So far, I have received one error (the run is nearly complete); would you know how to get it fixed (should a scrub be enough)? Is there a way to identify the file(s)/snapshots that could be impacted?
9.52T completed ( 95MB/s) estimated time remaining: 63hr 03min 48sec
zdb_blkptr_cb: Got error 97 reading <20789, 1, 0, fb36> -- skipping
29.2T completed ( 85MB/s) estimated time remaining: 3hr 22min 21sec
Finally done, only one error:
Error counts:
errno count
97 1
No leaks (block sum matches space maps exactly)
bp count: 362862762
ganged count: 0
bp logical: 19842334715904 avg: 54682
bp physical: 17439185767936 avg: 48060 compression: 1.14
bp allocated: 33161980162048 avg: 91389 compression: 0.60
bp deduped: 0 ref>1: 0 deduplication: 1.00
Normal class: 33161965473792 used: 67.02%
Embedded log class 10739712 used: 0.00%
additional, non-pointer bps of type 0: 5353617
Dittoed blocks on same vdev: 8923848
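On interpreting that one error: errno 97 on FreeBSD is EINTEGRITY, i.e. a checksum failure, and the tuple in the zdb message, <20789, 1, 0, fb36>, reads as <objset, object, level, blkid>. Assuming your OpenZFS build ships the zfs_ids_to_path helper (it is part of OpenZFS 2.x, but I have not verified it on TrueNAS CORE), the objset/object pair can be mapped back to a dataset and path once the pool is imported:

# map <objset-id, object-id> from the zdb error back to a file path
# (20789 and 1 come from the error tuple <20789, 1, 0, fb36>)
zfs_ids_to_path -v voluz 20789 1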
System information
Describe the problem you're observing
TrueNAS is rebooting with no specific action on my side (no send/receive); it seems random. I have no encrypted pools.
Describe how to reproduce the problem
Seems random.
Include any warning/errors/backtraces from the system logs
I have one error in zpool status and I will go through a scrub.
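Once the scrub completes, the affected files or snapshots (when ZFS can resolve them to paths) are listed under "Permanent errors" in the verbose status output:

# -v lists the files/snapshots affected by permanent errors
zpool status -v voluz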