kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

zoned device goes read only with lots of writes #779

Open Motophan opened 2 months ago

Motophan commented 2 months ago

If I add a lot of large files to a zoned host-managed (HM) SMR HDD, the drive will eventually go read-only. If I throttle the writes considerably (to around 20MB/s), it won't go read-only, or it will take much longer to do so.

Sometimes lsblk shows the drive as 0 bytes; sometimes it shows the full device capacity.

Sometimes I see "lowering kernel perf" in dmesg before it fails.

I have 32 of these drives and every single one of them does this, regardless of cables or computer. I have tried on Arch, which should have the latest btrfs-progs, and on Ubuntu LTS.

These are ST14000NM0007.

I think the drive is too busy updating data to do the zoned part and then fails, but I don't know.

What do you need from me to get enough data to fix it? I can get them to go read-only after maybe 200GB-2TB of data written at full speed.

adam900710 commented 2 months ago

dmesg please.

kdave commented 2 months ago

What kernel version do you use? If the switch to read-only does not happen at lower IO rates, then it's most likely because zone reclaim is able to keep up with the IO at that rate but not otherwise. This is still subject to tuning: there are thresholds for automatic cleaning and a tunable setting for the amount of zone-unusable space that triggers it.

The dmesg would tell us more in case there's e.g. a transaction abort or similar unexpected case. I think ENOSPC caused by zones running out of usable space should not turn the filesystem read-only, so this may be a real bug. CC @morbidrsa @naota

Motophan commented 2 months ago

What I did:

  1. Start rclone move /mnt/hdd10 /mnt/hdd01 --transfers 1 --checkers 1 --bwlimit 100M -vP (Why? This limits the speed to 100M and uses only one thread for moving.)
  2. Wait ~2 hours.
  3. Profit (it went RO).

smartctl -A reports no pending or reallocated sectors. hdd10 is a standard (not zoned) XFS drive; hdd01 is a btrfs zoned HM SMR drive.

[42826.019350] kworker/u65:8: vmalloc error: size 0, failed to allocate pages, mode:0x10dc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NORETRY|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[42826.019384] CPU: 2 PID: 76227 Comm: kworker/u65:8 Tainted: P           OE      6.8.7-arch1-1 #1 cb8440eaa48704794690ea311c777c18c4e95af9
[42826.019389] Hardware name: ASUS System Product Name/ROG CROSSHAIR VIII HERO (WI-FI), BIOS 4702 10/20/2023
[42826.019394] Workqueue: writeback wb_workfn (flush-btrfs-5)
[42826.019407] Call Trace:
[42826.019411]  <TASK>
[42826.019417]  dump_stack_lvl+0x64/0x80
[42826.019426]  warn_alloc+0x165/0x1e0
[42826.019433]  ? srso_alias_return_thunk+0x5/0xfbef5
[42826.019439]  ? alloc_pages_mpol+0x95/0x1f0
[42826.019444]  __vmalloc_node_range+0x850/0x8f0
[42826.019452]  ? sd_zbc_report_zones+0xe3/0x390
[42826.019462]  ? __pfx_copy_zone_info_cb+0x10/0x10 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019558]  __vmalloc_node+0x4e/0x70
[42826.019562]  ? sd_zbc_report_zones+0xe3/0x390
[42826.019564]  sd_zbc_report_zones+0xe3/0x390
[42826.019567]  ? srso_alias_return_thunk+0x5/0xfbef5
[42826.019571]  btrfs_get_dev_zones+0xb2/0x1f0 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019599]  btrfs_load_block_group_zone_info+0x248/0xe70 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019626]  ? kmalloc_trace+0x145/0x340
[42826.019631]  btrfs_make_block_group+0xdf/0x2b0 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019661]  btrfs_create_chunk+0x861/0xe10 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019697]  btrfs_chunk_alloc+0x165/0x620 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019725]  find_free_extent+0xc6a/0x15e0 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019756]  btrfs_reserve_extent+0x15d/0x290 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019787]  cow_file_range+0x1d2/0x770 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019820]  run_delalloc_cow+0x94/0xd0 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019849]  btrfs_run_delalloc_range+0x131/0x420 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019875]  ? srso_alias_return_thunk+0x5/0xfbef5
[42826.019877]  ? find_lock_delalloc_range+0x1f0/0x220 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019909]  writepage_delalloc+0xaa/0x140 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019938]  extent_write_cache_pages+0x29c/0x810 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019970]  extent_writepages+0x89/0x130 [btrfs dabdb469baf711f62beb7c5fc8b37adbfc70e1b6]
[42826.019997]  do_writepages+0xcf/0x1e0
[42826.020002]  ? srso_alias_return_thunk+0x5/0xfbef5
[42826.020004]  ? __slab_free+0xdf/0x320
[42826.020008]  __writeback_single_inode+0x3d/0x360
[42826.020014]  writeback_sb_inodes+0x1ed/0x4b0
[42826.020019]  __writeback_inodes_wb+0x4c/0xf0
[42826.020021]  ? srso_alias_return_thunk+0x5/0xfbef5
[42826.020023]  wb_writeback+0x298/0x310
[42826.020027]  wb_workfn+0x2c2/0x510
[42826.020030]  ? __schedule+0x3ee/0x1520
[42826.020035]  process_one_work+0x17b/0x350
[42826.020041]  worker_thread+0x30f/0x450
[42826.020045]  ? __pfx_worker_thread+0x10/0x10
[42826.020046]  kthread+0xe8/0x120
[42826.020051]  ? __pfx_kthread+0x10/0x10
[42826.020054]  ret_from_fork+0x34/0x50
[42826.020059]  ? __pfx_kthread+0x10/0x10
[42826.020061]  ret_from_fork_asm+0x1b/0x30
[42826.020068]  </TASK>
[42826.020070] Mem-Info:
[42826.020072] active_anon:18404421 inactive_anon:10635888 isolated_anon:0
                active_file:23021 inactive_file:1451049 isolated_file:0
                unevictable:770 dirty:239882 writeback:32653
                slab_reclaimable:711259 slab_unreclaimable:1038366
                mapped:27789795 shmem:28902842 pagetables:59393
                sec_pagetables:0 bounce:0
                kernel_misc_reclaimable:0
                free:227148 free_pcp:0 free_cma:0
[42826.020077] Node 0 active_anon:73617684kB inactive_anon:42543552kB active_file:92084kB inactive_file:5804196kB unevictable:3080kB isolated(anon):0kB isolated(file):0kB mapped:111159180kB dirty:959528kB writeback:130612kB shmem:115611368kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:151552kB writeback_tmp:0kB kernel_stack:11584kB pagetables:237572kB sec_pagetables:0kB all_unreclaimable? no
[42826.020081] Node 0 DMA free:11260kB boost:0kB min:4kB low:16kB high:28kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[42826.020086] lowmem_reserve[]: 0 3149 128679 128679 128679
[42826.020090] Node 0 DMA32 free:503764kB boost:0kB min:1652kB low:4876kB high:8100kB reserved_highatomic:0KB active_anon:1669164kB inactive_anon:549724kB active_file:668kB inactive_file:396572kB unevictable:0kB writepending:4kB present:3310032kB managed:3242452kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[42826.020094] lowmem_reserve[]: 0 0 125530 125530 125530
[42826.020097] Node 0 Normal free:393568kB boost:484500kB min:550420kB low:678960kB high:807500kB reserved_highatomic:6144KB active_anon:71948520kB inactive_anon:41993828kB active_file:91416kB inactive_file:5407624kB unevictable:3080kB writepending:1090136kB present:130796544kB managed:128550024kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[42826.020101] lowmem_reserve[]: 0 0 0 0 0
[42826.020105] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 1*2048kB (M) 2*4096kB (M) = 11260kB
[42826.020119] Node 0 DMA32: 336*4kB (UME) 253*8kB (UME) 402*16kB (UME) 344*32kB (UME) 219*64kB (UME) 150*128kB (UME) 113*256kB (UME) 78*512kB (UME) 50*1024kB (UE) 7*2048kB (UM) 77*4096kB (UM) = 503816kB
[42826.020134] Node 0 Normal: 1419*4kB (UME) 1292*8kB (UME) 5205*16kB (UME) 5828*32kB (UME) 1582*64kB (ME) 62*128kB (ME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 394972kB
[42826.020146] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[42826.020148] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[42826.020149] 30377754 total pagecache pages
[42826.020150] 507 pages in swap cache
[42826.020151] Free swap  = 2543308kB
[42826.020152] Total swap = 4194300kB
[42826.020152] 33530643 pages RAM
[42826.020153] 0 pages HighMem/MovableOnly
[42826.020154] 578684 pages reserved
[42826.020155] 0 pages cma reserved
[42826.020155] 0 pages hwpoisoned
[42826.020167] BTRFS error (device sda): zoned: failed to read zone 3998346117120 on /dev/sda (devid 1)
[42826.020339] BTRFS error (device sda: state A): Transaction aborted (error -12)
[42826.020410] BTRFS: error (device sda: state A) in find_free_extent_update_loop:4208: errno=-12 Out of memory
[42826.020488] BTRFS info (device sda: state EA): forced readonly
[46686.681836] systemd-journald[504]: Under memory pressure, flushing caches.
[46689.024553] systemd-journald[504]: Under memory pressure, flushing caches.
[49625.958746] systemd-journald[504]: Under memory pressure, flushing caches.

I see an OOM issue. If I am understanding this correctly, am I running out of system RAM? Should I redo the test with a swap file of around 100GB? I have a 2GB zram0 on Arch but no swap file, and 128GB of RAM.
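As a rough aid for reading the Mem-Info dump above, the page counts can be converted to sizes. This is only a sketch: it assumes 4 KiB pages (which is consistent with the kB figures the kernel prints for Node 0), and the helper names are made up for illustration.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, as on typical x86_64 kernels


def pages_to_kib(pages: int) -> int:
    """Convert a page count from Mem-Info into KiB."""
    return pages * PAGE_KIB


def kib_to_gib(kib: int) -> float:
    """Convert KiB into GiB for easier reading."""
    return kib / (1024 * 1024)


# Page counts copied from the Mem-Info section of the dmesg above.
mem_info_pages = {
    "ram": 33530643,    # "33530643 pages RAM"
    "shmem": 28902842,  # "shmem:28902842"
    "free": 227148,     # "free:227148"
}

for name, pages in mem_info_pages.items():
    kib = pages_to_kib(pages)
    print(f"{name:>6}: {kib} kB (~{kib_to_gib(kib):.1f} GiB)")
```

The shmem result agrees with the shmem:115611368kB value reported for Node 0, so at the time of the failed allocation most of the 128GB appears to have been accounted as shared memory rather than free.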

morbidrsa commented 2 months ago

vmalloc of size 0 sounds suspiciously broken. I’ll look into it

morbidrsa commented 2 months ago

So this gets interesting:

 btrfs_get_dev_zone()
    unsigned int nr_zones = 1;
      `->  btrfs_get_dev_zones(/*...*/, &nr_zones);
            `-> blkdev_report_zones(/* ... */, *nr_zones, /* ... */);
               `-> disk->fops->report_zones(/* ... */, nr_zones, /* ... */);
                   `-> sd_zbc_report_zones(/* ... */, nr_zones, /* ... */);
                       `-> sd_zbc_alloc_report_buffer(/* ... */, nr_zones, /* ... */);
                               nr_zones = min(nr_zones, sdkp->zone_info.nr_zones);          // nr_zones = 1
                               bufsize = roundup((nr_zones + 1) * 64, SECTOR_SIZE);         // bufsize = 512
                               bufsize = min_t(size_t, bufsize, queue_max_hw_sectors(q) << SECTOR_SHIFT); // 512
                               bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);        // 512
                               while (bufsize >= SECTOR_SIZE) {
                                  buf = __vmalloc(bufsize, GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);
                                  if (buf) {
                                      *buflen = bufsize;
                                      return buf;
                                  }
                                  bufsize = rounddown(bufsize >> 1, SECTOR_SIZE);
                              }
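
The buffer-size arithmetic in the trace above can be replayed outside the kernel. The following is only a Python transcription of those lines, not kernel code: the queue limits passed in (2048 sectors, 168 segments) are hypothetical placeholders, and the function names are invented for this sketch. The zone count 52156 is taken from the mount messages later in this thread.

```python
SECTOR_SIZE = 512
SECTOR_SHIFT = 9
PAGE_SHIFT = 12


def roundup(x: int, y: int) -> int:
    """Round x up to the next multiple of y."""
    return ((x + y - 1) // y) * y


def rounddown(x: int, y: int) -> int:
    """Round x down to a multiple of y."""
    return (x // y) * y


def report_buffer_size(nr_zones, dev_nr_zones, max_hw_sectors, max_segments):
    """Mirror the bufsize computation quoted from sd_zbc_alloc_report_buffer()."""
    nr_zones = min(nr_zones, dev_nr_zones)
    bufsize = roundup((nr_zones + 1) * 64, SECTOR_SIZE)
    bufsize = min(bufsize, max_hw_sectors << SECTOR_SHIFT)
    bufsize = min(bufsize, max_segments << PAGE_SHIFT)
    return bufsize


def allocation_attempts(bufsize):
    """Sizes the retry loop would try, assuming every allocation fails."""
    sizes = []
    while bufsize >= SECTOR_SIZE:
        sizes.append(bufsize)
        bufsize = rounddown(bufsize >> 1, SECTOR_SIZE)
    return sizes


# Hypothetical queue limits; the real values are device-specific.
size = report_buffer_size(nr_zones=1, dev_nr_zones=52156,
                          max_hw_sectors=2048, max_segments=168)
print(size, allocation_attempts(size))
```

With one zone requested and non-zero limits, the computed size is 512 and the loop makes a single attempt. If either queue limit were 0, the computed size would be 0 and the loop body would never run; whether that is what happens on this system is not established here.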

morbidrsa commented 2 months ago

@Motophan can you run the following bcc python script while re-running the test and collect dmesg afterwards?

#!/usr/bin/env python3
from bcc import BPF
from time import sleep

bpf_program = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

int trace_nr_zones(struct pt_regs *ctx,
                struct gendisk *disk, sector_t sector,
                unsigned int nr_zones, report_zones_cb cb, void *data)
{
    struct request_queue *q = disk->queue;
    struct queue_limits limits = q->limits;
    unsigned int max_hw_sectors;
    unsigned short max_segments;

    max_hw_sectors = limits.max_hw_sectors;
    max_segments = limits.max_segments;
    bpf_trace_printk("nr_zones=%u, max_hw_sectors=%u, max_segments=%u\\n", nr_zones,
                     max_hw_sectors << 9, max_segments << 12);

    return 0;
}
"""

b = BPF(text=bpf_program)
b.attach_kprobe(event="sd_zbc_report_zones", fn_name="trace_nr_zones")

print("Tracing... Hit Ctrl-C to end")

try:
    sleep(999999)
except KeyboardInterrupt:
    print()

Motophan commented 2 months ago

I am running your program now, it takes a few hours to get it to go RO.

What you did not ask for, but I am providing because I think it has value (ignore my USB device throwing kernel errors):

  1. Added a swap file of 100GB
  2. Blasted some data at 8 drives
  3. A few hours later, 2 went down with different errors

dmesg_output.txt

Motophan commented 2 months ago

[58032.314407] BTRFS info (device sdg: state EA): last unmount of filesystem d571ca00-7674-4884-8921-5c980d63142e
[58052.809653] BTRFS info (device sdd: state EA): last unmount of filesystem 75275c75-d561-4f09-8539-c7e469aa8b5e
[58128.605123] ata4: link is slow to respond, please be patient (ready=0)
[58133.308405] ata4: found unknown device (class 0)
[58133.465135] ata4: found unknown device (class 0)
[58133.465149] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58138.585095] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
[58138.585109] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[58138.585113] ata4.00: revalidation failed (errno=-5)
[58143.938275] ata4: link is slow to respond, please be patient (ready=0)
[58148.641560] ata4: found unknown device (class 0)
[58148.798215] ata4: found unknown device (class 0)
[58148.798230] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58158.851457] ata4.00: qc timeout after 10000 msecs (cmd 0xec)
[58158.851472] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[58158.851476] ata4.00: revalidation failed (errno=-5)
[58158.851743] ata4: limiting SATA link speed to 3.0 Gbps
[58164.208039] ata4: link is slow to respond, please be patient (ready=0)
[58168.854657] ata4: found unknown device (class 0)
[58169.011313] ata4: found unknown device (class 0)
[58169.011327] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[58200.024286] ata4.00: qc timeout after 30000 msecs (cmd 0xec)
[58200.024300] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[58200.024303] ata4.00: revalidation failed (errno=-5)
[58200.024513] ata4.00: disable device
[58205.380889] ata4: link is slow to respond, please be patient (ready=0)
[58210.027508] ata4: found unknown device (class 0)
[58210.184165] ata4: found unknown device (class 0)
[58210.184179] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[58210.185296] sd 3:0:0:0: [sdd] REPORT ZONES start lba 0 failed
[58210.185485] sd 3:0:0:0: [sdd] REPORT ZONES: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[58210.185488] sd 3:0:0:0: [sdd] Sense Key : Not Ready [current] 
[58210.185490] sd 3:0:0:0: [sdd] Add. Sense: Logical unit not ready, hard reset required
[58210.186558] sd 3:0:0:0: [sdd] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[58210.186562] sd 3:0:0:0: [sdd] tag#21 CDB: Read(16) 88 00 00 00 00 00 00 00 00 02 00 00 00 02 00 00
[58210.186564] I/O error, dev sdd, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[58210.187263] EXT4-fs (sdd): unable to read superblock
[58219.754366] BTRFS: device fsid d571ca00-7674-4884-8921-5c980d63142e devid 1 transid 293 /dev/sdg scanned by mount (124186)
[58219.755231] BTRFS info (device sdg): first mount of filesystem d571ca00-7674-4884-8921-5c980d63142e
[58219.755247] BTRFS info (device sdg): using crc32c (crc32c-intel) checksum algorithm
[58219.755250] BTRFS info (device sdg): using free-space-tree
[58220.028346] BTRFS info (device sdg): host-managed zoned block device /dev/sdg, 52156 zones of 268435456 bytes
[58220.223099] BTRFS info (device sdg): zoned mode enabled with zone size 268435456
[58242.770335] BTRFS info (device sdg): last unmount of filesystem d571ca00-7674-4884-8921-5c980d63142e
[58251.586689] BTRFS: device fsid d571ca00-7674-4884-8921-5c980d63142e devid 1 transid 295 /dev/sdg scanned by mount (124257)
[58251.587365] BTRFS info (device sdg): first mount of filesystem d571ca00-7674-4884-8921-5c980d63142e
[58251.587383] BTRFS info (device sdg): using crc32c (crc32c-intel) checksum algorithm
[58251.587387] BTRFS info (device sdg): using free-space-tree
[58251.810693] BTRFS info (device sdg): host-managed zoned block device /dev/sdg, 52156 zones of 268435456 bytes
[58251.995091] BTRFS info (device sdg): zoned mode enabled with zone size 268435456
[64856.022711] ata9.00: exception Emask 0x0 SAct 0x9000 SErr 0xc0000 action 0x6 frozen
[64856.022720] ata9: SError: { CommWake 10B8B }
[64856.022725] ata9.00: failed command: RECEIVE FPDMA QUEUED
[64856.022727] ata9.00: cmd 65/01:60:00:00:08/00:02:76:01:00/40 tag 12 ncq dma 512 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[64856.022735] ata9.00: status: { DRDY }
[64856.022738] ata9.00: failed command: WRITE FPDMA QUEUED
[64856.022740] ata9.00: cmd 61/40:78:c0:c9:06/05:00:76:01:00/40 tag 15 ncq dma 688128 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[64856.022747] ata9.00: status: { DRDY }
[64856.022751] ata9: hard resetting link
[64856.492691] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[64856.529108] ata9.00: configured for UDMA/133
[64856.539272] ata9.00: device reported invalid CHS sector 0
[64856.539279] ata9.00: device reported invalid CHS sector 0
[64856.539290] ata9: EH complete
[64856.539298] sd 8:0:0:0: [sdg] REPORT ZONES start lba 6275203072 failed
[64856.539305] sd 8:0:0:0: [sdg] REPORT ZONES: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[64856.539308] sd 8:0:0:0: [sdg] Sense Key : Illegal Request [current] 
[64856.539310] sd 8:0:0:0: [sdg] Add. Sense: Unaligned write command
[64856.539317] BTRFS error (device sdg): zoned: failed to read zone 3212903972864 on /dev/sdg (devid 1)
[64856.539321] BTRFS error (device sdg): zoned: cannot recover write pointer for zone 3212903972864
[64856.539328] BTRFS error (device sdg: state A): Transaction aborted (error -5)
[64856.539330] BTRFS: error (device sdg: state A) in find_free_extent_update_loop:4208: errno=-5 IO failure
[64856.539333] BTRFS info (device sdg: state EA): forced readonly
[64856.609605] ata9.00: exception Emask 0x0 SAct 0x100 SErr 0x0 action 0x0
[64856.609611] ata9.00: irq_stat 0x40000008
[64856.609614] ata9.00: failed command: WRITE FPDMA QUEUED
[64856.609616] ata9.00: cmd 61/40:40:c0:c9:06/05:00:76:01:00/40 tag 8 ncq dma 688128 out
                        res 43/04:40:00:cf:06/00:05:76:01:00/00 Emask 0x400 (NCQ error) <F>
[64856.609622] ata9.00: status: { DRDY SENSE ERR }
[64856.609624] ata9.00: error: { ABRT }
[64856.622643] ata9.00: configured for UDMA/133
[64856.632818] sd 8:0:0:0: [sdg] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=31s
[64856.632822] sd 8:0:0:0: [sdg] tag#8 Sense Key : Illegal Request [current] 
[64856.632826] sd 8:0:0:0: [sdg] tag#8 Add. Sense: Unaligned write command
[64856.632828] sd 8:0:0:0: [sdg] tag#8 CDB: Write(16) 8a 00 00 00 00 01 76 06 c9 c0 00 00 05 40 00 00
[64856.632830] critical target error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 11 prio class 0
[64856.632837] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[64856.632939] ata9: EH complete
[64888.425788] ata9.00: exception Emask 0x0 SAct 0x600000 SErr 0x0 action 0x6 frozen
[64888.425798] ata9.00: failed command: WRITE FPDMA QUEUED
[64888.425801] ata9.00: cmd 61/40:a8:40:fe:06/05:00:76:01:00/40 tag 21 ncq dma 688128 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[64888.425809] ata9.00: status: { DRDY }
[64888.425812] ata9.00: failed command: RECEIVE FPDMA QUEUED
[64888.425814] ata9.00: cmd 65/01:b0:00:00:00/00:02:00:00:00/40 tag 22 ncq dma 512 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[64888.425820] ata9.00: status: { DRDY }
[64888.425825] ata9: hard resetting link
[64893.779056] ata9: link is slow to respond, please be patient (ready=0)
[64898.429033] ata9: found unknown device (class 0)
[64898.585693] ata9: found unknown device (class 0)
[64898.585708] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[64903.728986] ata9.00: qc timeout after 5000 msecs (cmd 0xec)
[64903.729000] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[64903.729002] ata9.00: revalidation failed (errno=-5)
[64903.729007] ata9: hard resetting link
[64909.088946] ata9: link is slow to respond, please be patient (ready=0)
[64913.735571] ata9: found unknown device (class 0)
[64913.892233] ata9: found unknown device (class 0)
[64913.892248] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[64923.995992] ata9.00: qc timeout after 10000 msecs (cmd 0xec)
[64923.996007] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[64923.996010] ata9.00: revalidation failed (errno=-5)
[64923.996014] ata9: limiting SATA link speed to 3.0 Gbps
[64923.996020] ata9: hard resetting link
[64929.355441] ata9: link is slow to respond, please be patient (ready=0)
[64934.008740] ata9: found unknown device (class 0)
[64934.165421] ata9: found unknown device (class 0)
[64934.165435] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[64965.168506] ata9.00: qc timeout after 30000 msecs (cmd 0xec)
[64965.168521] ata9.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[64965.168524] ata9.00: revalidation failed (errno=-5)
[64965.168528] ata9.00: disable device
[64970.521774] ata9: link is slow to respond, please be patient (ready=0)
[64975.225077] ata9: found unknown device (class 0)
[64975.381739] ata9: found unknown device (class 0)
[64975.381753] ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
[64975.382831] ata9: EH complete
[64975.382855] sd 8:0:0:0: [sdg] tag#12 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=118s
[64975.382859] sd 8:0:0:0: [sdg] tag#12 CDB: Write(16) 8a 00 00 00 00 01 76 06 fe 40 00 00 05 40 00 00
[64975.382860] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 12 prio class 0
[64975.382862] sd 8:0:0:0: [sdg] REPORT ZONES start lba 0 failed
[64975.382867] sd 8:0:0:0: [sdg] REPORT ZONES: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[64975.382867] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
[64975.382875] sd 8:0:0:0: [sdg] Sense Key : Not Ready [current] 
[64975.382878] sd 8:0:0:0: [sdg] Add. Sense: Logical unit not ready, hard reset required
[64975.382883] sd 8:0:0:0: [sdg] 0 512-byte logical blocks: (0 B/0 B)
[64975.382885] sd 8:0:0:0: [sdg] 4096-byte physical blocks
[64975.382932] sd 8:0:0:0: [sdg] REPORT ZONES start lba 6274678784 failed
[64975.382936] sd 8:0:0:0: [sdg] REPORT ZONES: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[64975.383020] sd 8:0:0:0: [sdg] tag#30 access beyond end of device
[64975.383023] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 12 prio class 0
[64975.383029] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
[64975.383130] sd 8:0:0:0: [sdg] tag#0 access beyond end of device
[64975.383132] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 12 prio class 0
[64975.383136] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
[64975.383215] sd 8:0:0:0: [sdg] tag#1 access beyond end of device
[64975.383217] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 11 prio class 0
[64975.383219] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
[64975.383291] sd 8:0:0:0: [sdg] tag#2 access beyond end of device
[64975.383292] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 18 prio class 0
[64975.383295] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
[64975.383369] sd 8:0:0:0: [sdg] tag#3 access beyond end of device
[64975.383371] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 21 prio class 0
[64975.383373] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
[64975.383448] sd 8:0:0:0: [sdg] tag#4 access beyond end of device
[64975.383450] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 11 prio class 0
[64975.383452] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
[64975.383524] sd 8:0:0:0: [sdg] tag#5 access beyond end of device
[64975.383526] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 12 prio class 0
[64975.383528] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
[64975.383600] sd 8:0:0:0: [sdg] tag#6 access beyond end of device
[64975.383602] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 11 prio class 0
[64975.383604] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
[64975.383677] sd 8:0:0:0: [sdg] tag#7 access beyond end of device
[64975.383679] I/O error, dev sdg, sector 6274678784 op 0x7:(ZONE_APPEND) flags 0x104000 phys_seg 11 prio class 0
[64975.383681] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
[64975.383752] sd 8:0:0:0: [sdg] tag#8 access beyond end of device
[64975.383822] sd 8:0:0:0: [sdg] tag#9 access beyond end of device
[64975.383891] sd 8:0:0:0: [sdg] tag#10 access beyond end of device
[64975.383961] sd 8:0:0:0: [sdg] tag#11 access beyond end of device
[64975.384032] sd 8:0:0:0: [sdg] tag#12 access beyond end of device
[64975.384103] sd 8:0:0:0: [sdg] tag#13 access beyond end of device
[64975.384184] sd 8:0:0:0: [sdg] tag#14 access beyond end of device
[64975.384296] sd 8:0:0:0: [sdg] tag#15 access beyond end of device
[64975.384402] sd 8:0:0:0: [sdg] tag#16 access beyond end of device
[64975.384448] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.384490] sd 8:0:0:0: [sdg] tag#17 access beyond end of device
[64975.384605] sd 8:0:0:0: [sdg] tag#18 access beyond end of device
[64975.384610] sd 8:0:0:0: [sdg] tag#19 access beyond end of device
[64975.384644] btrfs-transacti: attempt to access beyond end of device
               sdg: rw=4096, sector=2677363872, nr_sectors = 32 limit=0
[64975.384665] BTRFS error (device sdg: state EA): failed to run delayed ref for logical 1499211956224 num_bytes 65536 type 178 action 1 ref_mod 1: -5
[64975.384671] BTRFS: error (device sdg: state EA) in btrfs_run_delayed_refs:2249: errno=-5 IO failure
[64975.384680] sd 8:0:0:0: [sdg] tag#20 access beyond end of device
[64975.384750] sd 8:0:0:0: [sdg] tag#22 access beyond end of device
[64975.384819] sd 8:0:0:0: [sdg] tag#23 access beyond end of device
[64975.384888] sd 8:0:0:0: [sdg] tag#31 access beyond end of device
[64975.384957] sd 8:0:0:0: [sdg] tag#0 access beyond end of device
[64975.385027] sd 8:0:0:0: [sdg] tag#1 access beyond end of device
[64975.385106] sd 8:0:0:0: [sdg] tag#2 access beyond end of device
[64975.385176] sd 8:0:0:0: [sdg] tag#3 access beyond end of device
[64975.385245] sd 8:0:0:0: [sdg] tag#4 access beyond end of device
[64975.385315] sd 8:0:0:0: [sdg] tag#5 access beyond end of device
[64975.385384] sd 8:0:0:0: [sdg] tag#6 access beyond end of device
[64975.385453] sd 8:0:0:0: [sdg] tag#7 access beyond end of device
[64975.385528] sd 8:0:0:0: [sdg] tag#8 access beyond end of device
[64975.385666] sd 8:0:0:0: [sdg] tag#9 access beyond end of device
[64975.385736] sd 8:0:0:0: [sdg] tag#10 access beyond end of device
[64975.385747] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.385859] sd 8:0:0:0: [sdg] tag#11 access beyond end of device
[64975.385935] sd 8:0:0:0: [sdg] tag#12 access beyond end of device
[64975.386005] sd 8:0:0:0: [sdg] tag#13 access beyond end of device
[64975.386095] sd 8:0:0:0: [sdg] tag#21 access beyond end of device
[64975.386186] sd 8:0:0:0: [sdg] tag#24 access beyond end of device
[64975.386278] sd 8:0:0:0: [sdg] tag#25 access beyond end of device
[64975.386369] sd 8:0:0:0: [sdg] tag#26 access beyond end of device
[64975.386461] sd 8:0:0:0: [sdg] tag#27 access beyond end of device
[64975.386553] sd 8:0:0:0: [sdg] tag#28 access beyond end of device
[64975.386645] sd 8:0:0:0: [sdg] tag#29 access beyond end of device
[64975.386736] sd 8:0:0:0: [sdg] tag#30 access beyond end of device
[64975.386834] sd 8:0:0:0: [sdg] tag#0 access beyond end of device
[64975.386913] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.386925] sd 8:0:0:0: [sdg] tag#1 access beyond end of device
[64975.387017] sd 8:0:0:0: [sdg] tag#2 access beyond end of device
[64975.387105] sd 8:0:0:0: [sdg] tag#3 access beyond end of device
[64975.387175] sd 8:0:0:0: [sdg] tag#4 access beyond end of device
[64975.387244] sd 8:0:0:0: [sdg] tag#5 access beyond end of device
[64975.387313] sd 8:0:0:0: [sdg] tag#6 access beyond end of device
[64975.387382] sd 8:0:0:0: [sdg] tag#7 access beyond end of device
[64975.387451] sd 8:0:0:0: [sdg] tag#8 access beyond end of device
[64975.387520] sd 8:0:0:0: [sdg] tag#9 access beyond end of device
[64975.387589] sd 8:0:0:0: [sdg] tag#10 access beyond end of device
[64975.387658] sd 8:0:0:0: [sdg] tag#11 access beyond end of device
[64975.387727] sd 8:0:0:0: [sdg] tag#12 access beyond end of device
[64975.387798] sdg: detected capacity change from 27344764928 to 0
[64975.387902] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.388791] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.389558] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.390268] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.390937] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.391538] kworker/u65:1: attempt to access beyond end of device
               sdg: rw=1048583, sector=6275727360, nr_sectors = 1344 limit=0
[64975.564724] sd 8:0:0:0: [sdg] tag#22 access beyond end of device
[65042.187519] bio_check_eod: 369 callbacks suppressed
[65042.187522] kworker/u65:5: attempt to access beyond end of device
               sdg: rw=1048583, sector=6273105920, nr_sectors = 640 limit=0
[65042.187527] btrfs_dev_stat_inc_and_print: 431 callbacks suppressed
[65042.187528] BTRFS error (device sdg: state EA): bdev /dev/sdg errs: wr 441, rd 2, flush 0, corrupt 0, gen 0

I am not sure if your program did anything; it did not print anything on screen. I ran it, then started the copy, then the drive went RO. Also, this time lsblk shows the drive as 0B. It sometimes does this when it goes RO.


[gentoo@archlinux ~]$ sudo ./morbidrsa_program.py 
Tracing... Hit Ctrl-C to end
^C[gentoo@archlinux ~]$ 

sdg 8:96 0 0B 0 disk /home/gentoo/btrfs_test2
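
As a cross-check of the sdg log above, the LBAs, byte offsets, and zone numbers can be related with a little arithmetic. This is a sketch only: the constants are copied from the log lines ("512-byte logical blocks", "52156 zones of 268435456 bytes"), and the helper names are invented for illustration.

```python
SECTOR_SIZE = 512      # bytes, from "512-byte logical blocks"
ZONE_SIZE = 268435456  # bytes, from "52156 zones of 268435456 bytes"


def lba_to_bytes(lba: int) -> int:
    """Byte offset of a 512-byte logical block address."""
    return lba * SECTOR_SIZE


def zone_index(byte_offset: int) -> int:
    """Index of the zone containing the given byte offset."""
    return byte_offset // ZONE_SIZE


def is_zone_start(byte_offset: int) -> bool:
    """True if the offset sits exactly on a zone boundary."""
    return byte_offset % ZONE_SIZE == 0


# "REPORT ZONES start lba 6275203072 failed"
report_off = lba_to_bytes(6275203072)
# "sector 6274678784 op 0x7:(ZONE_APPEND)"
append_off = lba_to_bytes(6274678784)
# "detected capacity change from 27344764928 to 0" (in 512-byte sectors)
capacity = lba_to_bytes(27344764928)

print(report_off, zone_index(report_off), is_zone_start(report_off))
print(append_off, zone_index(append_off), is_zone_start(append_off))
print(capacity // ZONE_SIZE)
```

The first offset equals the 3212903972864 that btrfs reports in "failed to read zone", the failing ZONE_APPEND targets the zone immediately before it, and the capacity before the drop corresponds to the 52156 zones reported at mount time.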

naota commented 2 months ago

Since it uses bpf_trace_printk, you can retrieve the log with this.

$ sudo cat /sys/kernel/debug/tracing/trace_pipe
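
Once the trace_pipe output has been saved to a file, the relevant fields can be pulled out with a small parser. This is a sketch under assumptions: the regular expression matches the format string used in the tracing script above, the sample line is made up (real trace_pipe prefixes vary by kernel), and the function names are hypothetical. Note that the script already shifts the limits, so the last two fields are in bytes.

```python
import re

# Matches the bpf_trace_printk format string from the tracing script.
TRACE_RE = re.compile(
    r"nr_zones=(\d+), max_hw_sectors=(\d+), max_segments=(\d+)"
)


def parse_trace_line(line):
    """Return the traced values as a dict, or None if the line is unrelated."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    nr_zones, hw_bytes, seg_bytes = (int(g) for g in match.groups())
    return {
        "nr_zones": nr_zones,
        "max_hw_bytes": hw_bytes,        # max_hw_sectors << 9
        "max_segment_bytes": seg_bytes,  # max_segments << 12
    }


def scan(lines):
    """Yield parsed records for every matching line in an iterable of lines."""
    for line in lines:
        record = parse_trace_line(line)
        if record is not None:
            yield record


# Made-up sample line for illustration only.
sample = ("kworker/u65:8-76227 [002] d..1. 42826.019000: bpf_trace_printk: "
          "nr_zones=1, max_hw_sectors=1048576, max_segments=688128")
print(parse_trace_line(sample))
```

`scan()` accepts any iterable of strings, so an open file object holding the saved trace can be passed to it directly.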

Motophan commented 2 months ago

Well, that's unfortunate, as I need to redo it.

naota commented 2 months ago

BTW, in the log here (https://github.com/kdave/btrfs-progs/issues/779#issuecomment-2067471115), I can see a lot of device resets and errors like this:

[34567.117262] scsi host11: uas_eh_device_reset_handler start
[34567.190732] usb 6-1.1.3: reset SuperSpeed USB device number 13 using xhci_hcd
[34567.208600] scsi host11: uas_eh_device_reset_handler success
[34665.435892] sd 11:0:0:0: [sdj] tag#7 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD IN
[34665.435903] sd 11:0:0:0: [sdj] tag#7 CDB: Read(16) 88 00 00 00 00 00 25 77 7b 48 00 00 01 00 00 00
[34665.435966] sd 11:0:0:0: [sdj] tag#6 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[34665.435969] sd 11:0:0:0: [sdj] tag#6 CDB: Read(16) 88 00 00 00 00 00 25 77 79 48 00 00 02 00 00 00
[34689.328898] sd 11:0:0:0: [sdj] tag#15 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD IN
[34689.328910] sd 11:0:0:0: [sdj] tag#15 CDB: Read(16) 88 00 00 00 00 00 8d 8d 5b e0 00 00 00 18 00 00
[34689.375569] scsi host11: uas_eh_device_reset_handler start

Also, in https://github.com/kdave/btrfs-progs/issues/779#issuecomment-2067524950 there are many device errors.

[58128.605123] ata4: link is slow to respond, please be patient (ready=0)
[58133.308405] ata4: found unknown device (class 0)
[58133.465135] ata4: found unknown device (class 0)
[58133.465149] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58138.585095] ata4.00: qc timeout after 5000 msecs (cmd 0xec)
[58138.585109] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[58138.585113] ata4.00: revalidation failed (errno=-5)
[58143.938275] ata4: link is slow to respond, please be patient (ready=0)
[58148.641560] ata4: found unknown device (class 0)
[58148.798215] ata4: found unknown device (class 0)
[58148.798230] ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[58158.851457] ata4.00: qc timeout after 10000 msecs (cmd 0xec)
[58158.851472] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[58158.851476] ata4.00: revalidation failed (errno=-5)

And these lines indicate that the device (sdg) itself went away because of the drive errors, which is why its size was reported as 0.

[64975.382875] sd 8:0:0:0: [sdg] Sense Key : Not Ready [current]
[64975.382878] sd 8:0:0:0: [sdg] Add. Sense: Logical unit not ready, hard reset required
[64975.382883] sd 8:0:0:0: [sdg] 0 512-byte logical blocks: (0 B/0 B)
[64975.382885] sd 8:0:0:0: [sdg] 4096-byte physical blocks
[64975.382932] sd 8:0:0:0: [sdg] REPORT ZONES start lba 6274678784 failed
[64975.382936] sd 8:0:0:0: [sdg] REPORT ZONES: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[64975.383020] sd 8:0:0:0: [sdg] tag#30 access beyond end of device

So it seems your device (or its connection) is really unstable. Maybe there are too many devices connected and not enough power, or there might be an issue on the device side.
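
To get a quick overview of how often each kind of error shows up in a long dmesg dump, something like the following Python sketch could help. The patterns are taken from the messages quoted above and are not an exhaustive list of kernel error strings; the category names are made up for illustration.

```python
import re
from collections import Counter

# Illustrative patterns based on the log lines quoted in this thread.
PATTERNS = {
    "usb_reset": re.compile(r"uas_eh_device_reset_handler start"),
    "ata_link_slow": re.compile(r"ata\d+: link is slow to respond"),
    "identify_failed": re.compile(r"failed to IDENTIFY"),
    "report_zones_failed": re.compile(r"REPORT ZONES start lba \d+ failed"),
    "beyond_end_of_device": re.compile(r"access beyond end of device"),
}


def summarize(lines):
    """Count how many lines match each pattern category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the logs above, used as sample input.
SAMPLE = [
    "[34567.117262] scsi host11: uas_eh_device_reset_handler start",
    "[34567.208600] scsi host11: uas_eh_device_reset_handler success",
    "[58128.605123] ata4: link is slow to respond, please be patient (ready=0)",
    "[58138.585109] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
    "[64975.382932] sd 8:0:0:0: [sdg] REPORT ZONES start lba 6274678784 failed",
    "[64975.383020] sd 8:0:0:0: [sdg] tag#30 access beyond end of device",
]

if __name__ == "__main__":
    for name, n in sorted(summarize(SAMPLE).items()):
        print(f"{name}: {n}")
```

To run it on a real log, pass an open file (for example a saved dmesg text file) to `summarize` instead of `SAMPLE`.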

Motophan commented 2 months ago

I had an unstable fclk memory OC during the last dmesg; I forgot to switch it back down. I am now running the RAM at the JEDEC standard 2666MT/s and redoing the test.

I did not have the unstable fclk during the initial tests because I specifically set it to base speed for troubleshooting. Sorry for the inconvenience.

Motophan commented 2 months ago

no_fclk_unstability.dmesg.txt (forgot to run morbidrsa's program, whoops)

This is with a 100GB pagefile and 128GB of system RAM. It says out of memory, though :thinking:


[gentoo@archlinux plot_temp]$ sudo btrfs filesystem usage /mnt/hdd05
Overall:
    Device size:          12.73TiB
    Device allocated:          3.66TiB
    Device unallocated:        9.07TiB
    Device missing:          0.00B
    Device slack:            0.00B
    Device zone unusable:      1.50GiB
    Device zone size:        256.00MiB
    Used:              3.66TiB
    Free (estimated):          9.08TiB  (min: 4.54TiB)
    Free (statfs, df):         9.08TiB
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 48.00KiB)
    Multiple profiles:              no

Data,single: Size:3.65TiB, Used:3.65TiB (99.97%)
   /dev/sde    3.65TiB

Metadata,DUP: Size:5.75GiB, Used:4.90GiB (85.28%)
   /dev/sde   11.50GiB

System,DUP: Size:256.00MiB, Used:1.55MiB (0.60%)
   /dev/sde  512.00MiB

Unallocated:
   /dev/sde    9.07TiB

2nd.dmesg.txt

morbidrsa's program didn't print anything while it was running in a terminal, so I ran cat on the trace_pipe path that naota gave and appended (>>) the output to a file. I was not at the computer when the failure happened, so I don't know whether this caused a problem.

bpf_trace_prink.txt

morbidrsa commented 2 months ago

So the trace doesn’t show anything unusual. Still no idea why we are OOM here.

morbidrsa commented 2 months ago

@Motophan can you please apply this debug patch, so we can see why we're falling down to a vmalloc of 0?

diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 26af5ab7d7c1..7b94a1cabafd 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -221,6 +221,8 @@ static void *sd_zbc_alloc_report_buffer(struct scsi_disk *sdkp,
        bufsize = min_t(size_t, bufsize, queue_max_segments(q) << PAGE_SHIFT);

        while (bufsize >= SECTOR_SIZE) {
+               printk(KERN_ERR "nr_zones=%u, sdkp->zone_info.nr_zones=%u, bufsize=%zu, sdkp->capacity=%llu, sdkp->zone_info.zone_blocks=%u\n",
+                      nr_zones, sdkp->zone_info.nr_zones, bufsize, sdkp->capacity,  sdkp->zone_info.zone_blocks);
                buf = __vmalloc(bufsize,
                                GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY);
                if (buf) {

Motophan commented 2 months ago

I don't know how to apply this. I am on Arch Linux and am not sure what to do. Assuming I have gcc and everything else installed, do I pull the repo, open the file you mentioned in nano or similar, save it, and then compile? And then replace btrfs-progs with the result?

Compile-and-replace usually works for a binary, but this is more kernel-level than btrfs-progs, and I'm not sure that would be the proper procedure.

morbidrsa commented 2 months ago

OK then let's try something different here. Can you re-test with the Nvidia kernel module unloaded? It is an out-of-tree module and we can't be sure it's not doing stupid things.

Motophan commented 2 months ago

no_nvidia_drivers_dmesg.txt
4x 18tb -> 14tb hmsmr --bwlimit=70M no problem
6x 18tb -> 14tb hmsmr --bwlimit=70M problem in 1 hour
mv (no speed limit) problems in 1 hour.

128gb ram, 100gb pagefile.

morbidrsa commented 2 months ago

This is where everything goes south:

Apr 28 18:53:14.763628 archlinux kernel: ata6.00: exception Emask 0x0 SAct 0x104000 SErr 0xc0000 action 0x6 frozen
Apr 28 18:53:14.763963 archlinux kernel: ata6: SError: { CommWake 10B8B }
Apr 28 18:53:14.763995 archlinux kernel: ata6.00: failed command: RECEIVE FPDMA QUEUED
Apr 28 18:53:14.764028 archlinux kernel: ata6.00: cmd 65/01:70:00:00:88/00:02:7e:04:00/40 tag 14 ncq dma 512 in res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 28 18:53:14.764051 archlinux kernel: ata6.00: status: { DRDY }
Apr 28 18:53:14.764071 archlinux kernel: ata6.00: failed command: WRITE FPDMA QUEUED
Apr 28 18:53:14.764091 archlinux kernel: ata6.00: cmd 61/40:a0:80:5a:87/05:00:7e:04:00/40 tag 20 ncq dma 688128 out res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 28 18:53:14.767355 archlinux kernel: ata6.00: status: { DRDY }
Apr 28 18:53:14.767403 archlinux kernel: ata6: hard resetting link
Apr 28 18:53:15.240541 archlinux kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 28 18:53:15.263982 archlinux kernel: ata6.00: configured for UDMA/133
Apr 28 18:53:15.273586 archlinux kernel: ata6.00: device reported invalid CHS sector 0
Apr 28 18:53:15.273648 archlinux kernel: ata6.00: device reported invalid CHS sector 0
Apr 28 18:53:15.273668 archlinux kernel: ata6: EH complete
Apr 28 18:53:15.273688 archlinux kernel: sd 5:0:0:0: [sdd] REPORT ZONES start lba 19302711296 failed
Apr 28 18:53:15.274016 archlinux kernel: sd 5:0:0:0: [sdd] REPORT ZONES: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
Apr 28 18:53:15.274244 archlinux kernel: sd 5:0:0:0: [sdd] Sense Key : Illegal Request [current]
Apr 28 18:53:15.274471 archlinux kernel: sd 5:0:0:0: [sdd] Add. Sense: Unaligned write command
Apr 28 18:53:15.274684 archlinux kernel: BTRFS error (device sdd): zoned: failed to read zone 9882988183552 on /dev/sdd (devid 1)
Apr 28 18:53:15.274704 archlinux kernel: BTRFS error (device sdd): zoned: cannot recover write pointer for zone 9882988183552
Apr 28 18:53:15.274728 archlinux kernel: BTRFS error (device sdd: state A): Transaction aborted (error -5)
Apr 28 18:53:15.274748 archlinux kernel: BTRFS: error (device sdd: state A) in find_free_extent_update_loop:4208: errno=-5 IO failure
Apr 28 18:53:15.274849 archlinux kernel: BTRFS info (device sdd: state EA): forced readonly

So the device ata6.00/sdd times out on a REPORT ZONES command. As a result, btrfs cannot recover the zone write pointer, so it forces the FS read-only to prevent further damage.

Now we need to find out whether it's the kernel or the drive firmware that causes sdd to time out.
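
As a sanity check, the LBA in the failed REPORT ZONES and the zone offset in the btrfs error refer to the same zone. The Python sketch below does the arithmetic, assuming 512-byte logical blocks and a 256MiB zone size. The zone size comes from the `btrfs filesystem usage` output posted earlier for another drive of this model, so for sdd it is an assumption.

```python
SECTOR_SIZE = 512              # assumed logical block size in bytes
ZONE_SIZE = 256 * 1024 * 1024  # assumed zone size in bytes (256MiB)


def lba_to_byte_offset(lba):
    """Convert a logical block address to a byte offset on the device."""
    return lba * SECTOR_SIZE


def zone_index(byte_offset):
    """Return the index of the zone containing the given byte offset."""
    return byte_offset // ZONE_SIZE


def is_zone_start(byte_offset):
    """Return True if the byte offset falls exactly on a zone boundary."""
    return byte_offset % ZONE_SIZE == 0


REPORT_ZONES_LBA = 19302711296    # from the sd REPORT ZONES failure line
BTRFS_ZONE_OFFSET = 9882988183552  # from the BTRFS zoned error lines

if __name__ == "__main__":
    offset = lba_to_byte_offset(REPORT_ZONES_LBA)
    print("byte offset:", offset)
    print("matches btrfs offset:", offset == BTRFS_ZONE_OFFSET)
    print("zone index:", zone_index(offset))
    print("zone aligned:", is_zone_start(offset))
```

Under those assumptions the LBA converts to exactly the byte offset btrfs reports and sits on a zone boundary, so the failed command is the report for the zone btrfs was trying to read.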

Motophan commented 2 months ago

It's load-related. More load makes this happen faster. With less load it may not happen at all, or only about once every 5TB of data (a typical user is unlikely to dump massive amounts of data onto their drives without stopping).

I can do the diff thing if you explain it to me or link a guide. I can follow directions; I am just not a developer, so I have never used git for anything other than compiling the latest version of an app.