kdave / btrfs-progs

Development of userspace BTRFS tools
GNU General Public License v2.0

btrfs failed to read-only state due to btrfs_free_extent:3116: errno=-28 #774

Closed: eliran-zada-zesty closed this issue 2 months ago

eliran-zada-zesty commented 3 months ago

Problem

A btrfs filesystem of about 8TB, serving a Kafka workload on Amazon Linux 2 with kernel version 5.4, was forced into a read-only state.

Instance data

kernel version: 5.4.271-184.369.amzn2.aarch64
OS type: Amazon Linux 2
CPU Type: Arm-based AWS Graviton2
btrfs-progs: 4.15.1

Instance state pre-failure

Memory: 62.0%
CPU: 27.6%-7.3%

BTRFS state pre-failure

{
  "Overall": {
    "Device size": "8.06TiB",
    "Device allocated": "8.06TiB",
    "Device unallocated": "4.00MiB",
    "Device missing": "0.00B",
    "Used": "7.56TiB",
    "Free (estimated)": "512.43GiB (min: 512.43GiB)",
    "Data ratio": "1.00",
    "Metadata ratio": "1.00",
    "Global reserve": "512.00MiB (used: 0.00B)"
  },
  "Data,single": {
    "Size": "8.05TiB",
    "Used": "7.55TiB",
    "/dev/nvme13n1": "7.67TiB",
    "/dev/nvme14n1": "143.00GiB",
    "/dev/nvme4n1": "105.00GiB",
    "/dev/nvme8n1": "143.00GiB"
  },
  "Metadata,single": {
    "Size": "12.00GiB",
    "Used": "10.79GiB",
    "/dev/nvme13n1": "12.00GiB"
  },
  "System,single": {
    "Size": "32.00MiB",
    "Used": "976.00KiB",
    "/dev/nvme13n1": "32.00MiB"
  },
  "Unallocated": {
    "/dev/nvme13n1": "1.00MiB",
    "/dev/nvme14n1": "1.00MiB",
    "/dev/nvme4n1": "1.00MiB",
    "/dev/nvme8n1": "1.00MiB"
  }
}

Observed issue


[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in __btrfs_free_extent:3116: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in __btrfs_free_extent:3116: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS info (device nvme13n1): forced readonly
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in btrfs_run_delayed_refs:2219: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in __btrfs_free_extent:3116: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in btrfs_run_delayed_refs:2219: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in __btrfs_free_extent:3116: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in btrfs_run_delayed_refs:2219: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] BTRFS: Transaction aborted (error -28)
[Wed Apr  3 12:12:28 2024] BTRFS: error (device nvme13n1) in btrfs_run_delayed_refs:2219: errno=-28 No space left
[Wed Apr  3 12:12:28 2024] WARNING: CPU: 7 PID: 2368 at fs/btrfs/extent-tree.c:3116 __btrfs_free_extent.isra.50+0x670/0xb64 [btrfs]
[Wed Apr  3 12:12:29 2024] Modules linked in: vfat fat dm_mirror dm_region_hash dm_log dm_mod ghash_ce sha2_ce sha256_arm64 ena sha1_ce ptp pps_core auth_rpcgss sunrpc btrfs xor xor_neon zstd_decompress zstd_compress raid6_pq
[Wed Apr  3 12:12:29 2024] CPU: 7 PID: 2368 Comm: kafka-scheduler Not tainted 5.4.271-184.369.amzn2.aarch64 #1
[Wed Apr  3 12:12:29 2024] Hardware name: Amazon EC2 c6g.4xlarge/, BIOS 1.0 11/1/2018
[Wed Apr  3 12:12:29 2024] pstate: 40400005 (nZcv daif +PAN -UAO)
[Wed Apr  3 12:12:29 2024] pc : __btrfs_free_extent.isra.50+0x670/0xb64 [btrfs]
[Wed Apr  3 12:12:29 2024] lr : __btrfs_free_extent.isra.50+0x670/0xb64 [btrfs]
[Wed Apr  3 12:12:29 2024] sp : ffff800012aaba00
[Wed Apr  3 12:12:29 2024] x29: ffff800012aaba90 x28: 0000000000000005 
[Wed Apr  3 12:12:29 2024] x27: 0000000000000000 x26: 0000000000000000 
[Wed Apr  3 12:12:29 2024] x25: 00000000ffffffe4 x24: ffff000b2e810000 
[Wed Apr  3 12:12:29 2024] x23: 0000000000003000 x22: ffff00077117c540 
[Wed Apr  3 12:12:29 2024] x21: ffff000afa7cad00 x20: 00004df57dd7c000 
[Wed Apr  3 12:12:29 2024] x19: 0000000000ae8cf0 x18: ffffffffffffffff 
[Wed Apr  3 12:12:29 2024] x17: 0000000000000000 x16: 0000000000000000 
[Wed Apr  3 12:12:29 2024] x15: ffff800010fa9740 x14: 0720072007200720 
[Wed Apr  3 12:12:29 2024] x13: 0720072007200720 x12: 0720072007200720 
[Wed Apr  3 12:12:29 2024] x11: 0720072007200720 x10: 0720072007200720 
[Wed Apr  3 12:12:29 2024] x9 : 0720072007200720 x8 : 0720072007200720 
[Wed Apr  3 12:12:29 2024] x7 : 0000000000000000 x6 : ffff000b3ace2990 
[Wed Apr  3 12:12:29 2024] x5 : ffff000b3ace2990 x4 : 0000000000000000 
[Wed Apr  3 12:12:29 2024] x3 : ffff000b3acf27c8 x2 : ffff000b3ace2990 
[Wed Apr  3 12:12:29 2024] x1 : a63c41a62ec29d00 x0 : 0000000000000000 
[Wed Apr  3 12:12:29 2024] Call trace:
[Wed Apr  3 12:12:29 2024]  __btrfs_free_extent.isra.50+0x670/0xb64 [btrfs]
[Wed Apr  3 12:12:29 2024]  __btrfs_run_delayed_refs+0x268/0x1240 [btrfs]
[Wed Apr  3 12:12:29 2024]  btrfs_run_delayed_refs+0x80/0x2e0 [btrfs]
[Wed Apr  3 12:12:29 2024]  btrfs_commit_transaction+0x68/0xbc0 [btrfs]
[Wed Apr  3 12:12:29 2024]  btrfs_sync_file+0x454/0x4b0 [btrfs]
[Wed Apr  3 12:12:29 2024]  vfs_fsync_range+0x4c/0x88
[Wed Apr  3 12:12:29 2024]  do_fsync+0x48/0x7c
[Wed Apr  3 12:12:29 2024]  __arm64_sys_fsync+0x24/0x40
[Wed Apr  3 12:12:29 2024]  el0_svc_common.constprop.3+0x108/0x220
[Wed Apr  3 12:12:29 2024]  el0_svc_handler+0x34/0xb0
[Wed Apr  3 12:12:29 2024]  el0_svc+0x10/0x140
[Wed Apr  3 12:12:29 2024] ---[ end trace a62a6ebb549ecb4f ]---
[Wed Apr  3 12:12:29 2024] BTRFS: error (device nvme13n1) in __btrfs_free_extent:3116: errno=-28 No space left
[Wed Apr  3 12:12:29 2024] BTRFS: error (device nvme13n1) in btrfs_run_delayed_refs:2219: errno=-28 No space left
[Wed Apr  3 13:00:51 2024] BTRFS error (device nvme13n1): parent transid verify failed on 109308518629376 wanted 1442154 found 1442149
[Wed Apr  3 13:00:51 2024] BTRFS error (device nvme13n1): parent transid verify failed on 109308518629376 wanted 1442154 found 1442149

Forza-tng commented 3 months ago

Looks like you're out of unallocated disk space: "Device unallocated": "4.00MiB".

michaelamar1991 commented 3 months ago

@Forza-tng Hey, I had the same issue as mentioned in this ticket. Can you elaborate a bit more on what it means that I ran out of unallocated disk space? What can lead to it and how can I prevent it? Thanks in advance :)

eliran-zada-zesty commented 3 months ago

Thanks @Forza-tng for your response! I'm not sure how it's related, since Used (7.56TiB) vs Device size/allocated (8.06TiB) shows that there is still a lot of free space... Can you please elaborate? And if it is related, can I avoid it somehow?

Forza-tng commented 3 months ago

Btrfs uses a multi-stage allocator to manage its storage. Btrfs has three different types of data: System, Metadata and Data. Btrfs divides the available disk space into chunks, or block groups, each dedicated to one data type. When Btrfs needs to write to the filesystem, it has to write into a block group that is dedicated to that specific type of data.

Block groups are allocated as they are needed. So, on a new filesystem, most of the disk space is in an unallocated state, and Btrfs can claim it for new block groups as required. A chunk is typically allocated 1GiB at a time. A block group is one or several chunks combined according to the profile used. So for the SINGLE profile, a block group is simply one chunk, or 1GiB, but for RAID1, a block group consists of 2 chunks, one on each device.
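As a rough illustration of the paragraph above, here is a toy Python model of the "is there room for one more block group?" question. It is only a sketch: the 1GiB chunk size and the SINGLE/RAID1 chunk counts are taken from the explanation above, the function name `can_allocate` is made up for this example, and the real kernel allocator is more flexible than this (for instance, it can pick other chunk sizes).

```python
CHUNK = 2**30  # 1 GiB, the chunk size quoted in the explanation above

# Chunks needed for one block group, per profile (from the explanation above).
CHUNKS_PER_BLOCK_GROUP = {"single": 1, "raid1": 2}


def can_allocate(profile, unallocated_per_device):
    """Toy check: are there enough devices with room for one chunk each?

    unallocated_per_device is a list of unallocated byte counts, one per device.
    """
    needed_devices = CHUNKS_PER_BLOCK_GROUP[profile]
    devices_with_room = [free for free in unallocated_per_device if free >= CHUNK]
    return len(devices_with_room) >= needed_devices


# The reported filesystem: four devices with 1 MiB unallocated each.
reported = [2**20] * 4
print("room for a new SINGLE block group:", can_allocate("single", reported))
```

In this simplified model the reported filesystem has no device with room for another chunk, regardless of how much free space exists inside block groups that are already allocated.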

The "No space left" errors (ENOSPC, shown as errno=-28 in the log) mean that Btrfs has no more unallocated disk space from which to create additional block groups.

Let's look at @eliran-zada-zesty's problem:

My guess here is that Btrfs wants to allocate an additional METADATA block group, but there is only 4MiB unallocated that can be used. Therefore Btrfs cannot finish its transaction and turns the filesystem read-only to protect against corruption. We can see there is about 500GiB of allocated but unused space in the DATA block groups (8.05TiB size vs 7.55TiB used). This could be reclaimed into unallocated space using btrfs balance.
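To make the numbers concrete, here is a small Python sketch that redoes the arithmetic on the values from the usage report in the issue. The helper `to_bytes` and the dictionary keys are names invented for this example; only the size strings come from the report.

```python
# Binary units as printed by btrfs-progs.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(text):
    """Convert a size string such as '8.05TiB' to a byte count."""
    # Check the longer suffixes first so that 'TiB' is not mistaken for 'B'.
    for suffix in ("KiB", "MiB", "GiB", "TiB", "B"):
        if text.endswith(suffix):
            return round(float(text[: -len(suffix)]) * UNITS[suffix])
    raise ValueError(f"unrecognised size: {text!r}")


# Values copied from the usage report in the issue.
report = {
    "unallocated": "4.00MiB",
    "data_size": "8.05TiB",
    "data_used": "7.55TiB",
    "meta_size": "12.00GiB",
    "meta_used": "10.79GiB",
}
sizes = {key: to_bytes(value) for key, value in report.items()}

# Space that is allocated to DATA block groups but holds no data.
data_slack = sizes["data_size"] - sizes["data_used"]
# Space still free inside the existing METADATA block groups.
meta_free = sizes["meta_size"] - sizes["meta_used"]

print(f"allocated but unused data space: {data_slack / 2**30:.0f} GiB")
print(f"free inside metadata block groups: {meta_free / 2**30:.2f} GiB")
print(f"unallocated: {sizes['unallocated'] / 2**20:.0f} MiB")
```

The roughly half a terabyte of "free" space is locked inside DATA block groups, so it cannot be used for metadata until a balance returns some of it to the unallocated pool.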

@eliran-zada-zesty If you can unmount and mount the filesystem and it does not immediately turn read-only, you can try btrfs balance start -dusage=0 <mountpoint> to reclaim empty block groups, so that additional metadata block groups can be allocated.

Normally, Btrfs will reclaim empty block groups, but sometimes we end up in situations where the block groups are underused and cannot be automatically reclaimed. This is why it is important to monitor and balance data block groups as needed.
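One way to do that monitoring is sketched below in Python. It assumes that btrfs filesystem usage -b prints raw byte counts and includes a "Device unallocated:" line, as in the usage output quoted in this thread. The function names and the 5GiB threshold are arbitrary choices for this example, not recommendations from btrfs-progs.

```python
import re
import subprocess


def unallocated_bytes(usage_text):
    """Extract the 'Device unallocated' value from `btrfs filesystem usage -b` output."""
    match = re.search(r"^\s*Device unallocated:\s+(\d+)\s*$", usage_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Device unallocated' line found")
    return int(match.group(1))


def has_enough_unallocated(mountpoint, threshold=5 * 2**30):
    """Return True if the unallocated space is at or above the threshold.

    The 5 GiB default is an assumption made for this sketch; pick a value
    that suits the size and write pattern of the filesystem.
    """
    result = subprocess.run(
        ["btrfs", "filesystem", "usage", "-b", mountpoint],
        check=True,
        capture_output=True,
        text=True,
    )
    return unallocated_bytes(result.stdout) >= threshold
```

A cron job or monitoring agent could call has_enough_unallocated("/mnt/btrfs") periodically and raise an alert, or trigger a filtered balance, when it returns False.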

I usually use btrfs filesystem usage -T to view disk space allocation. Here is one of my systems:

# btrfs fi us -T /
Overall:
    Device size:                 203.89GiB
    Device allocated:            127.07GiB
    Device unallocated:           76.82GiB
    Device missing:                  0.00B
    Device slack:                 24.00GiB
    Used:                         92.52GiB
    Free (estimated):             93.72GiB      (min: 55.32GiB)
    Free (statfs, df):            93.72GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)
    Multiple profiles:                  no

                  Data      Metadata System
Id Path           single    DUP      DUP      Unallocated Total     Slack
-- -------------- --------- -------- -------- ----------- --------- --------
 1 /dev/nvme0n1p5 101.01GiB 26.00GiB 64.00MiB    76.82GiB 203.89GiB 24.00GiB
-- -------------- --------- -------- -------- ----------- --------- --------
   Total          101.01GiB 13.00GiB 32.00MiB    76.82GiB 203.89GiB 24.00GiB
   Used            84.10GiB  4.21GiB 16.00KiB
eliran-zada-zesty commented 3 months ago

@Forza-tng you're awesome! Your insights are very helpful and make sense, I'll check it out.

eliran-zada-zesty commented 3 months ago

@Forza-tng do you have any idea how I can recreate it? Couldn't do it...

Forza-tng commented 2 months ago

@eliran-zada-zesty

> @Forza-tng do you have any idea how I can recreate it? Couldn't do it...

We need some details on the exact commands you used as well as dmesg output.

  1. Make sure the filesystem is not mounted.
  2. Run mount /dev/nvme13n1 /mnt/btrfs
  3. Check dmesg for the result.
  4. If the filesystem did not turn read-only, run btrfs balance start -dusage=0 /mnt/btrfs
  5. If that went well, run btrfs balance start -dusage=50,limit=1 /mnt/btrfs
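
The steps above can be written down as a short Python sketch that only prints the commands, so they can be reviewed before anything is run. The function name `recovery_commands` is invented for this example, and the device and mount point are the ones mentioned in the steps. Whether the balance succeeds still depends on the filesystem staying writable after mount.

```python
import shlex


def recovery_commands(device, mountpoint):
    """Build the command sequence from the steps above, in order."""
    return [
        ["mount", device, mountpoint],
        ["dmesg"],  # inspect manually: did the filesystem turn read-only?
        # Reclaim completely empty data block groups first.
        ["btrfs", "balance", "start", "-dusage=0", mountpoint],
        # Then relocate at most one data block group that is at most 50% full.
        ["btrfs", "balance", "start", "-dusage=50,limit=1", mountpoint],
    ]


# Dry run: print the commands instead of executing them.
for command in recovery_commands("/dev/nvme13n1", "/mnt/btrfs"):
    print(shlex.join(command))
```

Printing rather than executing is deliberate here: each step should be checked against dmesg before moving on to the next one.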

You are also using a very old kernel and btrfs-progs. It may be possible to mount the filesystem and correct the issue with a newer kernel, for example by booting a Fedora live CD.

You may get quicker responses on the IRC channel #btrfs on Libera.chat.

eliran-zada-zesty commented 2 months ago

Thanks! will check