koverstreet / bcachefs


bcachefs does not release space of deleted data when running fio in an endless loop #667

Open sithglan opened 5 months ago

sithglan commented 5 months ago

If I run the following fio commands in an endless loop, then after some time the filesystem fills up even though it does not contain any data. Unmounting and remounting the filesystem releases the space and makes the filesystem usable again.

Steps to reproduce:

modprobe brd rd_nr=1 rd_size=48000000 max_part=1
bcachefs format /dev/ram0
mount /dev/ram0 /mnt
while true; do
fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=64 --readwrite=write --unlink=1
fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randwrite --blocksize=4k --iodepth=256 --readwrite=randwrite --unlink=1
fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=read --blocksize=1m --iodepth=64 --readwrite=read --unlink=1
fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randread --blocksize=4k --iodepth=256 --readwrite=randread --unlink=1

fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=1 --readwrite=write --unlink=1
fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randwrite --blocksize=4k --iodepth=1 --readwrite=randwrite --unlink=1
fio --ioengine=libaio --filesize=16G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=read --blocksize=1m --iodepth=1 --readwrite=read --unlink=1
fio --ioengine=libaio --filesize=16G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randread --blocksize=4k --iodepth=1 --readwrite=randread --unlink=1
done

Wait 20 minutes to 1 hour and the filesystem is full. I used a ramdisk to avoid stressing my flash devices, but the same thing happens on flash; in fact, the first time I hit the issue it was with a KINGSTON SKC3000S1024G NVMe M.2 drive.

This was with:

$ uname -a
Linux midi 6.9.0-rc3 #49 SMP PREEMPT_DYNAMIC Mon Apr  8 12:26:10 CEST 2024 x86_64 GNU/Linux
$ bcachefs version
1.6.4
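
To catch the moment the leak shows up while the loop above runs, a small watcher like the following could be run in a second terminal. This is an illustrative sketch, not part of the original reproducer; the 60 second interval and the 1 GiB threshold are arbitrary choices.

# Illustrative watcher, not from the report: flag the leak once df shows the
# filesystem nearly full while no regular files are left on it.
while sleep 60; do
    avail=$(df -k /mnt | tail -1 | awk '{print $4}')
    files=$(find /mnt -maxdepth 1 -type f | wc -l)
    if [ "$avail" -lt 1048576 ] && [ "$files" -eq 0 ]; then
        echo "space leak: only ${avail} KiB available but no files on /mnt"
        break
    fi
done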
YellowOnion commented 5 months ago

Is the copygc kernel thread using lots of CPU?

sithglan commented 5 months ago

Hello Daniel,

Is the copygc kernel thread using lots of CPU?

no, it is not:

(midi) [~] df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/ram0        42G   42G     0 100% /mnt
(midi) [~] find /mnt
/mnt
/mnt/lost+found
(midi) [~] ps axuwww | grep copygc
root        2460  0.0  0.0      0     0 ?        S    08:14   0:00 [bch-copygc/ram0]
root        4724  0.0  0.0   7976  1792 pts/2    S+   08:28   0:00 grep --color=auto copygc

This time I was able to reproduce it quickly (with current Linus Torvalds git tip):

#!/bin/bash

set -x

uname -a

while [ "$(df -k /mnt | tail -1 | awk '{print $4}')" -ge 21553152 ]; do date fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=64 --readwrite=write --unlink=1 fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randwrite --blocksize=4k --iodepth=256 --readwrite=randwrite --unlink=1 fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=read --blocksize=1m --iodepth=64 --readwrite=read --unlink=1 fio --ioengine=libaio --filesize=1G --ramp_time=2s --runtime=1m --numjobs=16 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randread --blocksize=4k --iodepth=256 --readwrite=randread --unlink=1

fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=write --blocksize=1m --iodepth=1 --readwrite=write --unlink=1
fio --ioengine=libaio --filesize=32G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randwrite --blocksize=4k --iodepth=1 --readwrite=randwrite --unlink=1
fio --ioengine=libaio --filesize=16G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=read --blocksize=1m --iodepth=1 --readwrite=read --unlink=1
fio --ioengine=libaio --filesize=16G --ramp_time=2s --runtime=1m --numjobs=1 --direct=1 --verify=0 --randrepeat=0 --group_reporting --directory=/mnt --name=randread --blocksize=4k --iodepth=1 --readwrite=randread --unlink=1
df -h
ps axuwww
sleep 1

done;

Log file: https://tg.st/u/05b1816b3a6f97dff582a9a947a44e288f3bd545fefee37cbcefc1f66b895b6e.fio.log.zst

This time when the script aborted, some files were left lying around (they should not have been, since fio is instructed to delete them). However, even after I deleted them manually, the space did not free up:

(midi) [~] df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/ram0        42G   42G     0 100% /mnt
(midi) [~] ls -al /mnt/
total 4
drwxr-xr-x  3 root root           0 Apr 13 08:23 .
drwxr-xr-x 22 root root        4096 Mar 25 19:44 ..
drwx------  2 root root           0 Apr 13 08:14 lost+found
-rw-r--r--  1 root root 17179869184 Apr 13 08:23 randread.0.0
-rw-r--r--  1 root root 17179869184 Apr 13 08:23 read.0.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.1.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.10.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.2.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.3.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.4.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.5.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.6.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.7.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.8.0
-rw-r--r--  1 root root  1073741824 Apr 13 08:22 read.9.0
(midi) [~] rm /mnt/r*
(midi) [~] df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/ram0        42G   42G     0 100% /mnt
(midi) [~] ps axuwww | grep copygc
root        2460  0.0  0.0      0     0 ?        S    08:14   0:00 [bch-copygc/ram0]
root        4684  0.0  0.0   7976  1792 pts/2    S+   08:26   0:00 grep --color=auto copygc

Also: unmounting and mounting releases the space:

(midi) [~] df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/ram0        42G   42G     0 100% /mnt
(midi) [~] find /mnt
/mnt
/mnt/lost+found
(midi) [~] umount /mnt
(midi) [~] mount /dev/ram0 /mnt
bcachefs (/dev/nvme0n1): error reading default superblock: Not a bcachefs superblock (got magic 00000000-0000-0000-0000-000000000000)
bcachefs (/dev/nvme0n1): error reading superblock: Not a bcachefs superblock layout
bcachefs (/dev/nvme0n1p1): error reading default superblock: Not a bcachefs superblock (got magic 00000000-0000-0000-0000-000000000000)
bcachefs (/dev/nvme0n1p1): error reading superblock: Not a bcachefs superblock layout
bcachefs (/dev/nvme0n1p2): error reading default superblock: Not a bcachefs superblock (got magic 3d405d4b-dd1f-d494-0000-000000000000)
bcachefs (/dev/nvme0n1p2): error reading superblock: Not a bcachefs superblock layout
bcachefs (/dev/nvme0n1p3): error reading default superblock: Not a bcachefs superblock (got magic ef59b91d-e41f-55c1-0000-000000000000)
bcachefs (/dev/nvme0n1p3): error reading superblock: Not a bcachefs superblock layout
bcachefs (/dev/loop0): error reading default superblock: IO error: -5
bcachefs (/dev/loop0): error reading superblock: IO error: -5
bcachefs (/dev/loop1): error reading default superblock: IO error: -5
bcachefs (/dev/loop1): error reading superblock: IO error: -5
bcachefs (/dev/loop2): error reading default superblock: IO error: -5
bcachefs (/dev/loop2): error reading superblock: IO error: -5
bcachefs (/dev/loop3): error reading default superblock: IO error: -5
bcachefs (/dev/loop3): error reading superblock: IO error: -5
bcachefs (/dev/loop4): error reading default superblock: IO error: -5
bcachefs (/dev/loop4): error reading superblock: IO error: -5
bcachefs (/dev/loop5): error reading default superblock: IO error: -5
bcachefs (/dev/loop5): error reading superblock: IO error: -5
bcachefs (/dev/loop6): error reading default superblock: IO error: -5
bcachefs (/dev/loop6): error reading superblock: IO error: -5
bcachefs (/dev/loop7): error reading default superblock: IO error: -5
bcachefs (/dev/loop7): error reading superblock: IO error: -5
(midi) [~] df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/ram0        42G  316M   41G   1% /mnt
(midi) [~] find /mnt
/mnt
/mnt/lost+found

Cheers, Thomas

sithglan commented 5 months ago

On Sat, Apr 13, 2024 at 08:36:04AM +0200, Thomas Glanzmann wrote:

This time when the script aborted, some files were left lying around (they should not have been, since fio is instructed to delete them). However, even after I deleted them manually, the space did not free up.

Hey thanks for uncovering this.

It's going to be a bit before I can dig in on this one; I'm in the middle of a bunch of low-level btree iterator work. But if you could turn this into a pass/fail ktest test (it looks like you're basically there), that would be the perfect way to turn this into a todo list item.

Cheers, Kent
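
For readers following along, the shape of such a pass/fail check in plain shell might look roughly like the sketch below. It is not Thomas's actual ktest patch (that is linked further down in the thread); it assumes the /dev/ram0 setup from the first comment, and the loop count and free-space threshold (the latter taken from the script above) are arbitrary.

#!/bin/bash
# Rough sketch of a pass/fail reproducer, not the actual ktest patch.
# Assumes the brd ramdisk setup from the first comment; loop count and the
# free-space threshold (borrowed from the script above) are assumptions.

bcachefs format /dev/ram0 || exit 1
mount /dev/ram0 /mnt || exit 1

for i in $(seq 20); do
    fio --ioengine=libaio --filesize=1G --runtime=1m --numjobs=16 \
        --direct=1 --group_reporting --directory=/mnt \
        --name=write --blocksize=1m --iodepth=64 \
        --readwrite=write --unlink=1
done

# With --unlink=1 every file is removed again, so almost all space should be free:
avail=$(df -k /mnt | tail -1 | awk '{print $4}')
umount /mnt

if [ "$avail" -lt 21553152 ]; then
    echo "FAIL: deleted space was not released (only ${avail} KiB available)"
    exit 1
fi
echo PASS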


sithglan commented 5 months ago

Hello Kent,

find attached a patch. Note: I'm currently unable to reproduce the issue with the default ktest BCACHEFS kernel configuration options, so I had to override them. I'll try to identify the kernel option that lets me reproduce the issue easily and let you know as soon as I find it.

Cheers, Thomas

sithglan commented 5 months ago

Hello Kent, I can only reproduce the issue with CONFIG_BCACHEFS_NO_LATENCY_ACCT=y set. Find another patch attached to document that.

Cheers, Thomas
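
For reference, one way to check whether a running kernel was built with that option, and to switch it on in a source tree before rebuilding, is sketched below; this is standard kernel tooling, and the /boot config path assumes a distro-style layout.

# Check the running kernel (path assumes a distro-style /boot layout):
grep BCACHEFS_NO_LATENCY_ACCT /boot/config-$(uname -r)
# Enable it in a kernel source tree before rebuilding:
./scripts/config --enable BCACHEFS_NO_LATENCY_ACCT
make olddefconfig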

sithglan commented 5 months ago

Here are the patches for GitHub visibility:
https://upload.glanzmann.de/0001-Reproducer-for-https-github.com-koverstreet-bcachefs.patch
https://upload.glanzmann.de/0001-Document-the-bcache-kernel-configuration-options-to-.patch

sithglan commented 5 months ago

On Sun, Apr 14, 2024 at 12:09:14PM +0200, Thomas Glanzmann wrote:

Hello Kent, I can only reproduce the issue with CONFIG_BCACHEFS_NO_LATENCY_ACCT=y set. Find another patch attached to document that.

How long does it take to repro/what's the expected test output? I'm not seeing any failures yet.

sithglan commented 5 months ago

On Sun, Apr 14, 2024 at 12:09:14PM +0200, Thomas Glanzmann wrote:

Hello Kent, I can only reproduce the issue with CONFIG_BCACHEFS_NO_LATENCY_ACCT=y set. Find another patch attached to document that.

While the test was running, did you check bcachefs fs usage output? Did online reserved look off?
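
For anyone wanting to capture that while the loop runs, something along these lines could be logged alongside df; a sketch only, and the grep pattern for the reserved counters as well as the log path are assumptions.

# Periodically record bcachefs accounting next to df output:
while sleep 30; do
    date
    df -h /mnt
    bcachefs fs usage -h /mnt | grep -i reserved
done >> /root/bcachefs-usage.log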

sithglan commented 5 months ago

Hello Kent,

How long does it take to repro/what's the expected test output? I'm not seeing any failures yet.

less than an hour, often less than 20 minutes (approx. 22 fio calls). I'll now run it and send you the full output.

Cheers, Thomas

sithglan commented 5 months ago

Hello Kent,

While the test was running, did you check bcachefs fs usage output? Did online reserved look off?

no, I did not. Just 'df -h'. But I'll now adapt the ktest, run it and send you the output. One thing I did notice that was strange:

I now run 'df -h' after each fio. At some point 'df -h' says all space is used, even though the filesystem contains no files. But then another fio runs and can write 16 GB of data even though it should not be able to. After that fio run, 'df -h' says the filesystem is still full. The next fio can then write about 1-3 GB before I/O errors happen due to no space being available. I'll send you a log shortly.

Cheers, Thomas

sithglan commented 5 months ago

Hello Kent, as discussed on IRC, with v6.9-rc4 it is now harder for me to hit it. I started three test runs:

Cheers, Thomas

sithglan commented 5 months ago

Small summary of the issue: the error is getting harder to hit; it now sometimes takes several hours. According to Kent, the issue is that both data and reserved underflow. This will be fixed in a future rewrite. The reproducer was not added to ktest due to its long runtime. Here are the two reproducers that I have (one for ktest and one for a physical machine):

https://tg.st/u/fio-stress.ktest
https://tg.st/u/fio-stress.sh
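
For a physical machine, the second reproducer can simply be fetched and run; like the script earlier in this thread, it presumably expects a bcachefs filesystem already mounted at /mnt, and the log filename below is just an example.

wget https://tg.st/u/fio-stress.sh
chmod +x fio-stress.sh
./fio-stress.sh 2>&1 | tee fio-stress.log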