pkoutoupis / rapiddisk

Advanced Linux RAM drive and caching kernel modules. Dynamically allocate RAM as block devices. Use them as standalone drives or map them as caching nodes to slower local disk drives. Access those volumes locally or export them across an NVMe Target network. Manage it all from a web API.
http://www.rapiddisk.org
GNU General Public License v2.0

v9.1 seems to destroy btrfs #178

Closed tobwen closed 1 year ago

tobwen commented 1 year ago

summary

After unmounting and removing the rapiddisk cache, the btrfs filesystem isn't detected anymore. Maybe I'm doing it wrong?

steps

# create btrfs
mkfs.btrfs -L data -m raid1 -d raid1 \
/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0436870-part1 \
/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0532226-part1

# the root volume has this UUID
UUID="7a8f8635-b9ce-4b00-b1ea-f3e4ca1db5fa"

# load rapiddisk modules
modprobe -q rapiddisk
modprobe -q rapiddisk-cache

# create ramdisk
rapiddisk -a 20000

# create cache with writethrough
rapiddisk -m rd0 -b /dev/disk/by-uuid/"$UUID" -p wt
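# (the mapping is named after the cache mode and the backing device,
#  hence /dev/mapper/rc-wt_"$UUID" below)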

# mount cache
## mount /dev/disk/by-id/dm-name-rc-wt_"$UUID" /mnt/test
mount /dev/dm-0 /mnt/test

# remove cache
rapiddisk -u rc-wt_"$UUID"

# remove ramdisk
rapiddisk -d rd0

issue

When trying to mount the btrfs filesystem the normal way, it isn't found anymore.

mount -t btrfs UUID="7a8f8635-b9ce-4b00-b1ea-f3e4ca1db5fa" /mnt/btrfs
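
For the record, a few checks worth capturing the next time this happens (suggestions only; I have not run these yet):

# is the by-uuid symlink still there, and does blkid still say btrfs?
ls -l /dev/disk/by-uuid/"$UUID"
sudo blkid | grep btrfs

# does btrfs itself still see both members?
sudo btrfs filesystem show

# any chunk tree / open_ctree errors?
sudo dmesg | tail -n 20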

versions

matteotenca commented 1 year ago

Hello,

I reproduced your scenario under Ubuntu Jammy, kernel 6.1.0-060100-generic. It seems that installing the current main branch version solves the issue on my test rig, but @pkoutoupis is needed here, since I can't tell you why.

If you can, give the current master branch a try.

Regards

pkoutoupis commented 1 year ago

That is an interesting issue because no real kernel module changes have been made since the release of 9.1.0, and the couple of commits that were pushed only affect 5.14/5.15 kernels and RHEL builds.

# mount cache
## mount /dev/disk/by-id/dm-name-rc-wt_"$UUID" /mnt/test
mount /dev/dm-0 /mnt/test

# remove cache
rapiddisk -u rc-wt_"$UUID"

# remove ramdisk
rapiddisk -d rd0

Maybe you forgot to include a step, but why is the cache mapping removed prior to a umount?

Also, is this failure consistent with another secondary device?

tobwen commented 1 year ago

Maybe you forgot to include a step, but why is the cache mapping removed prior to a umount?

I just forgot to copy & paste this step.

Also, is this failure consistent with another secondary device?

If I've understood you correctly, I could try it with disk images... so it's reproducible automatically.

matteotenca commented 1 year ago

Hello,

I wrote a script which performs all the operations, from the creation of the array to the final mount attempt, which fails. I found out that:

  1. The problem doesn't happen all the time.
  2. Every time the final mount operation (e.g. mount -t btrfs UUID="550c28a4-45bf-4b1d-9e68-a45486d674b6" /mnt/btrfs) fails, these messages are logged in kern.log (note the reference to the "missing" UUID):
    kernel: [ 1268.741150] BTRFS info (device sdc1): using crc32c (crc32c-intel) checksum algorithm
    kernel: [ 1268.741159] BTRFS info (device sdc1): using free space tree
    kernel: [ 1268.742803] BTRFS error (device sdc1): devid 1 uuid cea32139-09ed-4e19-bc4b-a55ba5e6b106 is missing
    kernel: [ 1268.742834] BTRFS error (device sdc1): failed to read chunk tree: -2
    kernel: [ 1268.743330] BTRFS error (device sdc1): open_ctree failed
  3. Even if the mount operation using the UUID fails, a mount operation using one of the array members works. I made my tests using two disks, /dev/sdb1 and /dev/sdc1. When mounting the RAID via UUID fails, mounting it via mount -t btrfs /dev/sdb1 /mnt/btrfs or mount -t btrfs /dev/sdc1 /mnt/btrfs works, every time. I don't know why. The filesystem is ok and the files on it are sane, checked with md5sum.
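
That "devid ... is missing" error usually means the kernel simply hasn't re-registered every member device after the device-mapper node went away, so one thing that might be worth trying before the UUID mount (I have not verified this in these runs) is:

# re-register all btrfs member devices with the kernel, then retry:
sudo btrfs device scan
sudo mount -t btrfs UUID="550c28a4-45bf-4b1d-9e68-a45486d674b6" /mnt/btrfs

# or name the members explicitly at mount time:
sudo mount -t btrfs -o device=/dev/sdb1,device=/dev/sdc1 /dev/sdb1 /mnt/btrfs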

Just FYI!

Regards

Augusto7743 commented 1 year ago

I see a similar problem here using 9.1 on kernel 6.2.12. I have configured the rapiddisk boot script to create 3 write-back caches for 3 btrfs volumes: sda3 (root), sda4 (home), and sda5 (opt). When booting, the OS randomly fails to mount the /opt partition through the wb cache, and I need to reset to boot correctly. Also, randomly, after an OS shutdown with a correct wb flush of all rapiddisk caches, the /opt btrfs partition can't be mounted on the next boot, with a message saying the filesystem is damaged. The solution is to remove the /opt rapiddisk cache and run one of the commands

sudo mount -o ro,usebackuproot /dev/sda5 /mnt/ or sudo btrfs rescue zero-log /dev/sda5

After unmounting the sda5 partition and an OS reset, the sda5 /opt partition mounts and works correctly with the rapiddisk wb cache again. This problem is recent and never happened before; possibly it is related to new code in kernel 6.x, or to a new configuration when mounting disk partitions? However, if boot is delayed at the exact moment all disk partitions are being mounted, the problem does not happen.

I have noticed that btrfs volumes created with mixed data and metadata suffer far less damage than volumes created with the default mkfs command, which keeps metadata in a separate area of the disk from the data. Mixed mode also avoids the terrible fatal error "parent transid verify failed".
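
For example, this is what I mean by mixed mode (hypothetical device; -M/--mixed puts data and metadata in the same block groups):

# format with mixed data/metadata block groups
mkfs.btrfs -M -L opt /dev/sda5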

pkoutoupis commented 1 year ago

Just out of curiosity, if you capture the blkid output before and after the failure, are they consistent (i.e. the UUID)?

Also when you run:

# mount cache
## mount /dev/disk/by-id/dm-name-rc-wt_"$UUID" /mnt/test
mount /dev/dm-0 /mnt/test

and df -t btrfs, is the volume recognized as a btrfs volume?
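
Something along these lines would do it (teardown steps elided):

# capture the UUIDs before the teardown...
sudo blkid | sort > /tmp/blkid.before
# ...unmount, unmap the cache, detach the RAM drive...
# then capture them again and compare:
sudo blkid | sort > /tmp/blkid.after
diff /tmp/blkid.before /tmp/blkid.after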

pkoutoupis commented 1 year ago

@tobwen and @matteotenca:

OK. So, I attempted to reproduce this on Ubuntu server 23.04.1 (which is what I have installed) and unfortunately (OR fortunately), I am unable to reproduce it.

$ uname -a
Linux ubu2310 6.2.0-34-generic #34-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep  4 13:06:55 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
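
(The loopback devices were backed by image files, roughly like this; the exact losetup commands are not in my transcript, so treat this as a sketch:)

# two 1 GiB image files attached to the first free loop devices
truncate -s 1G /tmp/disk0.img /tmp/disk1.img
sudo losetup -f --show /tmp/disk0.img   # -> /dev/loop3 here
sudo losetup -f --show /tmp/disk1.img   # -> /dev/loop4 here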

I created 2 loopback devices and formatted them for btrfs:

$ sudo mkfs.btrfs -L data -m raid1 -d raid1 -m raid1 -d raid1 /dev/loop3 /dev/loop4
btrfs-progs v6.2
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM /dev/loop4 (1.00GiB) ...
Performing full device TRIM /dev/loop3 (1.00GiB) ...
NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Label:              data
UUID:               f8cb4fed-7831-4637-9b28-f3fdbeae5b13
Node size:          16384
Sector size:        4096
Filesystem size:    2.00GiB
Block group profiles:
  Data:             RAID1           102.38MiB
  Metadata:         RAID1           102.38MiB
  System:           RAID1             8.00MiB
SSD detected:       no
Zoned device:       no
Incompat features:  extref, skinny-metadata, no-holes
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  2
Devices:
   ID        SIZE  PATH
    1     1.00GiB  /dev/loop3
    2     1.00GiB  /dev/loop4

Verified:

$ ls /dev/disk/by-uuid/f8cb4fed-7831-4637-9b28-f3fdbeae5b13
/dev/disk/by-uuid/f8cb4fed-7831-4637-9b28-f3fdbeae5b13
$ sudo blkid /dev/loop3
/dev/loop3: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="343a6972-7195-4d9c-8ba3-cda792a7e683" BLOCK_SIZE="4096" TYPE="btrfs"
$ sudo blkid /dev/loop4
/dev/loop4: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="594cee0c-9fb1-404c-8a5b-f8fa79a7905e" BLOCK_SIZE="4096" TYPE="btrfs"

Mounted and verified:

$ sudo mount /dev/disk/by-uuid/f8cb4fed-7831-4637-9b28-f3fdbeae5b13 /mnt/
$ df -t btrfs
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/loop3       1048576  3728    934528   1% /mnt

Syslog:

2023-10-13T15:26:44.565626+00:00 ubu2304 kernel: [ 1051.104689] loop3: detected capacity change from 0 to 2097152
2023-10-13T15:26:49.389629+00:00 ubu2304 kernel: [ 1055.929343] loop4: detected capacity change from 0 to 2097152
2023-10-13T15:28:00.449578+00:00 ubu2304 kernel: [ 1126.986138] BTRFS: device label data devid 1 transid 6 /dev/loop3 scanned by mkfs.btrfs (2472)
2023-10-13T15:28:00.449590+00:00 ubu2304 kernel: [ 1126.986278] BTRFS: device label data devid 2 transid 6 /dev/loop4 scanned by mkfs.btrfs (2472)
2023-10-13T15:31:26.269706+00:00 ubu2304 kernel: [ 1332.803578] BTRFS info (device loop3): using crc32c (crc32c-intel) checksum algorithm
2023-10-13T15:31:26.269750+00:00 ubu2304 kernel: [ 1332.803586] BTRFS info (device loop3): using free space tree
2023-10-13T15:31:26.273615+00:00 ubu2304 kernel: [ 1332.807624] BTRFS info (device loop3): auto enabling async discard
2023-10-13T15:31:26.273630+00:00 ubu2304 kernel: [ 1332.807942] BTRFS info (device loop3): checking UUID tree

When I umount, the UUIDs are consistent (checked with blkid). Anyway, I created a rapiddisk RAM drive and mapped it to one of the loopback volumes:
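
(Per the steps in the original report, roughly; the exact attach/map commands are not in my capture:)

sudo ./rapiddisk -a 1024
sudo ./rapiddisk -m rd0 -b /dev/disk/by-uuid/f8cb4fed-7831-4637-9b28-f3fdbeae5b13 -p wt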

$ sudo ./rapiddisk -l
rapiddisk 9.1.0
Copyright 2011 - 2023 Petros Koutoupis

List of RapidDisk device(s):

 RapidDisk Device 1: rd0        Size (KB): 1048576      Usage (KB): 1044        Status: Unlocked

List of RapidDisk-Cache mapping(s):

 RapidDisk-Cache Target 1: rc-wt_f8cb4fed-7831-4637-9b28-f3fdbeae5b13   Cache: rd0  Target: loop3 (WRITE THROUGH)

And:

$ ls -l /dev/mapper/
total 0
crw------- 1 root root 10, 236 Oct 13 15:09 control
lrwxrwxrwx 1 root root       7 Oct 13 15:34 rc-wt_f8cb4fed-7831-4637-9b28-f3fdbeae5b13 -> ../dm-1
lrwxrwxrwx 1 root root       7 Oct 13 15:09 ubuntu--vg-ubuntu--lv -> ../dm-0

UUID is the same after mapping and mounting:

$ sudo mount /dev/dm-1 /mnt/
$ df -t btrfs
Filesystem     1K-blocks  Used Available Use% Mounted on
/dev/dm-1        1048576  3728    934528   1% /mnt
$ blkid
/dev/mapper/ubuntu--vg-ubuntu--lv: UUID="c1d274d6-a690-49cc-b66d-537e1fda1739" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sda2: UUID="f658e69d-0ee1-4760-876a-60bc93f31256" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="c7303a2e-837c-451e-a972-e48ee9609cef"
/dev/sda3: UUID="nBo1gb-s10F-Epi0-HznV-OIuD-rGjQ-RAt43O" TYPE="LVM2_member" PARTUUID="e44d52dc-8756-4d5e-998d-765ca7209a84"
/dev/loop3: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="343a6972-7195-4d9c-8ba3-cda792a7e683" BLOCK_SIZE="4096" TYPE="btrfs"
/dev/loop4: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="594cee0c-9fb1-404c-8a5b-f8fa79a7905e" BLOCK_SIZE="4096" TYPE="btrfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop0: TYPE="squashfs"

Syslog still looks normal and good. I even wrote some I/O to the mountpoint:

$ sudo dd if=/dev/random of=/mnt/test.dat bs=1M count=32
32+0 records in
32+0 records out
33554432 bytes (34 MB, 32 MiB) copied, 0.0984983 s, 341 MB/s
$ sudo touch /mnt/hello.txt
$ ls /mnt/
hello.txt  test.dat

And then unmounted, unmapped and removed the RAM drive. UUIDs still look good:

$ blkid
/dev/mapper/ubuntu--vg-ubuntu--lv: UUID="c1d274d6-a690-49cc-b66d-537e1fda1739" BLOCK_SIZE="4096" TYPE="ext4"
/dev/sda2: UUID="f658e69d-0ee1-4760-876a-60bc93f31256" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="c7303a2e-837c-451e-a972-e48ee9609cef"
/dev/sda3: UUID="nBo1gb-s10F-Epi0-HznV-OIuD-rGjQ-RAt43O" TYPE="LVM2_member" PARTUUID="e44d52dc-8756-4d5e-998d-765ca7209a84"
/dev/loop3: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="343a6972-7195-4d9c-8ba3-cda792a7e683" BLOCK_SIZE="4096" TYPE="btrfs"
/dev/loop4: LABEL="data" UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" UUID_SUB="594cee0c-9fb1-404c-8a5b-f8fa79a7905e" BLOCK_SIZE="4096" TYPE="btrfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop0: TYPE="squashfs"

Then I remounted the filesystem on the normal loopback device, by its UUID, in two different ways:

$ sudo mount /dev/disk/by-uuid/f8cb4fed-7831-4637-9b28-f3fdbeae5b13 /mnt/
<unmounted and remounted>
$ sudo mount -t btrfs UUID="f8cb4fed-7831-4637-9b28-f3fdbeae5b13" /mnt/

Again, no issues.

Note that I am using btrfs-progs version 6.2-1:

$ sudo aptitude show btrfs-progs
Package: btrfs-progs
Version: 6.2-1

After some digging, I did find posts complaining about missing sub-UUID errors that existed in btrfs-progs versions 6.0 and early 6.1. This Bugzilla link shows that one related bug was addressed by Red Hat in version 6.1.2: https://bugzilla.redhat.com/show_bug.cgi?id=2156710.

Anyway, which version are you running? And if you rerun the same experiment on a newer version, does it still occur?

Note - I also believe that @Augusto7743's issue is not exactly related to this one.

matteotenca commented 1 year ago

Hi @pkoutoupis and @tobwen ,

I ended up writing a bash script (attached) which tries to reproduce the issue. The script does the following:

1) creates a RapidDisk ramdisk
2) creates a BTRFS RAID1 volume spanning two devices, gathering its UUID
3) creates a rapiddisk mapping to that volume by UUID, mode write-through
4) mounts the mapping
5) creates a file on it using dd from /dev/urandom
6) creates an md5 hash of that file and saves it in /tmp
7) unmounts the mapping
8) deletes the ramdisk
9) tries to remount the array using the UUID gathered at step 2
10) if 9) succeeds, checks the sanity of the file
11) if 9) fails, tries to remount the array using the name of the first of the two devices in the array
12) if 11) succeeds, checks the sanity of the file
13) if 11) fails, tries to remount the array using the name of the second of the two devices in the array
14) if 13) succeeds, checks the sanity of the file

The script requires some arguments. It performs the desired number of runs; each run performs steps 1) through 14), as sketched below.
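A minimal sketch of one run (my outline of the attached script, not the script itself; the sizes and the rd0 name are assumptions):

#!/bin/bash
set -e
DEV1=$1; DEV2=$2; MNT=$3

rapiddisk -a 1000                                         # 1) RAM disk, 1000 MB
mkfs.btrfs -f -L data -m raid1 -d raid1 "$DEV1" "$DEV2"   # 2) RAID1 volume
UUID=$(blkid -s UUID -o value "$DEV1")                    #    gather its UUID
rapiddisk -m rd0 -b /dev/disk/by-uuid/"$UUID" -p wt       # 3) write-through map
mount /dev/mapper/rc-wt_"$UUID" "$MNT"                    # 4) mount the mapping
dd if=/dev/urandom of="$MNT"/test.dat bs=1M count=100     # 5) test file
md5sum "$MNT"/test.dat > /tmp/test.md5                    # 6) save its hash
umount "$MNT"                                             # 7) unmount
rapiddisk -u rc-wt_"$UUID"                                # 8) remove mapping...
rapiddisk -d rd0                                          #    ...and RAM disk

# 9)-14) remount by UUID, falling back to each member device in turn
if mount -t btrfs UUID="$UUID" "$MNT" \
   || mount -t btrfs "$DEV1" "$MNT" \
   || mount -t btrfs "$DEV2" "$MNT"; then
    md5sum -c /tmp/test.md5                               # check file sanity
    umount "$MNT"
fi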

What I discovered is that, no matter the kernel version, across many runs the remount of the array sometimes succeeds by its UUID, sometimes only by the name of the first device it is made of, and sometimes only by the name of the second. In the end it always remounts, and the file is always ok.

I do believe this is something related to BTRFS.

The test machine is a VirtualBox VM on a Windows host, running Ubuntu server 22.04.3 LTS.

Script usage:

shub@ubuserver:~$ sudo ./test_btrfs.sh
Rapiddisk BTRFS RAID test script
Usage: ./test_btrfs.sh [-r|--ramdisk <STR>] [-s|--size <INT>] [--(no-)quiet-rapid] [--(no-)quiet-btrfs] [--(no-)quiet-messages] [--(no-)skip-sleep] [-h|--help] [-v|--version] <runs> <device-one> <device-two> <mount-point>
        <runs>: the number of runs
        <device-one>: device 1, es /dev/sda1
        <device-two>: device 2, es /dev/sdb1
        <mount-point>: mount point that will be used
        -r, --ramdisk: the rapiddisk ramdisk name (default: 'rd0')
        -s, --size: the ramdisk size in MB (default: '100')
        --quiet-rapid, --no-quiet-rapid: silent rapiddisk commands (off by default)
        --quiet-btrfs, --no-quiet-btrfs: silent btrfs commands (off by default)
        --quiet-messages, --no-quiet-messages: silent all progress informations (off by default)
        --skip-sleep, --no-skip-sleep: sleep for 0.1 sec only vs 2 secs (off by default)
        -h, --help: Prints help
        -v, --version: Prints version
FATAL ERROR: Not enough positional arguments - we require exactly 4 (namely: 'runs', 'device-one', 'device-two' and 'mount-point'), but got only 0.
shub@ubuserver:~$

Examples:

shub@ubuserver:~$ btrfs --version
btrfs-progs v5.16.2
shub@ubuserver:~$ uname -a
Linux ubuserver 6.1.0-060100-generic #202303090726 SMP PREEMPT_DYNAMIC Thu Mar  9 12:33:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
shub@ubuserver:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
shub@ubuserver:~$

Single run, verbose, devices /dev/sdb1 and /dev/sdc1, 1000 MB ramdisk, expected ramdisk name rd0, mount dir /mnt/test:

shub@ubuserver:~$ sudo ./test_btrfs.sh -r rd0 -s 1000 1 /dev/sdb1 /dev/sdc1 /mnt/test/
******************* Starting run 1 ************************
Creating ramdisk of 1000 MB...
rapiddisk 9.1.0
Copyright 2011 - 2023 Petros Koutoupis

Attached device rd0 of size 1000 Mbytes.
Issuing command mkfs.btrfs -f -L data -m raid1 -d raid1 /dev/sdb1 /dev/sdc1...
btrfs-progs v5.16.2
See http://btrfs.wiki.kernel.org for more information.

NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Label:              data
UUID:               f2386eea-57f8-4ff2-94ed-adcec1b1c52a
Node size:          16384
Sector size:        4096
Filesystem size:    2.21GiB
Block group profiles:
  Data:             RAID1           113.12MiB
  Metadata:         RAID1           113.12MiB
  System:           RAID1             8.00MiB
SSD detected:       yes
Zoned device:       no
Incompat features:  extref, skinny-metadata, no-holes
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  2
Devices:
   ID        SIZE  PATH
    1     1.10GiB  /dev/sdb1
    2     1.10GiB  /dev/sdc1
Sleeping 3 secs...
Creating mapping: rapiddisk -m rd0 -b /dev/disk/by-uuid/f2386eea-57f8-4ff2-94ed-adcec1b1c52a -p wt
rapiddisk 9.1.0
Copyright 2011 - 2023 Petros Koutoupis

Command to map rc-wt_f2386eea-57f8-4ff2-94ed-adcec1b1c52a with rd0 and /dev/disk/by-uuid/f2386eea-57f8-4ff2-94ed-adcec1b1c52a has been sent.
Mounting mapping /dev/mapper/rc-wt_f2386eea-57f8-4ff2-94ed-adcec1b1c52a onto /mnt/test/...
Creating testfile...
300+0 records in
300+0 records out
314572800 bytes (315 MB, 300 MiB) copied, 1.61921 s, 194 MB/s
Creating checksum file...
Unmounting...
Sleeping 3 secs
Removing mapping and ramdisk...
rapiddisk 9.1.0
Copyright 2011 - 2023 Petros Koutoupis

Command to unmap rc-wt_f2386eea-57f8-4ff2-94ed-adcec1b1c52a has been sent.
rapiddisk 9.1.0
Copyright 2011 - 2023 Petros Koutoupis

Detached device rd0.
Sleeping 3 secs
Trying to remount RAID...
FAILED to mount via UUID
Mounted via /dev/sdb1 successfull!
Testing MD5 checksum...
Unmounting...

UUID remounts (success/failed/total runs):              0/1/1
/dev/sdb1 remounts (success/failed/total runs):         1/0/1
/dev/sdc1 remounts (success/failed/total runs):         0/0/1
Total success: 1/1
shub@ubuserver:~$

Five runs, quiet, devices /dev/sdb1 and /dev/sdc1, 1000 MB ramdisk, expected ramdisk name rd0, mount dir /mnt/test, fast mode (sleep times cut from 3 to 0.1 secs):

shub@ubuserver:~$ sudo ./test_btrfs.sh -r rd0 -s 1000 --quiet-rapid --quiet-btrfs --quiet-messages --skip-sleep 5 /dev/sdb1 /dev/sdc1 /mnt/test/
******************* Starting run 1 ************************
******************* Starting run 2 ************************
******************* Starting run 3 ************************
******************* Starting run 4 ************************
******************* Starting run 5 ************************

UUID remounts (success/failed/total runs):              4/1/5
/dev/sdb1 remounts (success/failed/total runs):         1/0/5
/dev/sdc1 remounts (success/failed/total runs):         0/0/5
Total success: 5/5
shub@ubuserver:~$

test_btrfs.sh.gz

pkoutoupis commented 1 year ago

My suspicion is btrfs-progs. I will run your script on Monday, but if you have the time, see if you can reproduce it on 23.04.1. It ships a newer version of btrfs-progs that fixes known missing-UUID errors.

Thanks!!!

pkoutoupis commented 1 year ago

@tobwen and @matteotenca:

Running @matteotenca's script on Ubuntu 23.04.1 with the newer btrfs-progs (and kernel/driver):

petros@ubu2310:~/misc$ sudo ./test_btrfs.sh -r rd0 -s 900 --quiet-rapid --quiet-btrfs --quiet-messages 30 /dev/loop3 /dev/loop4 /mnt/
******************* Starting run 1 ************************
******************* Starting run 2 ************************
******************* Starting run 3 ************************
******************* Starting run 4 ************************
******************* Starting run 5 ************************
******************* Starting run 6 ************************
******************* Starting run 7 ************************
******************* Starting run 8 ************************
******************* Starting run 9 ************************
******************* Starting run 10 ************************
******************* Starting run 11 ************************
******************* Starting run 12 ************************
******************* Starting run 13 ************************
******************* Starting run 14 ************************
******************* Starting run 15 ************************
******************* Starting run 16 ************************
******************* Starting run 17 ************************
******************* Starting run 18 ************************
******************* Starting run 19 ************************
******************* Starting run 20 ************************
******************* Starting run 21 ************************
******************* Starting run 22 ************************
******************* Starting run 23 ************************
******************* Starting run 24 ************************
******************* Starting run 25 ************************
******************* Starting run 26 ************************
******************* Starting run 27 ************************
******************* Starting run 28 ************************
******************* Starting run 29 ************************
******************* Starting run 30 ************************

UUID remounts (success/failed/total runs):      30/0/30
/dev/loop3 remounts (success/failed/total runs):        0/0/30
/dev/loop4 remounts (success/failed/total runs):        0/0/30
Total success: 30/30

I ran this over and over again and have yet to reproduce the issue.

EDIT: I will rerun on 22.04.2.

pkoutoupis commented 1 year ago

Here are my results from 22.04.2 using the exact same version of RapidDisk:

petros@ubu2204:~/test$ sudo ./test_btrfs.sh -r rd0 -s 900 --quiet-rapid --quiet-btrfs --quiet-messages 30 /dev/loop3 /dev/loop4 /mnt/
******************* Starting run 1 ************************
******************* Starting run 2 ************************
******************* Starting run 3 ************************
******************* Starting run 4 ************************
******************* Starting run 5 ************************
******************* Starting run 6 ************************
******************* Starting run 7 ************************
******************* Starting run 8 ************************
******************* Starting run 9 ************************
******************* Starting run 10 ************************
******************* Starting run 11 ************************
******************* Starting run 12 ************************
******************* Starting run 13 ************************
******************* Starting run 14 ************************
******************* Starting run 15 ************************
******************* Starting run 16 ************************
******************* Starting run 17 ************************
******************* Starting run 18 ************************
******************* Starting run 19 ************************
******************* Starting run 20 ************************
******************* Starting run 21 ************************
******************* Starting run 22 ************************
******************* Starting run 23 ************************
******************* Starting run 24 ************************
******************* Starting run 25 ************************
******************* Starting run 26 ************************
******************* Starting run 27 ************************
******************* Starting run 28 ************************
******************* Starting run 29 ************************
******************* Starting run 30 ************************

UUID remounts (success/failed/total runs):      6/24/30
/dev/loop3 remounts (success/failed/total runs):        24/0/30
/dev/loop4 remounts (success/failed/total runs):        0/0/30

pkoutoupis commented 1 year ago

Hmmm. It may not be btrfs-progs specifically but may also involve the driver or a combination of both. I installed the latest release of btrfs-progs on 22.04.2:

petros@ubu2204:~/test$ /home/petros/btrfs-progs-6.5.2/btrfs --version
btrfs-progs v6.5.2

And modified the script to use it:

$ sudo ./test_updated-btrfs.sh -r rd0 -s 900 --quiet-rapid --quiet-btrfs --quiet-messages 30 /dev/loop3 /dev/loop4 /mnt/
******************* Starting run 1 ************************
******************* Starting run 2 ************************
******************* Starting run 3 ************************
******************* Starting run 4 ************************
******************* Starting run 5 ************************
******************* Starting run 6 ************************
******************* Starting run 7 ************************
******************* Starting run 8 ************************
******************* Starting run 9 ************************
******************* Starting run 10 ************************
******************* Starting run 11 ************************
******************* Starting run 12 ************************
******************* Starting run 13 ************************
******************* Starting run 14 ************************
******************* Starting run 15 ************************
******************* Starting run 16 ************************
******************* Starting run 17 ************************
******************* Starting run 18 ************************
******************* Starting run 19 ************************
******************* Starting run 20 ************************
******************* Starting run 21 ************************
******************* Starting run 22 ************************
******************* Starting run 23 ************************
******************* Starting run 24 ************************
******************* Starting run 25 ************************
******************* Starting run 26 ************************
******************* Starting run 27 ************************
******************* Starting run 28 ************************
******************* Starting run 29 ************************
******************* Starting run 30 ************************

UUID remounts (success/failed/total runs):      1/29/30
/dev/loop3 remounts (success/failed/total runs):        29/0/30
/dev/loop4 remounts (success/failed/total runs):        0/0/30

Same results. Some more details:

[10620.062876] rapiddisk: Detached rd0.
[10623.081991] BTRFS info (device loop4): using crc32c (crc32c-intel) checksum algorithm
[10623.081998] BTRFS info (device loop4): using free space tree
[10623.082000] BTRFS info (device loop4): has skinny extents
[10623.084402] BTRFS error (device loop4): devid 1 uuid 9d8e7eb2-b906-47e3-b71f-481c25bcf3de is missing
[10623.085824] BTRFS error (device loop4): failed to read chunk tree: -2
[10623.105379] BTRFS error (device loop4): open_ctree failed
[10623.121758] BTRFS info (device loop3): using crc32c (crc32c-intel) checksum algorithm
[10623.121765] BTRFS info (device loop3): using free space tree
[10623.121766] BTRFS info (device loop3): has skinny extents
petros@ubu2204:~/test$ sudo blkid /dev/loop3
/dev/loop3: LABEL="data" UUID="a91ac714-853e-4adb-af7c-354af3ed0d6a" UUID_SUB="9d8e7eb2-b906-47e3-b71f-481c25bcf3de" BLOCK_SIZE="4096" TYPE="btrfs"

BUT when I manually mount by the UUID, no problems:

petros@ubu2204:~/test$ sudo mount -t btrfs UUID="a91ac714-853e-4adb-af7c-354af3ed0d6a" /mnt
petros@ubu2204:~/test$

And dmesg shows no errors:

[10837.427957] BTRFS info (device loop3): using crc32c (crc32c-intel) checksum algorithm
[10837.427963] BTRFS info (device loop3): using free space tree
[10837.427964] BTRFS info (device loop3): has skinny extents

I am convinced that this is related to an older version of btrfs, but I am not sure what it is exactly.

matteotenca commented 1 year ago

@pkoutoupis For sure, no data corruption ever happens: if the UUID mount fails, one of the others always succeeds, and the file is always ok.

I wrote a new script; give it a try. It creates a verbose log file in /tmp, lets you tune many variables from the command line, can perform multiple mount/check/unmount cycles per run, and can even exclude rapiddisk from the equation for a pure-btrfs test (sketched below).
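
(The pure-btrfs cycle boils down to this, sketched with the same placeholder devices as in the run below:)

# one rapiddisk-free cycle: mkfs, mount by UUID, unmount, remount by UUID
mkfs.btrfs -f -L data -m raid1 -d raid1 /dev/vdb /dev/vdc
UUID=$(blkid -s UUID -o value /dev/vdb)
mount -t btrfs UUID="$UUID" /mnt/test && umount /mnt/test
mount -t btrfs UUID="$UUID" /mnt/test   # the remount that sometimes fails
umount /mnt/test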

Anyway, under Ubuntu 23.04 Lunar with btrfs-progs v6.2 there are no errors, while under Ubuntu 22.04.3 Jammy with btrfs-progs v5.16.2 there are some.

shub@rap:~$ lsb_release -a && uname -a && btrfs --version
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 23.04
Release:        23.04
Codename:       lunar
Linux rap 6.2.0-34-generic #34-Ubuntu SMP PREEMPT_DYNAMIC Mon Sep  4 13:06:55 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v6.2
shub@rap:~$ sudo ./test_btrfs.sh  -s 1000 -t 10 -w 3 -d -b -q 10 /dev/vdb /dev/vdc /mnt/test
Oct 22 11:37:19 test_btrfs.sh: ============ Starting run 1/10 ============
Oct 22 11:37:32 test_btrfs.sh: ============ Starting run 2/10 ============
Oct 22 11:37:43 test_btrfs.sh: ============ Starting run 3/10 ============
Oct 22 11:37:55 test_btrfs.sh: ============ Starting run 4/10 ============
Oct 22 11:38:07 test_btrfs.sh: ============ Starting run 5/10 ============
Oct 22 11:38:19 test_btrfs.sh: ============ Starting run 6/10 ============
Oct 22 11:38:30 test_btrfs.sh: ============ Starting run 7/10 ============
Oct 22 11:38:42 test_btrfs.sh: ============ Starting run 8/10 ============
Oct 22 11:38:54 test_btrfs.sh: ============ Starting run 9/10 ============
Oct 22 11:39:05 test_btrfs.sh: ============ Starting run 10/10 ============
Oct 22 11:39:17 test_btrfs.sh:
Oct 22 11:39:17 test_btrfs.sh: UUID remounts (success/failed/total runs):               10/0/10
Oct 22 11:39:17 test_btrfs.sh: /dev/vdb remounts (success/failed/total runs):           0/0/10
Oct 22 11:39:17 test_btrfs.sh: /dev/vdc remounts (success/failed/total runs):           0/0/10
Oct 22 11:39:17 test_btrfs.sh: Total successful runs/total runs: 10/10
shub@rap:~$
shub@ubuserver:~$ lsb_release -a && uname -a && btrfs --version
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy
Linux ubuserver 6.1.0-060100-generic #202303090726 SMP PREEMPT_DYNAMIC Thu Mar  9 12:33:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
btrfs-progs v5.16.2
shub@ubuserver:~$ sudo ./test_btrfs.sh -s 1000 -t 10 -w 3 -d -b -q 10 /dev/sdb1 /dev/sdc1 mnt/test                               
Oct 22 11:56:09 test_btrfs.sh: ============ Starting run 1/10 ============
Oct 22 11:56:19 test_btrfs.sh: ============ Starting run 2/10 ============
Oct 22 11:56:29 test_btrfs.sh: ============ Starting run 3/10 ============
Oct 22 11:56:40 test_btrfs.sh: ============ Starting run 4/10 ============
Oct 22 11:56:50 test_btrfs.sh: ============ Starting run 5/10 ============
Oct 22 11:57:01 test_btrfs.sh: ============ Starting run 6/10 ============
Oct 22 11:57:11 test_btrfs.sh: ============ Starting run 7/10 ============
Oct 22 11:57:21 test_btrfs.sh: ============ Starting run 8/10 ============
Oct 22 11:57:32 test_btrfs.sh: ============ Starting run 9/10 ============
Oct 22 11:57:42 test_btrfs.sh: ============ Starting run 10/10 ============
Oct 22 11:57:52 test_btrfs.sh:
Oct 22 11:57:52 test_btrfs.sh: UUID remounts (success/failed/total runs):               5/5/10
Oct 22 11:57:52 test_btrfs.sh: /dev/sdb1 remounts (success/failed/total runs):          1/4/10
Oct 22 11:57:52 test_btrfs.sh: /dev/sdc1 remounts (success/failed/total runs):          4/0/10
Oct 22 11:57:52 test_btrfs.sh: Total successful runs/total runs: 10/10
shub@ubuserver:~$

kern.log:

Oct 22 11:57:32 ubuserver kernel: [ 3395.339015] BTRFS info (device sdb1): using crc32c (crc32c-intel) checksum algorithm
Oct 22 11:57:32 ubuserver kernel: [ 3395.339024] BTRFS info (device sdb1): using free space tree
Oct 22 11:57:32 ubuserver kernel: [ 3395.340833] BTRFS error (device sdb1): devid 2 uuid ed697ca9-58fe-459d-83be-7696ac3f1f3f is missing
Oct 22 11:57:32 ubuserver kernel: [ 3395.341465] BTRFS error (device sdb1): failed to read chunk tree: -2
Oct 22 11:57:32 ubuserver kernel: [ 3395.342245] BTRFS error (device sdb1): open_ctree failed
Oct 22 11:57:32 ubuserver kernel: [ 3395.351002] BTRFS info (device sdb1): using crc32c (crc32c-intel) checksum algorithm
Oct 22 11:57:32 ubuserver kernel: [ 3395.351008] BTRFS info (device sdb1): using free space tree
Oct 22 11:57:32 ubuserver kernel: [ 3395.352683] BTRFS error (device sdb1): devid 2 uuid ed697ca9-58fe-459d-83be-7696ac3f1f3f is missing
Oct 22 11:57:32 ubuserver kernel: [ 3395.353327] BTRFS error (device sdb1): failed to read chunk tree: -2
Oct 22 11:57:32 ubuserver kernel: [ 3395.353697] BTRFS error (device sdb1): open_ctree failed
Oct 22 11:57:32 ubuserver kernel: [ 3395.361509] BTRFS info (device sdb1): using crc32c (crc32c-intel) checksum algorithm
Oct 22 11:57:32 ubuserver kernel: [ 3395.361514] BTRFS info (device sdb1): using free space tree
Oct 22 11:57:32 ubuserver kernel: [ 3395.372425] BTRFS info (device sdb1): enabling ssd optimizations

test_btrfs.sh.gz

pkoutoupis commented 1 year ago

Thank you @matteotenca for spending a considerable amount of time troubleshooting this.

Seeing how newer versions of the kernel / btrfs-progs do not exhibit the issue, coupled with the fact that rapiddisk-cache does not alter on-disk formats (it is a pass-through to the underlying volume), I am led to believe that this is a btrfs-specific issue and will therefore close it unless future evidence shows otherwise.
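
(For reference, the pass-through nature is easy to check from device-mapper itself; e.g. sudo dmsetup table rc-wt_"$UUID" shows the target sitting directly over the backing device, with no on-disk format of its own.)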

@tobwen again, thank you for your interest in the project, and if you find anything else, please do not hesitate to open a new ticket.