utmapp / UTM

Virtual machines for iOS and macOS
https://getutm.app
Apache License 2.0

Apple Virtual machine keeps setting itself as read-only #4840

Open Git-North opened 1 year ago

Git-North commented 1 year ago

UTM version: 4.1.2 (Beta). Ubuntu version: 23.04 (Lunar Lobster). Apple Virtualization with Rosetta 2 enabled.

None of the disks are set as "read only" inside UTM. It sometimes works for seconds, sometimes for minutes, but the failure always happens. The errors usually contain something like "error: read-only file system", but it varies from command to command.

AkihiroSuda commented 10 months ago

> Needs this patch.

Thank you. This doesn't seem cherry-picked to Ubuntu 23.10 (mantic) yet: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/mantic/log/?h=master-next

Has anybody already notified Ubuntu (and other distros) to cherry-pick this?

marcan commented 10 months ago

Those I/O errors sound like a problem specific to virtio/the disk stuff in particular. While the atomics stuff could hypothetically cause that kind of issue in the driver, I would expect it to more likely cause in-memory corruption and general filesystem badness, not outright I/O errors.

If you're seeing the same I/O errors with both virtio and nvme, that's even weirder and starts to sound like the problem is on the hypervisor/host side. If the I/O error actually comes from the host, there should be some way to log it in the hypervisor implementation, since it should be visible to it, and then you can eliminate Linux as a problem for that particular issue.

I pinged Ubuntu folks about that patch.

gnattu commented 10 months ago

Can anyone tell me how to reliably reproduce this error? I have found something that seems to be a workaround: it throttles the cache flush operation of the virtio_blk device so that the operation never occurs more than once per second. Using a kernel with this workaround, I've not seen an IO error in almost 500 hours of uptime. However, my VM workload does not trigger this bug very frequently even without this patch, so I don't know whether the patch actually works around the bug or I've just been lucky. The patch looks like this:

--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -31,6 +31,8 @@ static unsigned int num_request_queues;
 #define VIRTIO_BLK_INLINE_SG_CNT   2
 #endif

+#define FLUSH_INTERVAL (msecs_to_jiffies(1000))
+
 static unsigned int num_request_queues;
 module_param(num_request_queues, uint, 0644);
 MODULE_PARM_DESC(num_request_queues,
@@ -84,6 +86,10 @@ struct virtio_blk {

    /* For zoned device */
    unsigned int zone_sectors;
+
+   unsigned long last_flush;
+   struct delayed_work flush_dwork;
+   struct bio flush_bio;
 };

 struct virtblk_req {
@@ -454,6 +460,35 @@ static blk_status_t virtblk_prep_rq(struct blk_mq_hw_c
    blk_mq_start_request(req);

    return BLK_STS_OK;
 }
+
+static bool virtblk_skip_flush(struct virtio_blk *vblk, struct request *req)
+{
+   if (req_op(req) != REQ_OP_FLUSH)
+       return false;
+   if (req->cmd_flags & REQ_DRV)
+       return false;
+   if (delayed_work_pending(&vblk->flush_dwork))
+       return true;
+   if (time_before(jiffies, vblk->last_flush + FLUSH_INTERVAL)) {
+       kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &vblk->flush_dwork, FLUSH_INTERVAL);
+       return true;
+   }
+
+   vblk->last_flush = jiffies;
+   return false;
+}
+
+static void virtblk_flush_work(struct work_struct *work)
+{
+   struct virtio_blk *vblk = container_of(work, struct virtio_blk, flush_dwork.work);
+   struct bio *bio = &vblk->flush_bio;
+
+   bio_init(bio, vblk->disk->part0, NULL, 0,
+        REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_DRV);
+   submit_bio(bio);
+
+   vblk->last_flush = jiffies;
+}

 static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -472,6 +507,11 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_c
    if (unlikely(status))
        return status;

+   if (virtblk_skip_flush(vblk, req)) {
+       blk_mq_complete_request(req);
+       return BLK_STS_OK;
+   }
+
    spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
    err = virtblk_add_req(vblk->vqs[qid].vq, vbr);
    if (err) {
@@ -517,6 +557,12 @@ static bool virtblk_add_req_batch(struct virtio_blk_vq
    while (!rq_list_empty(*rqlist)) {
        struct request *req = rq_list_pop(rqlist);
        struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+       struct virtio_blk *vblk = req->mq_hctx->queue->queuedata;
+
+       if (virtblk_skip_flush(vblk, req)) {
+           blk_mq_complete_request(req);
+           continue;
+       }

        err = virtblk_add_req(vq->vq, vbr);
        if (err) {
@@ -1380,7 +1426,7 @@ static int virtblk_probe(struct virtio_device *vdev)
    vblk->tag_set.ops = &virtio_mq_ops;
    vblk->tag_set.queue_depth = queue_depth;
    vblk->tag_set.numa_node = NUMA_NO_NODE;
-   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED_BY_DEFAULT;
    vblk->tag_set.cmd_size =
        sizeof(struct virtblk_req) +
        sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
@@ -1573,6 +1619,9 @@ static int virtblk_probe(struct virtio_device *vdev)
        else
            q->limits.discard_granularity = blk_size;
    }
+
+   vblk->last_flush = jiffies - FLUSH_INTERVAL;
+   INIT_DELAYED_WORK(&vblk->flush_dwork, virtblk_flush_work);

    virtblk_update_capacity(vblk, false);
    virtio_device_ready(vdev);
@@ -1614,6 +1663,7 @@ static void virtblk_remove(struct virtio_device *vdev)

    /* Make sure no work handler is accessing the device. */
    flush_work(&vblk->config_work);
+   flush_delayed_work(&vblk->flush_dwork);

    del_gendisk(vblk->disk);
    blk_mq_free_tag_set(&vblk->tag_set);

This should be applicable to kernel version 6.3 and later if you want to test this on your own machine and believe you have a reliable way to reproduce the IO error.

wrmack commented 10 months ago

> > Needs this patch.
>
> Thank you. This doesn't seem cherry-picked to Ubuntu 23.10 (mantic) yet: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/mantic/log/?h=master-next
>
> Has anybody already notified Ubuntu (and other distros) to cherry-pick this?

Looks like it is in 6.5.6: https://kernel.ubuntu.com/mainline/v6.5.6/CHANGES

Mark Rutland (1): locking/atomic: scripts: fix fallback ifdeffery

AkihiroSuda commented 10 months ago

> > > Needs this patch.
> >
> > Thank you. This doesn't seem cherry-picked to Ubuntu 23.10 (mantic) yet: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/mantic/log/?h=master-next Has anybody already notified Ubuntu (and other distros) to cherry-pick this?
>
> Looks like it is in 6.5.6: https://kernel.ubuntu.com/mainline/v6.5.6/CHANGES
>
> Mark Rutland (1): locking/atomic: scripts: fix fallback ifdeffery

This is not the default kernel of Ubuntu

marcan commented 10 months ago

Ubuntu bug here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2042573

kurgannet commented 10 months ago

I recently tried installing Debian Sid (Linux kernel 6.5.0.3) to try something new, and this is what I got just after the installation finished (before rebooting):

Screenshot 2023-11-05 at 18 21 25

Is this kernel oops related to the filesystem corruption, or is it a different issue? The corruption issue is the one I've suffered from the start, whatever Linux kernel version I try...

linickx commented 10 months ago

FWIW - Kali Live (arm) with no disks attached seems to run ok.

When installed to disk, Kali would consistently crash after a few minutes, even quicker if a shared folder was configured. Of course I cannot save anything, but I think this supports the aforementioned virtio issues.

tobhe commented 10 months ago

Thanks for the report!

A fixed Ubuntu mantic kernel is available for testing as 6.5.0-13.13 in https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/proposed2/+packages

We would really appreciate if you could help us test and see if the problem is resolved with the new version. Once we are sure the problem is fixed we will publish the new kernel to the archive.

wrmack commented 10 months ago

I'd like to test 6.5.0-13.13, but I had already upgraded to the development branch of Noble. I'm using kernel 6.6 at the moment and it is reasonably stable. I tried installing and booting into 6.5.0-13.13 a few times, but I kept getting freezes. Nothing useful in journalctl to indicate what caused this.

wdormann commented 10 months ago

I've got an environment set up that just happens to tickle this bug, so it's pretty easy for me to test various configurations. (So I'll happily volunteer my time and my setup to test any what-if theories to help nail down this bug.) The most recent thing I discovered is that even with the latest Linux kernel, I can trigger filesystem corruption if my external disk is formatted with ExFAT. Given that the guest OS is agnostic to whatever filesystem houses its host-level disk backing file, I can only imagine that the appearance of disk corruption is somehow tied to the speed, latency, or some other attribute of the disk as the guest VM sees it. Somehow the combination of an external USB-connected drive that is also formatted with ExFAT triggers the corruption bug, usually within minutes for me.

If I take the exact same disk and reformat it with APFS, certain Linux guest VMs behave quite solidly. And yes, just to eliminate variables, I've swapped the underlying M2 SSD, the enclosure, and the cable that connects it to my computer. None of those changes affected the behavior. So, given an APFS-formatted drive, I see:

Linux kernel 6.5.6: No corruption

Screenshot 2023-11-17 at 12 34 44 PM

Linux kernel 6.4.15 (which was mentioned in another thread about this topic): No corruption

Screenshot 2023-11-17 at 11 01 21 AM

The proposed Linux kernel 6.5.0-13.13 (meant to fix the corruption outlined in this ticket): Corruption

Screenshot 2023-11-17 at 8 57 04 AM

Note that in this screenshot, the BTRFS corruption is evident in dmesg, but oddly isn't yet in the btrfs check output. But given enough time it will show up in btrfs check, and depending on my luck the filesystem may eventually get switched to read-only mode.

My best guesses at this point are:

  1. Whatever fix went into this 6.5.0-13.13 kernel isn't enough to prevent corruption to the level of Linux 6.5.6 or even 6.4.15.
  2. Even Linux kernel 6.5.6 (or 6.4.15) has a small chance of encountering data corruption, depending on the disk operations that are happening combined with the latency/speed/?? of said storage. Like I said, in my case, I can easily reproduce the corruption with a VM that lives on an ExFAT-formatted volume that lives on a USB 3.2 connected drive. But YMMV.

I suppose one other alternative to theory 2) above is that the MacOS ExFAT implementation may subtly introduce disk corruption under certain workloads in a way that's not even noticeable from the host level. But this seems somewhat unlikely based on the fact that I've had Windows and macOS VMs running in the same configuration, from the same ExFAT-backed storage, running the same workload (building and cleaning Qt6 in an infinite loop), and neither of those guest OSes experienced any corruption whatsoever when running for over 24 hours straight.

So goal 1 might be to get Ubuntu-packaged ARM64 kernels to stability parity with 6.4.15 or 6.5.6. A second, more persistent goal might be to find whatever data-corruption bugs even the 6.5.6 Linux kernel may still have. Given the seemingly rare configuration that exhibits this bug, this might be a tough ask.

wpiekutowski commented 10 months ago

Unfortunately I'm getting this problem on 6.6.0 as well. I didn't bother to fix the broken files and upgraded to 6.6.1, so I'm not sure if that one is buggy too. I'm using the internal storage. It's a good idea, as above, to use BTRFS or any other filesystem with metadata and data checksums, so you can see earlier when things go wrong. With btrfs scrub start you can check how many errors there are already. You might even see which files are affected in the dmesg output. If BTRFS isn't too corrupted, you can recover by deleting the affected files and restoring them from the package manager or backups. Also a good idea is to use NixOS, so you can quickly reinstall your VM if it can't be fixed anymore.

gnattu commented 10 months ago
> 1. Whatever fix went into this 6.5.0-13.13 kernel isn't enough to prevent corruption to the level of Linux 6.5.6 or even 6.4.15.

@wdormann Would you like to test the workaround patch that I posted in this issue here? If you need further help with using this patch, please let me know and I can post a kernel deb with this patch based on the Ubuntu 6.5.0-14.14 kernel if I have time.

Edit2: The pre-built kernel deb for ubuntu is here

You need to install the linux-modules-6.5.0-14-generic_6.5.0-14.14_arm64.deb first, then install linux-image-unsigned-6.5.0-14-generic_6.5.0-14.14_arm64.deb. If you are having conflicts reported by dpkg, then you probably need to manually uninstall linux-generic and linux-image-generic first.

After the installation you should be able to see this line in dmesg:

[    0.000000] Linux version 6.5.0-14-generic (root@ubuntu23) (aarch64-linux-gnu-gcc-13 (Ubuntu 13.2.0-4ubuntu3) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.41) #14 SMP PREEMPT_DYNAMIC Sat Nov 18 14:21:19 CST 2023 (Ubuntu 6.5.0-14.14-generic 6.5.3)

The root@ubuntu23 and the kernel build date indicate that you are using the custom kernel, not the stock Ubuntu 6.5.0-14.

Edit: My previous patch does not apply to 6.5.0-14.14. This is the new patch:

--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -31,6 +31,8 @@
 #define VIRTIO_BLK_INLINE_SG_CNT   2
 #endif

+#define FLUSH_INTERVAL (msecs_to_jiffies(1000))
+
 static unsigned int num_request_queues;
 module_param(num_request_queues, uint, 0644);
 MODULE_PARM_DESC(num_request_queues,
@@ -84,6 +86,9 @@ struct virtio_blk {

    /* For zoned device */
    unsigned int zone_sectors;
+   unsigned long last_flush;
+   struct delayed_work flush_dwork;
+   struct bio flush_bio;
 };

 struct virtblk_req {
@@ -426,6 +431,36 @@ static blk_status_t virtblk_prep_rq(stru
    return BLK_STS_OK;
 }

+
+static bool virtblk_skip_flush(struct virtio_blk *vblk, struct request *req)
+{
+   if (req_op(req) != REQ_OP_FLUSH)
+       return false;
+   if (req->cmd_flags & REQ_DRV)
+       return false;
+   if (delayed_work_pending(&vblk->flush_dwork))
+       return true;
+   if (time_before(jiffies, vblk->last_flush + FLUSH_INTERVAL)) {
+       kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &vblk->flush_dwork, FLUSH_INTERVAL);
+       return true;
+   }
+
+   vblk->last_flush = jiffies;
+   return false;
+}
+
+static void virtblk_flush_work(struct work_struct *work)
+{
+   struct virtio_blk *vblk = container_of(work, struct virtio_blk, flush_dwork.work);
+   struct bio *bio = &vblk->flush_bio;
+
+   bio_init(bio, vblk->disk->part0, NULL, 0,
+        REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH | REQ_DRV);
+   submit_bio(bio);
+
+   vblk->last_flush = jiffies;
+}
+
 static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
               const struct blk_mq_queue_data *bd)
 {
@@ -442,6 +477,11 @@ static blk_status_t virtio_queue_rq(stru
    if (unlikely(status))
        return status;

+   if (virtblk_skip_flush(vblk, req)) {
+       blk_mq_complete_request(req);
+       return BLK_STS_OK;
+   }
+
    spin_lock_irqsave(&vblk->vqs[qid].lock, flags);
    err = virtblk_add_req(vblk->vqs[qid].vq, vbr);
    if (err) {
@@ -487,6 +527,12 @@ static bool virtblk_add_req_batch(struct
    while (!rq_list_empty(*rqlist)) {
        struct request *req = rq_list_pop(rqlist);
        struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
+       struct virtio_blk *vblk = req->mq_hctx->queue->queuedata;
+
+       if (virtblk_skip_flush(vblk, req)) {
+           blk_mq_complete_request(req);
+           continue;
+       }

        err = virtblk_add_req(vq->vq, vbr);
        if (err) {
@@ -1368,7 +1414,7 @@ static int virtblk_probe(struct virtio_d
    vblk->tag_set.ops = &virtio_mq_ops;
    vblk->tag_set.queue_depth = queue_depth;
    vblk->tag_set.numa_node = NUMA_NO_NODE;
-   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+   vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED_BY_DEFAULT;
    vblk->tag_set.cmd_size =
        sizeof(struct virtblk_req) +
        sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
@@ -1562,6 +1608,9 @@ static int virtblk_probe(struct virtio_d
            q->limits.discard_granularity = blk_size;
    }

+   vblk->last_flush = jiffies - FLUSH_INTERVAL;
+   INIT_DELAYED_WORK(&vblk->flush_dwork, virtblk_flush_work);
+
    virtblk_update_capacity(vblk, false);
    virtio_device_ready(vdev);

@@ -1602,6 +1651,7 @@ static void virtblk_remove(struct virtio

    /* Make sure no work handler is accessing the device. */
    flush_work(&vblk->config_work);
+   flush_delayed_work(&vblk->flush_dwork);

    del_gendisk(vblk->disk);
    blk_mq_free_tag_set(&vblk->tag_set);
@@ -1632,6 +1682,7 @@ static int virtblk_freeze(struct virtio_

    /* Make sure no work handler is accessing the device. */
    flush_work(&vblk->config_work);
+   flush_delayed_work(&vblk->flush_dwork);

    blk_mq_quiesce_queue(vblk->disk->queue);
wdormann commented 10 months ago

Thanks for the DEBs, as that makes things easier. Sadly, in my particular environment that has shown itself to be likely to trigger corruption, this kernel version also demonstrates corruption after a short amount of time.

Screenshot 2023-11-18 at 8 33 48 AM

On the other hand, a FreeBSD VM with the exact same workload and the same location on my host platform works flawlessly, and overnight at that. And I believe ZFS should be well aware of when corruption happens, as opposed to more traditional filesystems.

Screenshot 2023-11-18 at 8 00 32 AM
gnattu commented 10 months ago

> Thanks for the DEBs, as that makes things easier. Sadly, in my particular environment that has shown itself to be likely to trigger corruption, this kernel version also demonstrates corruption after a short amount of time.

Oh, then doing that is not enough either. Things are getting tricky now. Have you still been encountering IO errors like the ones mentioned in this comment? If there are no IO errors like that, and there is still 'silent' corruption on the disk, then something outside of the virtio_blk driver itself might be triggering this bug.

wdormann commented 10 months ago

I'm not seeing IO errors. I'll say that depending on how the VM is configured, the device may be /dev/sda, so I'm pretty sure that this implies that virtio is being used for storage. As opposed to /dev/vda, that is.

An example dmesg from a corrupting VM is:

[   25.630347] audit: type=1400 audit(1700341471.031:60): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap.cups.cups-browsed" pid=1220 comm="apparmor_parser"
[   76.677067] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 49983488 csum 0xeec14ccd expected csum 0x8941f998 mirror 1
[   76.677074] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[   76.677080] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 49987584 csum 0x3c6a50d6 expected csum 0x8941f998 mirror 1
[   76.677082] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
[   76.677083] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 49991680 csum 0xd43e4a55 expected csum 0x8941f998 mirror 1
[   76.677084] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
[   76.677086] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 49995776 csum 0xb139a859 expected csum 0x8941f998 mirror 1
[   76.677087] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
[   76.677088] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 49999872 csum 0x2a69759f expected csum 0x8941f998 mirror 1
[   76.677089] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
[   76.677091] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 50003968 csum 0x53df5d94 expected csum 0x8941f998 mirror 1
[   76.677091] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
[   76.677093] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 50008064 csum 0xe4aa7e3d expected csum 0x8941f998 mirror 1
[   76.677094] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 7, gen 0
[   76.677095] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 50012160 csum 0x6094b173 expected csum 0x8941f998 mirror 1
[   76.677096] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 8, gen 0
[   76.677098] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 50016256 csum 0xd05dddb6 expected csum 0x8941f998 mirror 1
[   76.677099] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 9, gen 0
[   76.677100] BTRFS warning (device dm-0): csum failed root 5 ino 3227676 off 50020352 csum 0x0c8d8613 expected csum 0x8941f998 mirror 1
[   76.677101] BTRFS error (device dm-0): bdev /dev/mapper/ubuntu--vg-ubuntu--lv errs: wr 0, rd 0, flush 0, corrupt 10, gen 0

The first line is the end of the normal dmesg output, and immediately after that it goes into BTRFS recognizing failed checksums. No IO errors precede the BTRFS errors.

wpiekutowski commented 10 months ago

TL;DR

Reproducing the problem

I found 2 ways to easily reproduce this in my case.

stress-ng (without patch)

stress-ng --iomix 2. I quickly see BTRFS errors in dmesg. Note that it doesn't make sense to run this on a fs without data checksums. It didn't show me any i/o errors on ext4.

parallel cp (without patch)

Download a big file, around 5 GB, for example a Linux distro ISO. Then cp --reflink=never ubuntu-22.04.3-desktop-amd64.iso ubuntu-22.04.3-desktop-amd64-2.iso & cp --reflink=never ubuntu-22.04.3-desktop-amd64.iso ubuntu-22.04.3-desktop-amd64-3.iso && fg. This seems to work at first sight – no issues in dmesg or the terminal. Now let's see if the contents are ok: sha256sum *.iso. This results in at least one file reporting an i/o error and a bunch of errors in dmesg. On ext4 there are no errors in dmesg, but the shasums are incorrect.

@gnattu patch

Applied on top of 6.5.11, it fixed the stress-ng generated errors! It's also harder to trigger the parallel cp error: 2 simultaneous copying processes now result in valid files, and I had to go up to 5 to see the problem again. I'll test with regular usage in the coming days.

macOS results

These 2 things above run fine in a macOS VM or on a macOS host, even with a higher number of stress-ng threads or more files copied at once. However, stress-ng on macOS might not report errors because APFS doesn't checksum data.

Various Linux kernels (without patch)

I've also run this on kernels 5.15, 6.6.1 and asahi-6.5-29 and asahi-6.1-pre1 with the same results. Building kernels from source on NixOS is a piece of cake, unless you get a disk error in some critical file.

sync mount option (without patch)

Adding sync to the mount options didn't help (stress-ng --iomix 4 triggered the problem after about 2 minutes); it just made things terribly slow by today's standards.

NixOS observations (without patch)

I did around 4 kernel compilations and the problem didn't manifest itself. I'm using a 10-core CPU, so if it's just about parallelism, it should have happened multiple times.

The problem does sometimes happen during a NixOS generation switch, randomly. Multiple files are read and written during this process, and sometimes downloaded as well. I'm not sure how much parallelism there is.

wdormann commented 10 months ago

> stress-ng --iomix 2. I quickly see BTRFS errors in dmesg. Note that it doesn't make sense to run this on a fs without data checksums. It didn't show me any i/o errors on ext4.

Can confirm. This is an easy way to trigger the bug, and for me it even worked on a Linux VM running a recent kernel (6.5.6) on my internal SSD (which is, perhaps obviously, formatted as APFS). This only took a minute or two.

Screenshot 2023-11-21 at 3 22 32 PM
wpiekutowski commented 10 months ago

After 3 hours of stress-ng --iomix 4 I didn't get any errors, so @gnattu's patch is definitely helping. Parallel cp still results in corrupted files, but at least we're getting somewhere.

BTW here's my NixOS kernel config:

  boot.kernelPackages = pkgs.linuxPackages_6_5;
  boot.kernelPatches = [
    {
      name = "virtio-fix";
      patch = /root/virtio.patch;
    }
  ];
wpiekutowski commented 10 months ago

After some more time, stress-ng started erroring for me. Seems like this isn't the solution after all.

gnattu commented 10 months ago

> After some more time, stress-ng started erroring for me. Seems like this isn't the solution after all.

This was never meant to be a "solution", just a workaround that throttles the cache flushing so that we are less likely to see an fs error in most real-world workloads. We could still get fs errors in extreme cases like the one you presented, but I think the root problem is probably not in the implementation of the virtio block device or its driver but somewhere else in the kernel. I'm not 100% sure about it, but we are seeing that 1. non-virtio devices also show similar problems, judging by reports in this issue, and 2. other OSes like FreeBSD seem to work fine with the same virtio device. Apple's Virtualization framework adds a new VZNVMExpressControllerDeviceConfiguration that implements an NVMe interface instead of the virtio interface, but it is not supported by UTM yet. I'm going to build a minimal app for testing this device to see if it also suffers from this issue.
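
For reference, a minimal sketch of what such a test configuration might look like (a rough, untested sketch, not an excerpt from the actual test app; the disk path is a placeholder and the APIs assume macOS 14 / Sonoma):

import Foundation
import Virtualization

// Rough sketch (untested): attach a raw disk image through the NVMe controller
// configuration that Sonoma's Virtualization framework exposes, instead of virtio-blk.
// "/path/to/disk.img" is a placeholder.
func makeNVMeStorageDevice() throws -> VZStorageDeviceConfiguration {
    let attachment = try VZDiskImageStorageDeviceAttachment(
        url: URL(fileURLWithPath: "/path/to/disk.img"),
        readOnly: false)
    // VZNVMExpressControllerDeviceConfiguration is available on macOS 14 (Sonoma) and later.
    return VZNVMExpressControllerDeviceConfiguration(attachment: attachment)
}

// Usage: configuration.storageDevices = [try makeNVMeStorageDevice()]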

gnattu commented 10 months ago
Screenshot 2023-11-23 at 04 46 25

So with the stock Fedora 6.5.6 kernel, a quick 11-minute stress-ng run does not cause btrfs errors with the new virtual NVMe device introduced in Sonoma. So the bug is probably related to the virtio device, more specifically, to the virtio device under heavy cache-flushing pressure.

gnattu commented 10 months ago

I re-ran the test, using the same stock Fedora 6.5.6 kernel, but this time with virtio. I changed the configuration in my minimal test app so that disk caching is disabled in the hypervisor and the disk synchronization mode is set to full sync on the hypervisor side. I'm not seeing the error from the same quick test run for ~10 minutes. Running in this mode could reduce IO write performance and I'm not sure whether it should be used for most workloads.

Screenshot 2023-11-23 at 05 15 37
gnattu commented 10 months ago
Screenshot 2023-11-23 at 05 22 33

Oh, I think I drew that conclusion too soon. After a reboot I'm seeing btrfs errors even with caching disabled and full sync set on the hypervisor side with virtio. I'm going to re-test the NVMe case to see if it has the same issue after a reboot.

gnattu commented 10 months ago
Screenshot 2023-11-23 at 05 44 43

The emulated NVMe device does not suffer from the reboot-and-corruption issue. Even when I uncleanly shut down the VM, I only get file-corruption warnings from dmesg, not a filesystem error. File corruption is expected when you uncleanly shut down a VM, but the filesystem itself should not become corrupted.

I also noticed that the emulated NVMe device is slower than the virtio device. However, for data-safety reasons, I believe we should switch to this device as the default on macOS Sonoma to mitigate filesystem corruption until we find a better solution. It's quite strange to me that only Linux is experiencing such problems, whereas FreeBSD and macOS using the same virtio interface appear to work fine.

wpiekutowski commented 10 months ago

Great find @gnattu! I confirm it works for me and I see about a 5-10% performance penalty.

I've found something else that fixes this as well. I was playing with Apple's GUI Linux VM demo, which uses virtio. I noticed you can control the caching mode of the attached disk images:

let attachment = try VZDiskImageStorageDeviceAttachment(
  url: URL(fileURLWithPath: path),
  readOnly: false,
  cachingMode: .uncached,
  synchronizationMode: .full)

Caching mode:

The difference between these options comes down to whether the host macOS uses its own memory to cache access to the disk image or not. automatic defaults to uncached for me, even though I have 32 GB RAM (10 GB used by apps at VM boot). I'd say using cached is a waste of RAM, because the guest is caching things as well and knows better what to cache. But I've found cached to be very reliable.

Synchronization mode – I've tried all 3 values with caching set to uncached, but I always got corrupted files.

NVMe isn't affected by caching mode – it works well with all 3 caching options.
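
For reference, the three caching modes discussed above correspond to these enum cases (a small sketch; the annotations are my reading of the behaviour described in this thread, not Apple documentation):

import Virtualization

// The caching modes discussed above, annotated with the observed behaviour:
let modes: [VZDiskImageCachingMode] = [
    .automatic,   // let the framework decide (reportedly picks uncached here)
    .cached,      // host macOS keeps disk-image pages in its own RAM; reliable in these tests
    .uncached     // bypass the host page cache; the mode that showed corruption with virtio
]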

I'll report all this to Apple through the Feedback Assistant. Maybe one day somebody will take a look at it and fix it. I've attached my report in case anybody would like to do the same to put on a bit of pressure.

apple virtio feedback.txt

wdormann commented 10 months ago

While it seems like folks are starting to get a grasp of the issue at hand here, I figured I'd share my latest test results to perhaps help eliminate variables. With the knowledge that stress-ng --iomix 4 can trigger the bug with a Linux VM within seconds of having started, I did the same test with a macOS VM with the same host-level disk backing. I let it run for 17 hours, and there wasn't any evidence of trouble in dmesg or Disk Utility.

Screenshot 2023-11-23 at 7 44 56 AM

Regarding the caching mode, I can confirm that this indeed does avoid filesystem corruption in my test VM. For those playing along, in GUILinuxVirtualMachineSampleApp/AppDelegate.swift, change it to be:

    private func createBlockDeviceConfiguration() -> VZVirtioBlockDeviceConfiguration {
        guard let mainDiskAttachment = try? VZDiskImageStorageDeviceAttachment(url: URL(fileURLWithPath: mainDiskImagePath), readOnly: false, cachingMode: .cached, synchronizationMode: .full) else {
            fatalError("Failed to create main disk attachment.")
        }
        // Remainder as in Apple's sample: wrap the attachment in a virtio block device.
        return VZVirtioBlockDeviceConfiguration(attachment: mainDiskAttachment)
    }

Now, just so that I (and folks following this thread) can even have a hope of understanding what's going on here...

1) Given that Windows, macOS, and friends never seem to experience any corruption, what about Linux might cause only it to experience corruption? Are other OSes simply not using the flawed virtualized storage path? Does Linux, by luck or design, exhibit the usage patterns that trigger an Apple-level flaw? Or does Linux itself have a flaw that makes corruption possible?

2) Depending on virtualization/configuration used, I've seen (and I believe others have seen) Linux VMs with disk devices like /dev/sda2, which would suggest that the VM is using a non-virtio device (e.g. SATA). Yet corruption can still happen in these cases.

3) One test that at least I've done is to not use the Apple Virtualization Framework at all (e.g. using UTM with Use Apple Virtualization un-checked at VM provisioning), and it can still experience corruption. Which makes me think that this sort of VM wouldn't use any of the exposed Apple Virtualization capabilities. Unless perhaps QEMU itself is designed in a way to leverage some parts of Apple Virtualization.

So in a nutshell, yes, I can confirm that setting cached mode does indeed avoid the corruption problem. But what I'm curious about is whether this is merely an Apple-level bug, and Linux is just unfortunate enough to see the symptoms of it based on who-knows-what about its use of the virtual storage devices. Or, does aarch64 Linux have a bug (perhaps only visible when running virtualized) that can be mitigated by setting the cached storage option?

hasan4791 commented 10 months ago

@wdormann Just to make sure, did you run the stress test on the 6.7 kernel as well? I've had a dev Fedora running for more than 2 weeks and haven't seen disk corruption on it. If I use the stable one, it happens almost immediately for me.

wdormann commented 10 months ago

In my testing, nothing prevents Linux 6.7 from hitting corruption if you're unlucky.

Screenshot 2023-11-27 at 4 45 46 PM

Note that this is from my likely-to-corrupt-only-Linux ExFAT storage device, in which newer Linux kernels do appear to fare a bit better than older ones, FWIW.

tie commented 10 months ago

> In my testing, nothing prevents Linux 6.7 from hitting corruption if you're unlucky.

Agreed, for some reason this issue manifested for me only recently and I don’t recall updating anything. I’ve tried running NixOS with linuxPackages_6_7 (from nixos-unstable branch at 5a09cb4b393d58f9ed0d9ca1555016a8543c2ac8) and it does not fix the issue.

Running with #5919 merged, I haven’t experienced any corruption yet, although the performance impact is significant. Before switching to NVMe, btrfs scrub speed was around ~1.6 GiB/s (compared to 2–3 GiB/s with the same raw image under QEMU). Now it’s around 0.5 GiB/s at best.

gnattu commented 9 months ago

We have a new situation. Someone is encountering outright NVMe I/O errors on Debian with macOS 14.2 running the 6.1 kernel: [link to the comment]. I couldn't reproduce this in my current environment (Fedora with a 6.5 kernel on macOS 14.2) using the stress-ng method. If anyone has experienced this since macOS 14.2, please share your environment details.

Git-North commented 8 months ago

https://github.com/utmapp/UTM/pull/5919

alanfranz commented 8 months ago

I'm using UTM 4.4.4 patched with https://github.com/utmapp/UTM/pull/5919 on Sonoma 14.2.1; the VM is Fedora 38 with a 6.6.7 kernel, using an NVMe disk. It had worked well so far, but today I got an ext4 corruption:

[13837.061562] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228034: comm updatedb: iget: checksum invalid
[13837.061598] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228040: comm updatedb: iget: checksum invalid
[13837.061617] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228044: comm updatedb: iget: checksum invalid
[13837.061632] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228049: comm updatedb: iget: checksum invalid
[13837.061647] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228057: comm updatedb: iget: checksum invalid
[60913.627382] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228034: comm updatedb: iget: checksum invalid
[60913.627661] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228040: comm updatedb: iget: checksum invalid
[60913.627858] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228044: comm updatedb: iget: checksum invalid
[60913.628031] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228049: comm updatedb: iget: checksum invalid
[60913.628322] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #10228057: comm updatedb: iget: checksum invalid
[85342.514726] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85342.514759] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid
[85342.581619] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85342.581686] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid
[85356.290442] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85356.290506] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid
[85356.297589] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85356.297657] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid
[85364.250987] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85364.251223] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid
[85364.255971] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702271: comm git: iget: checksum invalid
[85364.256019] EXT4-fs error (device dm-0): ext4_lookup:1855: inode #15702272: comm git: iget: checksum invalid

Now the system doesn't boot and wants me to force an fsck.

The corruption apparently manifested after a resume (of the host).

fcool commented 7 months ago

We have been using a build based on gnattu's patch for 2 weeks now.

But we experienced exactly the same thing as alanfranz: using the NVMe interface and extracting a Gentoo stage 3 sometimes leads to filesystem inconsistencies. Disabling NVMe and going with the "cached" disk interface has caused no problems so far. It is not reliably reproducible, but the probability of a broken filesystem grows with each extraction, and it can be made more or less reliable by extracting the same tar archive 3 times in a row.

jkleckner commented 7 months ago

Is the cachingMode setting available via the UTM UI if I want to try this out?

fcool commented 7 months ago

No, it is not. But it is chosen automatically if you switch off NVMe.

wpiekutowski commented 6 months ago

I've got a response from Apple about this problem through the Feedback Assistant. They claim this problem should have been fixed in macOS 14.4, though it isn't mentioned in the release notes. I'm currently not using UTM, so it would be great if somebody else could verify whether Virtio in uncached mode still produces disk corruption.

kurgannet commented 6 months ago

Can confirm it still happens in my case. I can't even install Kali Linux. It crashes while writing data to disk.

Tested with macOS 14.4, UTM 4.5, Kali 2024.01 on a MacBook Pro M1 Pro.

kurgannet commented 6 months ago

Also tried on an M2 Mac mini, macOS 14.4, UTM 4.5, Kali 2024.01:

image
maartenweyns commented 6 months ago

So is there any progress on this issue? I'm running into this as well, and I'm actually unable to complete a Debian install because it freezes before it's finished due to this read-only disk issue.

UTM 4.4.5 on an M2 MacBook Pro, macOS 14.4

wpiekutowski commented 6 months ago

The only known solution at this point is to build UTM yourself from the PR #5919

maartenweyns commented 6 months ago

I just built UTM from that PR and it actually does not fix the issue for me. I am still unable to complete a Debian install as it randomly freezes from time to time. I have verified that it is using the NVMe interface.

wpiekutowski commented 6 months ago

How about Virtio (NVMe disabled)? It should work because cached mode is enabled.

wrmack commented 6 months ago

FWIW - I am using tart in the meantime. Gist here.

maartenweyns commented 5 months ago

> How about Virtio (NVMe disabled)? It should work because cached mode is enabled.

Aha, that does indeed work! A custom build from #5919 with NVMe disabled makes the machine live happily! Thanks for the tip. Now I can go get a refund on Parallels ;)