VM disk corruption with Apple Silicon

EdwardMoyse commented 1 year ago

[!TIP]

EDIT by @AkihiroSuda

For --vm-type=vz, this issue seems to have been solved in Lima v0.19 (https://github.com/lima-vm/lima/pull/2026)

Description

Lima version: 0.18.0 macOS: 14.0 (23A344) VM: Almalinux9

I was trying to do a big compile, using a VM with the attached configuration (vz)

NAME           STATUS     SSH                VMTYPE    ARCH       CPUS    MEMORY    DISK      DIR
myalma9        Running    127.0.0.1:49434    vz        aarch64    4       16GiB     100GiB    ~/.lima/myalma9

The build aborted with:

from /Volumes/Lima/build/build/AthenaExternals/src/Geant4/source/processes/hadronic/models/lend/src/[xDataTOM_LegendreSeries.cc](http://xdatatom_legendreseries.cc/):7:
/usr/include/bits/types.h:142:10: fatal error: /usr/include/bits/time64.h: Input/output error

And afterwards, even in a different terminal, I see:

[emoyse@lima-myalma9 emoyse]$ ls
bash: /usr/bin/ls: Input/output error

I was also logged into a display, and there I saw e.g.

If I try to log in again with:

limactl shell myalma9

each time I see something like the following appear in the display window:

[56247.6427031] Core dump to l/usr/lib/systemd/systemd-coredump pipe failed

Edit: there has been a lot of discussion below, and the corruption can happen with both vz and qemu, and on external (to the VM) and internal disks. Some permutations seem more likely to provoke a corruption than others. I have reproduced my experiments in the table in the following comment below.

EdwardMoyse commented 1 year ago

In case it is relevant, I was compiling in a separate APFS (Case-sensitive) Volume as described here. This volume seems absolutely fine - so the corruption seems limited to the VM itself. I can't see how this could have happened with 100GB but I wonder if it's possible that it ran out of space? I could try increasing the disk size, but the whole point of using an external volume was that this would not be necessary.

AkihiroSuda commented 1 year ago

Is this specific to AlmaLinux 9?
Is this specific to --vm-type=vz?
Do you see some meaningful message in dmesg?

EdwardMoyse commented 1 year ago

Hmm. I just tried again but compiling in /tmp rather than the case-sensitive volume, and this worked fine. A colleague has confirmed a similar experience - problems with /Volumes/Lima, but works fine in /tmp. So my best guess right now is it is some interaction with an APFS Volume and Lima (which might also explain the following "stuck VM" discussion : https://github.com/lima-vm/lima/discussions/1666)

Answering your other questions:

I'm not sure if it is specific to AlmaLinux9 since my use-case requires that particular OS. But I will try doing a big compile on a different OS and in a Volume and to see if I can replicate.
We have not replicated with qemu but this is so slow that this is very hard to do. I will try. I will also try again with dmesg running.

afbjorklund commented 1 year ago

You would get much better performance with a local filesystem, as well. If you want to keep it separate from the OS image, you could add a native Lima image using the limactl disk command. And copy the results, when the build is done.

EDIT: One potential feature could be to be able create disk images on an attached disk, instead of under LIMA_HOME. You can probably use symlinks from _disks as a workaround, but would be better with some optional flag support...

afbjorklund commented 1 year ago

If you really need to access 100 GiB from the host, then we might have to add some more "boring" type like NFS... It seems like sftpd and virtiofsd, and also their Linux clients, have some stability issues when put under pressure?

EdwardMoyse commented 1 year ago

EDIT: One potential feature could be to be able create disk images on an attached disk, instead of under LIMA_HOME. You can probably use symlinks from _disks as a workaround, but would be better with some optional flag support...

This would, I think, really help us.

Our use-case is this - we want to be able to edit files from within macOS, but then compile inside Almalinux9. The codebase we are compiling is relatively large (>4 million lines of C++) and can take up to 400 GB of temporary compilation space. I was reluctant to make separate VMs with this much local storage, especially since a lot of us will be working on laptops. Ideally we would have a large build area (possibly on an external drive), accessible from several VMs, and with very fast disk io to the VM (since otherwise the build time can become unusably slow). We do NOT, in general, need to be able to access this build area from the host (at least, not with fast io - it would mainly be to examine compilation failures)

(I will get back to the other tests shortly - but I'm currently travelling with limited work time, and it seems very likely that the issue is related to compiling outside the VM)

AkihiroSuda commented 1 year ago

(I will get back to the other tests shortly - but I'm currently travelling with limited work time, and it seems very likely that the issue is related to compiling outside the VM)

I'm not sure how virtiofs affects the XFS disk, but maybe this issue should be reported to Apple?

afbjorklund commented 1 year ago

I was under the impression that the problem was with the /Volumes/Lima mount, but the logs say vda2...

  - location: /Volumes/Lima
    writable: true

So the remote filesystem is a separate topic*, from this ARM64 disk corruption. Sorry for the added noise.

Though I don't see how switching from remote /Volumes/Lima to local /tmp could have helped, then...

* should continue in a different discussion

Note that disk images cannot be shared... (they can be unplugged and remounted)

AkihiroSuda commented 1 year ago

Is this relevant?

https://github.com/utmapp/UTM/issues/4840

(UTM uses vz too)

Looks like people began to hit this issue since September, so I wonder if Apple introduced a regression on that time?

I still can't repro the issue locally though. (macOS 14.1 on Intel MacBookPro 2020, macOS 13.5.2 on EC2 mac2-m2pro)

AkihiroSuda commented 1 year ago

Can anybody confirm this rumor?

https://github.com/utmapp/UTM/issues/4840#issuecomment-1764436352

Is it me or deactivating ballooning solves the problem? I've deactivated it two weeks ago, and no problem since on my side.

Removing these lines will disable ballooning: https://github.com/lima-vm/lima/blob/7cb2b2e66215dd5f0aac280375645eec67550db4/pkg/vz/vm_darwin.go#L598-L604

wdormann commented 1 year ago

For what it's worth, I believe I've narrowed down the problem that I've noticed in https://github.com/utmapp/UTM/issues/4840 to having used an external SSD drive. I've not reproduced the corruption if the VM lives on my Mac's internal storage.

@EdwardMoyse Your separate APFS volume... is it on the same storage device that your Mac runs on, or is it a separate external device?

@AkihiroSuda I've not seen disabling the Balloon device to help with preventing corruption. At least, if I'm working with a QEMU-based VM that lives on my external SSD storage, it has Balloon Device un-checked by default, and the VM's filesystem will eventually corrupt under heavy disk load. So I believe this is a red herring.

AkihiroSuda commented 1 year ago

I'm working with a QEMU-based VM

Probably, you are hitting a different issue with a similar symptom ?

EdwardMoyse commented 1 year ago

@wdormann my APFS Volume is on same device (SSD) as macOS. It's not an external device in my case.

wdormann commented 1 year ago

Thanks for the input. I've been testing the disk itself, and it has yet to report errors. Given your successful test in /tmp, these both seem to point to a problem using a non-OS volume for the underlying VM OS storage?

AkihiroSuda commented 1 year ago

I think I reproduced the issue with the default Ubuntu template:

[  299.527200] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3793: comm apport: iget: checksum invalid
[  299.527255] Aborting journal on device vda1-8.
[  299.527293] EXT4-fs error (device vda1): ext4_journal_check_start:83: comm cp: Detected aborted journal
[  299.528985] EXT4-fs error (device vda1): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[  299.530464] EXT4-fs (vda1): Remounting filesystem read-only
[  299.530515] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3794: comm apport: iget: checksum invalid
[  299.535137] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3795: comm apport: iget: checksum invalid
[  299.538878] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3796: comm apport: iget: checksum invalid
[  299.543827] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3797: comm apport: iget: checksum invalid
[  299.550614] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3798: comm apport: iget: checksum invalid
[  299.551947] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3799: comm apport: iget: checksum invalid
[  299.553651] EXT4-fs error (device vda1): ext4_lookup:1851: inode #3800: comm apport: iget: checksum invalid
[  299.821872] audit: type=1131 audit(1698675832.913:271): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=systemd-journald comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
[  299.821967] BUG: Bad rss-counter state mm:0000000013fa5858 type:MM_FILEPAGES val:43
[  299.821980] BUG: Bad rss-counter state mm:0000000013fa5858 type:MM_ANONPAGES val:3
[  299.821982] BUG: non-zero pgtables_bytes on freeing mm: 4096
[  299.822551] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000070
[  299.822566] Mem abort info:
[  299.822566]   ESR = 0x0000000096000004
[  299.822568]   EC = 0x25: DABT (current EL), IL = 32 bits
[  299.822569]   SET = 0, FnV = 0
[  299.822570]   EA = 0, S1PTW = 0
[  299.822570]   FSC = 0x04: level 0 translation fault
[  299.822571] Data abort info:
[  299.822572]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  299.822573]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  299.822574]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  299.822575] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000100970000
[  299.822576] [0000000000000070] pgd=0000000000000000, p4d=0000000000000000
[  299.822604] Internal error: Oops: 0000000096000004 [#1] SMP
[  299.822615] Modules linked in: tls nft_chain_nat overlay xt_tcpudp xt_nat xt_multiport xt_mark xt_conntrack xt_comment xt_addrtype xt_MASQUERADE nf_tables nfnetlink ip6table_filter iptable_filter ip6table_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables veth bridge stp llc tap isofs binfmt_misc nls_iso8859_1 vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common vsock virtiofs joydev input_leds drm 
[  299.822800] Unable to handle kernel paging request at virtual address fffffffffffffff8
[  299.822805] Mem abort info:
[  299.822805]   ESR = 0x0000000096000004
[  299.822806]   EC = 0x25: DABT (current EL), IL = 32 bits
[  299.822807]   SET = 0, FnV = 0
[  299.822808]   EA = 0, S1PTW = 0
[  299.822809]   FSC = 0x04: level 0 translation fault
[  299.822810] Data abort info:
[  299.822810]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  299.822811]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  299.822812]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  299.822813] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000864e50000
[  299.822814] [fffffffffffffff8] pgd=0000000000000000, p4d=0000000000000000
[  361.102020] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  361.102094] rcu:     1-...0: (1 GPs behind) idle=e0b4/1/0x4000000000000000 softirq=23608/23609 fqs=6997
[  361.102102] rcu:              hardirqs   softirqs   csw/system
[  361.102103] rcu:      number:        0          0            0
[  361.102104] rcu:     cputime:        0          0            0   ==> 30000(ms)
[  361.102105] rcu:     (detected by 3, t=15002 jiffies, g=38213, q=860 ncpus=4)
[  361.102107] Task dump for CPU 1:
[  361.102108] task:systemd         state:S stack:0     pid:1     ppid:0      flags:0x00000002
[  361.102111] Call trace:
[  361.102118]  __switch_to+0xc0/0x108
[  361.102180]  seccomp_filter_release+0x40/0x78
[  361.102203]  release_task+0xf0/0x238
[  361.102216]  wait_task_zombie+0x124/0x5c8
[  361.102218]  wait_consider_task+0x244/0x3c0
[  361.102220]  do_wait+0x178/0x338
[  361.102222]  kernel_waitid+0x100/0x1e8
[  361.102224]  __do_sys_waitid+0x2bc/0x378
[  361.102226]  __arm64_sys_waitid+0x34/0x60
[  361.102228]  invoke_syscall+0x7c/0x128
[  361.102230]  el0_svc_common.constprop.0+0x5c/0x168
[  361.102231]  do_el0_svc+0x38/0x68
[  361.102232]  el0_svc+0x30/0xe0
[  361.102234]  el0t_64_sync_handler+0x148/0x158
[  361.102236]  el0t_64_sync+0x1b0/0x1b8
[  541.118359] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  541.118368] rcu:     1-...0: (1 GPs behind) idle=e0b4/1/0x4000000000000000 softirq=23608/23609 fqs=27191
[  541.118371] rcu:              hardirqs   softirqs   csw/system
[  541.118372] rcu:      number:        0          0            0
[  541.118373] rcu:     cputime:        0          0            0   ==> 210020(ms)
[  541.118375] rcu:     (detected by 3, t=60007 jiffies, g=38213, q=1790 ncpus=4)
[  541.118377] Task dump for CPU 1:
[  541.118379] task:systemd         state:S stack:0     pid:1     ppid:0      flags:0x00000002
[  541.118382] Call trace:
[  541.118383]  __switch_to+0xc0/0x108
[  541.118390]  seccomp_filter_release+0x40/0x78
[  541.118393]  release_task+0xf0/0x238
[  541.118396]  wait_task_zombie+0x124/0x5c8
[  541.118399]  wait_consider_task+0x244/0x3c0
[  541.118401]  do_wait+0x178/0x338
[  541.118403]  kernel_waitid+0x100/0x1e8
[  541.118405]  __do_sys_waitid+0x2bc/0x378
[  541.118407]  __arm64_sys_waitid+0x34/0x60
[  541.118409]  invoke_syscall+0x7c/0x128
[  541.118411]  el0_svc_common.constprop.0+0x5c/0x168
[  541.118412]  do_el0_svc+0x38/0x68
[  541.118413]  el0_svc+0x30/0xe0
[  541.118415]  el0t_64_sync_handler+0x148/0x158
[  541.118417]  el0t_64_sync+0x1b0/0x1b8

(Non-minimum, non-deterministic) repro steps:

Create a mac2-m2pro (32GB RAM) instance on EC2, with macOS 13.5.2 AMI, and a gp2 EBS volume
Install Lima v0.18.0
Run limactl start --vm-type=vz --cpus=4 --memory=32 --disk=100 --name=vm1
Run limactl start --vm-type=vz --cpus=4 --memory=32 --disk=100 --name=vm2
For each of the VMs, run cp -a /Users/ec2-user/some-large-directory ~. Some of them may fail with cp: ...: Read-only filesystem

- Filesystems:


% mount
/dev/disk5s2s1 on / (apfs, sealed, local, read-only, journaled)
devfs on /dev (devfs, local, nobrowse)
/dev/disk5s5 on /System/Volumes/VM (apfs, local, noexec, journaled, noatime, nobrowse)
/dev/disk5s3 on /System/Volumes/Preboot (apfs, local, journaled, nobrowse)
/dev/disk1s2 on /System/Volumes/xarts (apfs, local, noexec, journaled, noatime, nobrowse)
/dev/disk1s1 on /System/Volumes/iSCPreboot (apfs, local, journaled, nobrowse)
/dev/disk1s3 on /System/Volumes/Hardware (apfs, local, journaled, nobrowse)
/dev/disk5s1 on /System/Volumes/Data (apfs, local, journaled, nobrowse)
map auto_home on /System/Volumes/Data/home (autofs, automounted, nobrowse)
/dev/disk3s4 on /private/tmp/tmp-mount-mDoJ7V (apfs, local, journaled, nobrowse)

% stat -f %Sd / disk5s1

% stat -f %Sd /Users/ec2-user/.lima
disk5s1



The VM disk is located in the default path `~/.lima`.

AkihiroSuda commented 1 year ago

Tried to remove the balloon, but the filesystem still seems to break intermittently

[ 1674.027587] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35601: comm apport: iget: checksum invalid
[ 1674.030317] Aborting journal on device vda1-8.
[ 1674.031818] EXT4-fs error (device vda1): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[ 1674.031896] EXT4-fs error (device vda1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[ 1674.033116] EXT4-fs (vda1): Remounting filesystem read-only
[ 1674.033147] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35602: comm apport: iget: checksum invalid
[ 1674.036501] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35603: comm apport: iget: checksum invalid
[ 1674.037738] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35604: comm apport: iget: checksum invalid
[ 1674.038828] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35605: comm apport: iget: checksum invalid
[ 1674.040034] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35606: comm apport: iget: checksum invalid
[ 1674.041091] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35606: comm apport: iget: checksum invalid
[ 1674.042199] EXT4-fs error (device vda1): ext4_lookup:1851: inode #35606: comm apport: iget: checksum invalid

EdwardMoyse commented 1 year ago

Thanks for the input. I've been testing the disk itself, and it has yet to report errors. Given your successful test in /tmp, these both seem to point to a problem using a non-OS volume for the underlying VM OS storage?

I might be perhaps misunderstanding you, but I don't think I am using "non-OS volume for the underlying VM OS storage".

For clarity, here is my setup:

I create a VM using the standard limactl start almalinux9.yaml --name=alma9, and the VM exists on the main macOS volume.
I create a separate APFS (Case-sensitive) Volume, and make it mountable from within the VM:
location: /Volumes/Lima writable: true
if I compile our software in /Volumes/Lima I get disk corruption, if I use /tmp for the same operation, it works fine.

So I would characterise this as rathe ra problem using a non-OS volume for the intensive disk operations from within the VM.

wdormann commented 1 year ago

I'll admit I'm not familiar with Lima. When you say "make it mountable from within the VM", what does that mean?

You have a virtual hard disk file that lives on that separate APFS volume, and your VM is configured to have that as a second disk drive?
You boot the VM, and somehow from Linux user/kernel land mount your /Volumes/Lima directory? (How?)
Something else?

Perhaps Lima does this all for you under the hood, but I suppose that I'd need to know exactly what it's doing to have any hope of understanding what's going on.

EdwardMoyse commented 1 year ago

I'll admit I'm not familiar with Lima. When you say "make it mountable from within the VM", what does that mean?

You have a virtual hard disk file that lives on that separate APFS volume, and your VM is configured to have that as a second disk drive?

You boot the VM, and somehow from Linux user/kernel land mount your /Volumes/Lima directory? (How?)

It's the latter (but I cannot tell you any technicalities how it works). From within both the host and the VM I can access /Volumes/Lima. See https://lima-vm.io/docs/config/mount/

wdormann commented 1 year ago

Do you specify a mount type in your limactl command line and/or config file? Or, from the VM, what does the mount command report for the filesystem in question?

afbjorklund commented 1 year ago

It is still a mystery how the problems from a remote filesystem, can "spread" to cause I/O errors on a local filesystem...

Points to a bug with the hypervisor, or even the host OS and CPU arch? Unless it turns out to be an EL9 guest issue, not seen on x86_64 but only on aarch64

wdormann commented 1 year ago

The documentation says that all filesystem types other than reverse-sshfs are "experimental". @afbjorklund Your earlier comment suggested that /dev/vda (Virtiofs) was how it was being mounted.

Perhaps those looking for a temporary workaround could try using reverse-sshfs instead?

AkihiroSuda commented 1 year ago

virtiofs doesn't seem relevant. UTM users seem hitting the same issue without using virtiofs: https://github.com/utmapp/UTM/issues/4840

The issue seems also reproducible with Apple's example: https://github.com/utmapp/UTM/issues/4840#issuecomment-1786762407

Moreover: I have tried this example Xcode project (https://developer.apple.com/documentation/virtualization/running_gui_linux_in_a_virtual_machine_on_a_mac) and has the same issues. It's pretty clear to me that the issues is not UTM-related but Apple + Linux related, but I haven't found any other discussion forum. Moreover, the UTM community may be more successful should they raise this issue to the Linux kernel team or Apple.

afbjorklund commented 1 year ago

@afbjorklund Your earlier comment suggested that /dev/vda (Virtiofs) was how it was being mounted.

The screenshot of a log above, showed "XFS (vda2)" as the device in question - so not the virtiofs mount? It was showing I/O errors in both /usr/include and /bin/ls, those are not mounted and not on /Volumes/Lima

It is using virtio (otherwise it would be called sda2), but it is not using virtiofs (the remote filesystem) https://virtio-fs.gitlab.io/ (the names here are somewhat confusing, there is also virtfs - which is called 9p)

So this issue is about a recent problem on Apple.

Then we can have a different discussion about building on network filesystems, instead of on local filesystems. I was just curious about the comment about moving the build to /tmp seems to have "cured" the corruption...

https://github.com/lima-vm/lima/issues/1957#issuecomment-1784120488

AkihiroSuda commented 1 year ago

Ha, 6.5.0? That one in particular is completely broken. Needs this patch. If your package doesn't have it backported, there's your problem.

6.4 should be fine, as should 6.5.6 according to the changelog.

(Our Asahi tree is currently on 6.5.0 with that patch cherry-picked. And yes, that is the second time ARM64 atomics got broken!)

Originally posted by [@]marcan in https://github.com/utmapp/UTM/issues/4840#issuecomment-1790843588

Can anybody try kernel 6.6? (Just released 3 days ago).

afbjorklund commented 1 year ago

As far as I know, AlmaLinux 9.2 is running kernel 5.14 : https://wiki.almalinux.org/release-notes/9.2.html

marcan commented 1 year ago

ARM64 atomics have been broken until last year, when I found the issue and got it fixed (it was breaking workqueues which was causing problems with TTYs for me, but who knows what else). 5.14 (released 2021) is definitely broken unless it's a branch with all the required backports.

Try 6.4, that should work. 6.5.0 was a very recent regression. I would not put much faith in older kernels, especially anything older than 5.18 which is where we started. All bets are off if you're running kernels that old on bleeding edge hardware like this. Lots of bugfixes don't get properly backported into stable branches either. Apple CPUs are excellent at triggering all kinds of nasty memory ordering bugs that no other CPUs do, because they speculate/reorder across ridiculous numbers of instructions and even things like IRQs (yes really).

afbjorklund commented 1 year ago

So that means qemu only, unless running Fedora* ? Seems like Virtualization.framework exposes more of the CPU

* ~~or Ubuntu 23.10~~ <-- needs backport

afbjorklund commented 1 year ago

Probably should get the automatic updates in place, since otherwise Fedora 38 will run 6.2.9 until user remembers*...

* to upgrade to 6.5.8

EdwardMoyse commented 1 year ago

I was just curious about the comment about moving the build to /tmp seems to have "cured" the corruption...

Hey @afbjorklund I've been running some more tests, and I just had corruption from /tmp so it doesn't cure it (but perhaps it is slightly less likely to happen). Updating the original post.

EdwardMoyse commented 12 months ago

My apologies for the delay in replying, but i have been looking into this. The workflow is the same - compile https://gitlab.cern.ch/atlas/atlasexternals using the attached template with various configurations of host, qemu/vz, cores and memory.

TLDR; updating to 6.5.10-1 was more stable on M2 (even on 'shared' volume /tmp/lima), but apparently worse on M1 Pro (though the M1Pro has more cores and we pushed this a lot harder). Updating to 6.6.1 was better on M1 Pro (have not tested M2 yet) but got xfs corruption at the very end.

With 6.6.1 I also disabled sleeping on guest:

sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target

(from hint here)

VM Type	Kernel	Cores	RM (GB)	Where	Attempt 1	Attempt 2	Attempt 3	Host Processor
qemu	5.14	6	24	/tmp	Crash + xfs	Crash + xfs	Crash + xfs	M1 Pro
vz	5.14	6	24	/Volumes/Lima	Crash + xfs			M1 Pro
vz	5.14	6	24	/tmp	OK			M1 Pro
qemu	5.6.10.1	6	24	/tmp	OK (but slow)			M1 Pro
vz	5.6.10.1	6	24	/Volumes/Lima	Crash + xfs			M1 Pro
vz	5.6.10.1	6	24	/tmp	Crash a	Crash b		M1 Pro
vz	6.6.1	6	24	/tmp	xfs			M1 Pro
vz	6.6.2-1	4	12	/home/emoyse.linux	xfs			M1 Pro

Notes:

xfs means xfs corruption was reported.
Once xfs corruption has occurred, I trash the VM and restart
Often crash is preceded in dmesg by e.g "hrtimer: interrupt took 32332585ns"

crash a in /var/log/messages I see :

978.3062161 BUG: Bad rss-counter state mm:0000000076c5940f type:M_FILEPAGES val: 402
[978.3067761 BUG: Bad rss-counter state mm:0000000076c5940f type:MM_ANONPAGES val:206
978.3071421 BUG: non-zero pgtables_bytes on freeing mm: 69632
[+0.0116951 BUG: Bad rss-counter state mm:0000000076c5940f type:MM FILEPAGES val: 402

crash b I see:

Nov 7 16:44:19 lima-myalma92 kernel: BUG: workqueue lockup - pool cpus=5 node=0 flags=0x0 nice=0 stuck for 2164s!
Nov 7 16:44:19 lima-myalma92 kernel: Showing busy workqueues and worker pools:
Nov 7 16:44:19 lima-myalma92 kernel: workqueue events: flags=0x0
Nov 7 16:44:19 lima-myalma92 kernel: pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Nov 7 16:44:19 lima-myalma92 kernel: pending: drm_fb_helper_damage_work [drm_kms_helper]
Nov 7 16:44:19 lima-myalma92 kernel: workqueue mm_percpu_wq: flags=0x8
Nov 7 16:44:19 lima-myalma92 kernel: pwq 10: cpus=5 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
Nov 7 16:44:19 lima-myalma92 kernel: pending: vmstat_update

for the last run with 6.6.1 it all completed fine and looked great, but then I got:
```
[emoyse@lima-alma9661c6 tmp]$ ls
bash: /usr/bin/ls: Input/output error
```
And in the display I see:

wdormann commented 12 months ago

FWIW, I've added some test results and comments here: https://github.com/utmapp/UTM/issues/4840#issuecomment-1816886227

I've not ruled out that there is some issue with the macOS filesystem/hypervisor layer, but I've only seen corruption with a Linux VM, and not macOS or Windows doing the exact same thing, from the exact same VM disk backing. What is interesting to me is that if I take the exact same disk and reformat it as APFS instead of ExFAT, Linux 6.5.6 or 6.4.15 will not experience disk corruption. My theory is that given an unfortunate combination of speed/latency/something-else for disk backing, a Linux VM might experience disk corruption.

AkihiroSuda commented 12 months ago

My theory is that given an unfortunate combination of speed/latency/something-else for disk backing, a Linux VM might experience disk corruption.

Could you submit your insight to Apple? Probably via https://www.apple.com/feedback/macos.html

wdormann commented 12 months ago

I have, just to hedge my bets. However, if Windows, macOS, and I just recently tested FreeBSD, all work flawlessly under the exact same workload, using the same host disk backing, and only Linux has a problem, I'd say that this is a Linux problem. Not Apple.

afbjorklund commented 12 months ago

I can trigger filesystem corruption if my external disk is formatted with ExFAT

Oh, so that might be why it is mostly affecting external disks ? Did people forget to (re-)format them before using ?

EDIT: no, not so simple

"I create a separate APFS (Case-sensitive) Volume,"

EdwardMoyse commented 11 months ago

I can trigger filesystem corruption if my external disk is formatted with ExFAT

Oh, so that might be why it is mostly affecting external disks ? Did people forget to (re-)format them before using ?

EDIT: no, no so simple

"I create a separate APFS (Case-sensitive) Volume,"

And for me, I'm not using external (to the VM) disks any more - if you look at the table I posted here you will see that in the Where column, I'm mostly using /tmp to work in i.e. completely inside the VM. Using an external disk might provoke the corruption earlier, but it's certainly not the only route to it (though later kernels seem quite a bit more stable).

hasan4791 commented 11 months ago

In my case it occurs with internal disk nd very frequent on fedora images. Just create fedora vm and do dnf update, corruption happens immediately. btrfs scrub start /

EDIT: vz in my case

wdormann commented 11 months ago

Using an external disk might provoke the corruption earlier, but it's certainly not the only route to it (though later kernels seem quite a bit more stable).

I don't recall if I mentioned it here, but through eliminating variables I was able to pinpoint a configuration for a likely-to-corrupt-older-Linux-kernels situation, and that is having the VM hosted on an ExFAT-formatted partition (which just happens to be on an external disk for me). Based on how macOS/APFS works, I don't think it's even possible for me to test how ExFAT might perform on my internal disk. At least not without major reconfiguration of my system drive.

If others are able to reproduce the disk corruption without relying on ExFAT at the host level, that at least helps eliminate the ExFAT-layer possibility of where the problem lies. At least for me, I've been able to avoid the problem by reformatting my external disk to APFS, as that seems to tweak at least one of the required variables to see this bug happen. At least if the Linux kernel version is new enough.

At a conceptual level, it is indeed possible that Linux may be doing nothing wrong at all. In other words, it could be possible that Linux just happens to be unlucky enough to express the disk usage patterns that can trigger a bug that presents symptoms as a corrupted (BTRFS in my case) file system. But I suspect that being able to positively acknowledge the difference between a somewhat unlikely to see Linux data corruption bug and a bug at the macOS hypervisor / storage level is probably beyond my skill set.

wdormann commented 11 months ago

Ok, just to throw a wrench into the works, I did notice my FreeBSD VM eventually experiencing disk corruption, but only after about a day or so of running the stress test. As opposed to the minute or two that it takes for Linux to corrupt itself.

The same VM clone but running from an APFS filesystem seems fine:

mbentley commented 11 months ago

So it seems like there are a lot of references to people mentioning issues related to external disks and non-APFS filesystems. I am using the internal disk on my m2 mini with the default APFS filesystem and I've experienced disk corruption once but haven't specifically been able to force it to be reproduced but I haven't tried very hard to be honest but I did want to point out that maybe external disks and other filesystems may not be the specific cause but may just be easier to trigger compared to internal APFS.

I run Debian Bookworm and after repairing the filesystem with a fsck I did also upgrade my kernel from linux-image-cloud-arm64 6.1.55-1 to 6.5.3-1~bpo12+1 in backports.

afbjorklund commented 11 months ago

The above table also lists corrupting when running with qemu/hvf, so it might not even be unique to vz...

EdwardMoyse commented 11 months ago

It is not unique to vz, and it is not unique to external disks.

With Almalinux 9.2 + kernel 6.6.2-1 I just got corruption with sudo yum update -y

:-(

EdwardMoyse commented 11 months ago

Okay, I updated the title and the original comment to hopefully clarify that this is a problem with every conceivable permutation of lima.

Unfortunately for me lima is completely unusable at the moment, and so for the moment I'm giving up.

wpiekutowski commented 11 months ago

I can reproduce this with 2 methods: stress-ng --iomix 4 (for filesystems with data checksums) and parallel cp of big files and then sha256sum *. Details: https://github.com/utmapp/UTM/issues/4840#issuecomment-1821561359

Are you able to reproduce this as well?

afbjorklund commented 11 months ago

Okay, I updated the title and the original comment to hopefully clarify that this is a problem with every conceivable permutation of lima.

It still seems to be unique to one operating system and one hardware architecture, though? Maybe even Apple's issue.

EdwardMoyse commented 11 months ago

Okay, I updated the title and the original comment to hopefully clarify that this is a problem with every conceivable permutation of lima.

It still seems to be unique to one operating system and one hardware architecture, though? Maybe even Apple's issue.

Sorry, yes. I was being very single-minded in my statement above! I will rephrase the title.

AkihiroSuda commented 11 months ago

The above table also lists corrupting when running with qemu/hvf, so it might not even be unique to vz...

This issue might be worth reporting to https://gitlab.com/qemu-project/qemu/-/issues too, if the issue is reproducible with bare QEMU (without using Lima)

wdormann commented 11 months ago

At the risk of further fragmentation of the discussion of this issue, but at the potential benefit of getting the right eyeballs, I've filed: https://gitlab.com/qemu-project/qemu/-/issues/1997

(i.e., yes this can be reproduced with QEMU, as opposed to the Apple Hypervisor Framework)

AkihiroSuda commented 11 months ago

This may fix the issue for vz:

https://github.com/lima-vm/lima/pull/2026

( Thanks to @wpiekutowski https://github.com/utmapp/UTM/issues/4840#issuecomment-1824340975 @wdormann https://github.com/utmapp/UTM/issues/4840#issuecomment-1824542732 )

EdwardMoyse commented 11 months ago

Oh wow - I've run my test twice with the patched version of lima and no corruption or crashes! From reading the ticket, it's more a workaround than a complete fix, but I'll happily take it! Thanks @AkihiroSuda

lima-vm / lima

VM disk corruption with Apple Silicon #1957

Description