mesa store path is haunted

yu-re-ka commented 5 months ago

This is now the third person describing an issue where, after a rebuild, graphics acceleration does not work. After debugging, it turns out the mesa driver has some regions zeroed out.

This was originally reported in the Lix issue tracker since the first two known cases were with Lix, however a third case has appeared which was running CppNix.

A hypothesis is that high load/contention on the builder would increase the likelihood of this issue manifesting. When rebuilding an equivalent mesa derivation afterwards it always turned out fine. nixos-apple-silicon users will usually build the kernel and mesa at the same time. Currently the fault has not been isolated. It could be in the kernel, in the toolchain, in the mesa build system...

https://git.lix.systems/lix-project/lix/issues/248

flokli commented 5 months ago

Copied my report over from there:

I ran into this as well, without running Lix anywhere (yet) - nix (Nix) 2.18.2

13:14 <flokli> tpw_rules: yuka: hmh, did a nixos update to latest master, and the graphical session doesn't come up. gdm crashes X, eglinfo shows SIGILL in loader_bind_extensions.
13:14 <flokli> running sway as root works
[…]
13:15 <yuka> quick check: "for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done"
13:15 <yuka> if any of the paths has a significantly lower entropy than the others, your store is haunted
13:15 <yuka> if this is the case it would be tremendously useful because it means this issue is not lix specific
13:16 <flokli> for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done 
13:16 <flokli> 3175560
13:16 <flokli> 3039308
13:16 <flokli> 3176884
13:16 <flokli> 2513936
13:16 <yuka> yeah that looks suspicious
13:16 <yuka> let me guess, the one with 2.5M is the one referenced by your current system?
13:17 <yuka> (add a "echo $i" in the loop to find out which one it is)
13:17 <flokli> yes

I cannot confirm 100% that it did indeed get built on this machine rather than on another aarch64 machine, as I have a bunch of remote builders configured, but it definitely doesn't seem Lix-specific.
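For anyone repeating the check from the chat log, here is a minimal sketch of the same loop with the path printed next to each size (the `echo $i` suggestion), sorted so the suspicious copy comes first. The function name `compressed_sizes` and the `COMPRESSOR` override are made up for this sketch; the glob is the one from the log and will need adjusting for other mesa versions or drivers.

```bash
#!/usr/bin/env bash
# Print "<compressed bytes> <path>" for each file given, smallest first.
# COMPRESSOR is a hypothetical knob for this sketch; it defaults to the
# "xz -9" used in the chat log.
compressed_sizes() {
  local f
  for f in "$@"; do
    [ -f "$f" ] || continue   # skip a glob that matched nothing
    printf '%s %s\n' "$(${COMPRESSOR:-xz -9} < "$f" | wc -c | tr -d ' ')" "$f"
  done | sort -n
}

compressed_sizes /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so
```

A copy whose compressed size is clearly smaller than its siblings is the one to compare against the store path referenced by the current system.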

tpwrules commented 5 months ago

Here are two more manifestations with CppNix (not 100% sure which version) from a few months ago: https://github.com/tpwrules/nixos-apple-silicon/issues/156 (along with one corrupted file)

Is the bad region similar? I try to bang on Mesa a little bit after each release but have not managed to personally replicate the problem on my 64GB M1 Max with ext4. I have had one or two non-deterministic builds but haven't captured a good and a bad build to compare because I didn't realize the necessity at the time.

Help tracking/correlating this would be greatly appreciated. I don't think I've heard of it happening with Mesas labeled as 23, but it has evidently persisted across a couple Asahi patch releases.

flokli commented 5 months ago

Here's my cursed store path:

v8z97d2vgyc1zn5bh5mwmywk5dvsarzs-mesa-24.1.0-drivers.tar.gz

That's been with 4fac534b775aa0c40611257fa19ab8ab3243f4dc and nixpkgs 2ec060b94ebd81598603bb5ea49455e255928f9c.

Build log:

v8z97d2vgyc1zn5bh5mwmywk5dvsarzs-mesa-24.1.0-drivers.log.gz

lf- commented 5 months ago

fyi our one bad copy has the zeroed region aligned with a section (but not aligned to a page), if that's helpful. you can throw yours in ghidra and check if it's the same, perhaps?

repeating what i said on the other thread: this will likely be possible to reproduce consistently if we have a build directory from a bad copy of mesa with a verbose build log. you can repro with nix build --rebuild in a loop vs a good copy, with added load from stress.
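A rough sketch of such a loop, assuming `nix build --rebuild` exits non-zero when the rebuilt output differs from the existing store path. The helper name `repeat_until_failure`, the `.#mesa` installable and the `stress` flags are placeholders for illustration, not something anyone in this thread ran.

```bash
#!/usr/bin/env bash
# Run a command up to MAX times and stop at the first failing attempt.
repeat_until_failure() {
  local max=$1 i
  shift
  for (( i = 1; i <= max; i++ )); do
    if ! "$@"; then
      echo "attempt $i failed"
      return 1
    fi
  done
  echo "no failure in $max attempts"
}

# Example, not executed here (installable and load settings are assumptions):
#   stress --cpu 8 --io 4 &            # background load on the builder
#   repeat_until_failure 50 nix build --rebuild .#mesa
#   kill %1
```

Keeping the failing build directory (for example with `--keep-failed`) would be the point of the exercise, since a good and a bad build are needed for comparison.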

flokli commented 5 months ago

I was hoping someone is gonna do the correlation with the other data for me 😄 not sure I'll get to debug this too much more in the next few days.

puckipedia commented 5 months ago

fwiw, 00590000 through 00682000 are zeroed, and loader_bind_extensions is right in the middle of that.
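To look for a zeroed range like this in another copy without opening it in ghidra, a block-wise scan is enough for a first pass. This is only a sketch: `zero_ranges` is a made-up name, the 4096-byte block size is an assumption, and because it only reports whole all-zero blocks it will round the edges of a region that is section-aligned but not block-aligned. It also runs one `dd` per block, so it is slow on a large driver.

```bash
#!/usr/bin/env bash
# Print "<start> through <end>" (hex offsets) for each run of all-zero blocks.
zero_ranges() {
  local file=$1 bs=${2:-4096}
  local size nblocks i start=-1 nonzero
  size=$(wc -c < "$file")
  nblocks=$(( (size + bs - 1) / bs ))
  for (( i = 0; i < nblocks; i++ )); do
    # Count the non-NUL bytes in block i.
    nonzero=$(dd if="$file" bs="$bs" skip="$i" count=1 2>/dev/null | tr -d '\0' | wc -c)
    if (( nonzero == 0 )); then
      if (( start < 0 )); then start=$i; fi
    elif (( start >= 0 )); then
      printf '%08x through %08x\n' $(( start * bs )) $(( i * bs ))
      start=-1
    fi
  done
  # A zero run that extends to the end of the file.
  if (( start >= 0 )); then
    printf '%08x through %08x\n' $(( start * bs )) "$size"
  fi
}

# Example, not executed here (the path is a placeholder):
#   zero_ranges /nix/store/...-mesa-24.1.0-drivers/lib/dri/apple_dri.so
```

Healthy ELF files contain some zero padding too, so short runs are expected; a run on the order of a megabyte in the middle of the code is the thing to look for.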

alyssais commented 5 months ago

I think I'm having this same problem on my musl system. In that case, Weston just segfaults — presumably musl's dynamic linker is less resilient to this than glibc's?

Working mesa:

Dynamic section at offset 0x1a993b8 contains 43 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libglapi.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libLLVM-17.so]
 0x0000000000000001 (NEEDED)             Shared library: [libexpat.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libzstd.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libxcb-dri3.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libsensors.so.5]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm_radeon.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libelf.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm_amdgpu.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libdrm_nouveau.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so]
 0x000000000000000e (SONAME)             Library soname: [libgallium_dri.so]
 0x000000000000001d (RUNPATH)            Library runpath: [/nix/store/1y3ylzq54isc3gv3c9z3wzmsn9zhkava-mesa-24.0.5/lib:/nix/store/zj2ngf5vys1q1vighya96509pq2x61id-llvm-17.0.6-lib/lib:/nix/store/sa9hg23bpay7hcj4356hp89k292g70pv-expat-2.6.2/lib:/nix/store/yp1yccxyw8vybxly9ilnc4jivg694p5x-zlib-1.3.1/lib:/nix/store/qnni1gp9lw0gkk3ayx4jcgphk9jgzr12-libxcb-1.16/lib:/nix/store/pkvyh5s1yh1p6qygzm8vkriyx5xf549q-zstd-1.5.6/lib:/nix/store/gaar581489n42j30dgc7bfncd32swq4p-lm-sensors-3.6.0/lib:/nix/store/v3g582x5c85xipldx4r2d1qfyyv54ahs-elfutils-0.191/lib:/nix/store/kwfvy4wankdca4x1cfk96w3671mkx6cl-libdrm-2.4.120/lib:/nix/store/98jcl1340c5jhxd3qw6014b12ij7395v-musl-1.2.3/lib:/nix/store/8kyc73l5j7y2k7b47ls3lzpx86i4z93c-gcc-13.2.0-lib/lib:/nix/store/y7njb653ncgil58wgwk6brj28f7q6y6v-mesa-24.0.5-drivers/lib:/nix/store/z3d3vy8ssi8790xsk7awfqigmnv8idgr-vulkan-loader-1.3.280.0/lib]
 0x000000000000000c (INIT)               0xf0d20
 0x000000000000000d (FINI)               0x118b454
 0x0000000000000019 (INIT_ARRAY)         0x19c8d98
 0x000000000000001b (INIT_ARRAYSZ)       320 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x19c8ed8
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000004 (HASH)               0x1cb5488
 0x000000006ffffef5 (GNU_HASH)           0x17b8
 0x0000000000000005 (STRTAB)             0x1cc0000
 0x0000000000000006 (SYMTAB)             0x1978
 0x000000000000000a (STRSZ)              22448 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x1a996a8
 0x0000000000000002 (PLTRELSZ)           18384 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0xec550
 0x0000000000000007 (RELA)               0xc2a0
 0x0000000000000008 (RELASZ)             918192 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000000000001e (FLAGS)              BIND_NOW
 0x000000006ffffffb (FLAGS_1)            Flags: NOW
 0x000000006ffffffe (VERNEED)            0xc220
 0x000000006fffffff (VERNEEDNUM)         3
 0x000000006ffffff0 (VERSYM)             0xbb6a
 0x000000006ffffff9 (RELACOUNT)          37790
 0x0000000000000000 (NULL)               0x0

Bad mesa:

Dynamic section at offset 0x1d30000 contains 2 entries:
  Tag        Type                         Name/Value
 0x000000000000001d (RUNPATH)            Library runpath: [:/nix/store/fl1wirq9vp7jh985jqfj7bn9ynss49vk-mesa-24.0.5-drivers/lib:/nix/store/cl0mgj27variyndyij4f2jf1fp7jhxx9-vulkan-loader-1.3.280.0/lib]
 0x0000000000000000 (NULL)               0x0

More info:

Nix 2.22 btrfs

I'm building on NixOS on Apple Silicon, but this is a VM image — it's running in QEMU and the VM system doesn't use nixos-apple-silicon at all.

I'm also not using the standard asahi kernel on the host — I'm running kvm-arm64/nv-6.8-nv2-only from https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/.

Store path too big to upload to GitHub.
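The difference between the two `readelf -d` dumps above (43 entries with 15 NEEDED libraries versus 2 entries with none) is easy to check mechanically. A minimal sketch, with `dynamic_summary` as a made-up helper that only parses text in the format shown above:

```bash
#!/usr/bin/env bash
# Summarise `readelf -d` output read from stdin: total dynamic entries and
# how many of them are NEEDED libraries.
dynamic_summary() {
  awk '
    /^Dynamic section at offset/ {
      for (i = 1; i < NF; i++) if ($i == "contains") entries = $(i + 1)
    }
    /\(NEEDED\)/ { needed++ }
    END { printf "entries=%d needed=%d\n", entries, needed }
  '
}

# Example, not executed here (the path is a placeholder):
#   readelf -d /nix/store/...-mesa-24.0.5-drivers/lib/dri/libgallium_dri.so | dynamic_summary
```

A driver that reports `needed=0` is almost certainly one of the damaged copies, though this is a heuristic and says nothing about damage elsewhere in the file.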

lf- commented 5 months ago

hi @marcan, this may interest you by this point, given we have multiple people hitting it.

tpwrules commented 5 months ago

@alyssais Just to be clear, the VM system is running on a machine which uses NixOS on Apple Silicon, but with a custom kernel? Are you building the Asahi Mesa or the standard nixpkgs one? The VM uses the standard NixOS kernel?

alyssais commented 5 months ago

Host machine uses NixOS on Apple Silicon with the aforementioned custom kernel, which is based on an older version of the Asahi kernel. VM uses the standard NixOS kernel with some config modifications, and standard mesa. Asahi mesa is not involved in the system at all — I just use simpledrm on the host.

Edit: I'm also using a fairly standard NixOS kernel config on the host, as opposed to the custom one from nixos-apple-silicon.

lf- commented 5 months ago

puck thinks this might be a patchelf bug which would explain it only hitting mesa and on NixOS.

puckipedia commented 5 months ago

so, my current working theory is that this is not patchelf, but a repeat of a previous issue; tho i'm not entirely sure why it doesn't affect x86_64, as it should have the same bug.

Since NixOS/nixpkgs#207101, strip is parallelised. This already turned out to be a problem, as multiple strip processes running simultaneously on the same file have caused issues on aarch64 before; see NixOS/nixpkgs#246147. As I was trying to debug this, I built the same store path on a Hetzner aarch64 VPS, as well as locally using qemu-user. What I noticed was that the strtab of the natively built mesa's dri driver had a size of 8, rather than the 0x165199 it had when built under qemu-user. The rest of the ELF was identical. After a bit more digging, it turns out that mesa's dri driver is hard-linked to every single {foo}_dri.so path. This is eventually deduplicated, but only after strip has run, which means that, in effect, several strip processes were running over the same file at once, repeating issue 246147. In my case, this showed up as the strtab being truncated, but I could imagine this showing up differently for other people (e.g. a part of a section missing, but always section-aligned, because that's how it's written by binutils).

This issue is exacerbated by the fact that strip errors aren't printed as long as at least one file has been successfully stripped, hiding the myriad of {foo}_dri.so[.eh_frame]: invalid operation and similar errors.

I believe that moving the symlink deduping logic from postFixup to preFixup is likely to solve this issue (and, of course, fixing this properly in the strip hook); but as I don't have a real aarch64 device to test with, I leave the implementation of this suggestion to others :)
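To see the hard-linking for yourself in a build output, listing every file with more than one link, grouped by inode, is enough. A small sketch (`hardlinked_files` is a made-up name; it assumes paths without newlines):

```bash
#!/usr/bin/env bash
# List "<inode> <path>" for every regular file under DIR that has more than
# one hard link, sorted so that names sharing an inode end up adjacent.
hardlinked_files() {
  find "$1" -type f -links +1 -exec ls -i {} + | sort -n
}

# Example, not executed here (the path is a placeholder):
#   hardlinked_files /nix/store/...-mesa-24.1.0-drivers/lib/dri
```

Every group of lines sharing an inode number is a single file on disk, so a parallel strip that is handed each name separately ends up with several processes rewriting the same data.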

puckipedia commented 5 months ago

I've been pointed at https://github.com/void-linux/void-packages/blob/master/srcpkgs/mesa/patches/megadriver-symlinks.patch - which is likely to be a better solution here; just patching the megadriver installer to symlink, rather than hardlink.

alyssais commented 5 months ago

It looks like Mesa uses hard links for a reason — to avoid installing the megadriver under its original non-driver-specific name, and I think it makes sense not to change that, so based on my current understanding I'd prefer moving our symlinkification to preFixup rather than applying Void's patch.

@dcbaker do you have any thoughts here, ooc?

minego commented 5 months ago

While we're waiting for agreement on how to proceed to fix this, does anyone have instructions on how someone can work around this if they run into it?

I believe I ran into this last night, and my nix-fu is not strong enough to know how to force rebuilding mesa... (but I'm eager to learn!)

lf- commented 5 months ago

> While we're waiting for agreement on how to proceed to fix this, does anyone have instructions on how someone can work around this if they run into it?
>
> I believe I ran into this last night, and my nix-fu is not strong enough to know how to force rebuilding mesa... (but I'm eager to learn!)

I think the fix here is to do something (this is evil btw) like:

nix path-info --derivation /nix/store/gsdfkhlsdfglhk-mesa
(prints some derivation path)
nix store delete --ignore-liveness STORE-PATH-DRV
nix build STORE-PATH-DRV

yu-re-ka commented 5 months ago

I can not recommend doing anything with --ignore-liveness, it is too easy to fuck up your entire system with that.

Instead, I suggest adding an overlay that makes a meaningless change to the derivation (but changes the derivation hash):


nixpkgs.overlays = [
  (final: prev: {
    mesa-asahi-edge = prev.mesa-asahi-edge.overrideAttrs (oldAttrs: {
      src = lib.cleanSource oldAttrs.src;
    });
  })
];

minego commented 5 months ago

Wish I'd tried that approach first lol

I was attempting a few things with that flag and totally borked my install! Luckily booting a rescue image is easy with NixOS and it is fixed now. I wiped my whole nix store and started fresh, and mesa built properly this time.


tpwrules commented 5 months ago

Has an upstream nixpkgs fix to rearrange the mesa derivation been filed?

Can the strip hook be taught to ignore symlinks and only access unique inodes?
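As an illustration of the "unique inodes" idea only (this is not taken from the nixpkgs strip hook or from any PR), the list of files to strip could be reduced to one path per inode before any parallel work is started. `unique_inode_paths` is a made-up name and the sketch assumes paths without newlines:

```bash
#!/usr/bin/env bash
# Print one path per distinct inode among the regular files under DIR, so
# that hard-linked names are only handed to a tool once. Symlinks are
# excluded by -type f.
unique_inode_paths() {
  find "$1" -type f -exec ls -i {} + | sort -n |
    awk '!seen[$1]++ { sub(/^ *[0-9]+ /, ""); print }'
}

# Example, not executed here (the path is a placeholder):
#   unique_inode_paths "$out/lib/dri"
```

Which of the hard-linked names is kept does not matter for stripping, since they all refer to the same data.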

tpwrules commented 5 months ago

Some late night bash(1)ing to solve the latter: https://github.com/NixOS/nixpkgs/pull/314175

Will actually test over the weekend that it works with nixpkgs; I still haven't managed to replicate the haunting on my machine.

lheckemann commented 5 months ago

@lf- nix-store --repair-path is the more "canonical" way to do that, I believe?

yu-re-ka commented 5 months ago

Any 'repair' related commands only do something if the hash recorded in the nix store db does not match the path contents on disk. But in this case the path is not 'corrupted' in the nix-store sense, as the disk contents match the nix store db.

tpwrules commented 5 months ago

A fix for the underlying issue has been merged into nixpkgs and will take a week or two to make it to a release here. I am encouraged to wait for that instead of attempting to modify our Mesa build as the rate of occurrence does not seem too high.

I will keep track and close this issue once that release happens. Thanks very much to all again for debugging.

lheckemann commented 5 months ago

Huh, I could have sworn it remade the path unconditionally, but apparently not. Sorry for the noise.

skylarmb commented 5 months ago

I was having this issue on a fresh install with the latest ISO. I tried the older 2024-04-20 release and all is well (at least after configuring hardware.asahi and hardware.opengl). Looking forward to hearing an update on this issue.

flokli commented 5 months ago

The fixes have been merged to nixpkgs, but it'll take a while to land. Check https://nixpk.gs/pr-tracker.html?pr=314175 and https://nixpk.gs/pr-tracker.html?pr=314541 for when the fix ends up in the unstable and 24.05 channels respectively.

zvolin commented 4 months ago

unstable is there already

tpwrules commented 4 months ago

This fix is in unstable and also the latest release. Stable will be another week or two, but I'm considering this fixed.

dcbaker commented 4 months ago

@alyssais sorry for missing this, I was on a github hiatus. The hardlinks are basically about space savings, and back in the day there was a thought that a distro might want to update a single driver so that they wouldn't have to update the entire mesa package (something that in reality only Debian could do).

alyssais commented 4 months ago

@dcbaker so is there any reason for them not to be symlinks to a megadriver.so?

tpwrules / nixos-apple-silicon

mesa store path is haunted #199