Closed by yu-re-ka 4 months ago
Copied my report over from there:
I ran into this as well, without running Lix anywhere (yet) - nix (Nix) 2.18.2
13:14 <flokli> tpw_rules: yuka: hmh, did a nixos update to latest master, and the graphical session doesn't come up. gdm crashes X, eglinfo shows SIGILL in loader_bind_extensions.
13:14 <flokli> running sway as root works
[…]
13:15 <yuka> quick check: "for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done"
13:15 <yuka> if any of the paths has a significantly lower entropy than the others, your store is haunted
13:15 <yuka> if this is the case it would be tremendously useful because it means this issue is not lix specific
13:16 <flokli> for i in /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so ; do cat $i | xz -9 | wc -c ; done
13:16 <flokli> 3175560
13:16 <flokli> 3039308
13:16 <flokli> 3176884
13:16 <flokli> 2513936
13:16 <yuka> yeah that looks suspicious
13:16 <yuka> let me guess, the one with 2.5M is the one referenced by your current system?
13:17 <yuka> (add a "echo $i" in the loop to find out which one it is)
13:17 <flokli> yes
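The quick check from the chat log can be wrapped so that each size is printed next to its path, which is what the "echo $i" hint is after. This is only a sketch: the function name and the `COMPRESS` override are made up for illustration, and it assumes `xz` is installed.

```shell
# Print "<compressed size> <path>" for each existing file, smallest first.
# COMPRESS is a hypothetical override knob for this sketch; it defaults to
# the `xz -9` used in the chat log.
compressed_sizes() {
    local f size
    for f in "$@"; do
        [ -f "$f" ] || continue   # skip globs that matched nothing
        size=$(${COMPRESS:-xz -9} < "$f" | wc -c | tr -d ' ')
        printf '%s %s\n' "$size" "$f"
    done | sort -n
}
```

Usage would look like `compressed_sizes /nix/store/*-mesa-24.1.0-drivers/lib/dri/apple_dri.so`; a copy that sorts well below its siblings is the one worth inspecting.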
I cannot confirm 100% that it did indeed get built on this machine rather than on another aarch64 machine, as I have a bunch of remote builders configured, but it definitely doesn't seem Lix-specific.
Here are two more manifestations with CppNix (not 100% sure which version) from a few months ago: https://github.com/tpwrules/nixos-apple-silicon/issues/156 (along with one corrupted file)
Is the bad region similar? I try to bang on Mesa a little bit after each release but have not managed to personally replicate the problem on my 64GB M1 Max with ext4. I have had one or two non-deterministic builds but haven't captured a good and a bad build to compare because I didn't realize the necessity at the time.
Help tracking/correlating this would be greatly appreciated. I don't think I've heard of it happening with Mesas labeled as 23, but it has evidently persisted across a couple Asahi patch releases.
Here's my cursed store path:
v8z97d2vgyc1zn5bh5mwmywk5dvsarzs-mesa-24.1.0-drivers.tar.gz
That's been with 4fac534b775aa0c40611257fa19ab8ab3243f4dc and nixpkgs 2ec060b94ebd81598603bb5ea49455e255928f9c.
Build log:
fyi our one bad copy has the zeroed region aligned with a section (but not aligned to a page), if that's helpful. you can throw yours in ghidra and check if it's the same, perhaps?
repeating what i said on the other thread: this will likely be possible to reproduce consistently if we have a build directory from a bad copy of mesa with a verbose build log. you can repro with nix build --rebuild in a loop vs a good copy with added load with stress.
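A rough sketch of the "rebuild in a loop" idea: a small helper that reruns a command until it first fails. The helper name is hypothetical, and it assumes the rebuild command exits non-zero when the fresh output differs from the existing store path.

```shell
# Run a command up to N times; stop at the first failure and report
# which attempt failed. Returns 1 on failure, 0 if every attempt passed.
run_until_failure() {
    local max=$1 i=1
    shift
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "attempt $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max attempts"
}
```

With load running in the background (for example `stress --cpu 8 &`, where the worker count is a guess), something like `run_until_failure 20 nix build --rebuild INSTALLABLE` would then stop at the first rebuild that does not match, with INSTALLABLE standing in for whatever Mesa derivation is under test.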
I was hoping someone is gonna do the correlation with the other data for me 😄 not sure I'll get to debug this too much more in the next few days.
fwiw, 00590000 through 00682000 are zeroed, and loader_bind_extensions is right in the middle of that.
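For anyone who wants to check the same offsets in their own copy without opening ghidra, here is a minimal sketch that counts non-zero bytes in a byte range. The function name is made up, and treating the end offset as exclusive is an assumption about how the range above was meant.

```shell
# Count the non-zero bytes in the half-open byte range [start, end) of a file.
# Offsets may be decimal or 0x-prefixed hex. A result of 0 means the whole
# range is zero-filled.
nonzero_bytes_in_range() {
    local file=$1 start=$(($2)) end=$(($3))
    tail -c +"$((start + 1))" "$file" \
        | head -c "$((end - start))" \
        | tr -d '\000' \
        | wc -c \
        | tr -d ' '
}
```

For the region reported here that would be `nonzero_bytes_in_range apple_dri.so 0x590000 0x682000`, run once against a suspect copy and once against a known-good one.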
I think I'm having this same problem on my musl system. In that case, Weston just segfaults; presumably musl's dynamic linker is less resilient to this than glibc's?
Working mesa:
Dynamic section at offset 0x1a993b8 contains 43 entries:
Tag Type Name/Value
0x0000000000000001 (NEEDED) Shared library: [libglapi.so.0]
0x0000000000000001 (NEEDED) Shared library: [libdrm.so.2]
0x0000000000000001 (NEEDED) Shared library: [libLLVM-17.so]
0x0000000000000001 (NEEDED) Shared library: [libexpat.so.1]
0x0000000000000001 (NEEDED) Shared library: [libz.so.1]
0x0000000000000001 (NEEDED) Shared library: [libzstd.so.1]
0x0000000000000001 (NEEDED) Shared library: [libxcb-dri3.so.0]
0x0000000000000001 (NEEDED) Shared library: [libsensors.so.5]
0x0000000000000001 (NEEDED) Shared library: [libdrm_radeon.so.1]
0x0000000000000001 (NEEDED) Shared library: [libelf.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdrm_amdgpu.so.1]
0x0000000000000001 (NEEDED) Shared library: [libdrm_nouveau.so.2]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so]
0x000000000000000e (SONAME) Library soname: [libgallium_dri.so]
0x000000000000001d (RUNPATH) Library runpath: [/nix/store/1y3ylzq54isc3gv3c9z3wzmsn9zhkava-mesa-24.0.5/lib:/nix/store/zj2ngf5vys1q1vighya96509pq2x61id-llvm-17.0.6-lib/lib:/nix/store/sa9hg23bpay7hcj4356hp89k292g70pv-expat-2.6.2/lib:/nix/store/yp1yccxyw8vybxly9ilnc4jivg694p5x-zlib-1.3.1/lib:/nix/store/qnni1gp9lw0gkk3ayx4jcgphk9jgzr12-libxcb-1.16/lib:/nix/store/pkvyh5s1yh1p6qygzm8vkriyx5xf549q-zstd-1.5.6/lib:/nix/store/gaar581489n42j30dgc7bfncd32swq4p-lm-sensors-3.6.0/lib:/nix/store/v3g582x5c85xipldx4r2d1qfyyv54ahs-elfutils-0.191/lib:/nix/store/kwfvy4wankdca4x1cfk96w3671mkx6cl-libdrm-2.4.120/lib:/nix/store/98jcl1340c5jhxd3qw6014b12ij7395v-musl-1.2.3/lib:/nix/store/8kyc73l5j7y2k7b47ls3lzpx86i4z93c-gcc-13.2.0-lib/lib:/nix/store/y7njb653ncgil58wgwk6brj28f7q6y6v-mesa-24.0.5-drivers/lib:/nix/store/z3d3vy8ssi8790xsk7awfqigmnv8idgr-vulkan-loader-1.3.280.0/lib]
0x000000000000000c (INIT) 0xf0d20
0x000000000000000d (FINI) 0x118b454
0x0000000000000019 (INIT_ARRAY) 0x19c8d98
0x000000000000001b (INIT_ARRAYSZ) 320 (bytes)
0x000000000000001a (FINI_ARRAY) 0x19c8ed8
0x000000000000001c (FINI_ARRAYSZ) 8 (bytes)
0x0000000000000004 (HASH) 0x1cb5488
0x000000006ffffef5 (GNU_HASH) 0x17b8
0x0000000000000005 (STRTAB) 0x1cc0000
0x0000000000000006 (SYMTAB) 0x1978
0x000000000000000a (STRSZ) 22448 (bytes)
0x000000000000000b (SYMENT) 24 (bytes)
0x0000000000000003 (PLTGOT) 0x1a996a8
0x0000000000000002 (PLTRELSZ) 18384 (bytes)
0x0000000000000014 (PLTREL) RELA
0x0000000000000017 (JMPREL) 0xec550
0x0000000000000007 (RELA) 0xc2a0
0x0000000000000008 (RELASZ) 918192 (bytes)
0x0000000000000009 (RELAENT) 24 (bytes)
0x000000000000001e (FLAGS) BIND_NOW
0x000000006ffffffb (FLAGS_1) Flags: NOW
0x000000006ffffffe (VERNEED) 0xc220
0x000000006fffffff (VERNEEDNUM) 3
0x000000006ffffff0 (VERSYM) 0xbb6a
0x000000006ffffff9 (RELACOUNT) 37790
0x0000000000000000 (NULL) 0x0
Bad mesa:
Dynamic section at offset 0x1d30000 contains 2 entries:
Tag Type Name/Value
0x000000000000001d (RUNPATH) Library runpath: [:/nix/store/fl1wirq9vp7jh985jqfj7bn9ynss49vk-mesa-24.0.5-drivers/lib:/nix/store/cl0mgj27variyndyij4f2jf1fp7jhxx9-vulkan-loader-1.3.280.0/lib]
0x0000000000000000 (NULL) 0x0
More info:
Nix 2.22 btrfs
I'm building on NixOS on Apple Silicon, but this is a VM image — it's running in QEMU and the VM system doesn't use nixos-apple-silicon at all.
I'm also not using the standard asahi kernel on the host — I'm running kvm-arm64/nv-6.8-nv2-only from https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/.
Store path too big to upload to GitHub.
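The difference between the two dumps above (43 dynamic entries versus 2, and no NEEDED entries in the bad one) can be checked mechanically. A small sketch, assuming the output comes from binutils `readelf -d` in the format shown; both function names are hypothetical.

```shell
# Read `readelf -d` output on stdin and print the number of dynamic
# entries it reports in its header line.
dynamic_entry_count() {
    sed -n 's/^Dynamic section at offset 0x[0-9a-f]* contains \([0-9]*\) entr.*/\1/p'
}

# Read `readelf -d` output on stdin and print how many NEEDED entries it has.
needed_count() {
    grep -c '(NEEDED)'
}
```

For example, `readelf -d libgallium_dri.so | dynamic_entry_count` on each store path; a driver that reports only a couple of entries and zero NEEDED libraries is a strong hint that it is a damaged copy.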
hi @marcan, this may interest you by this point, given we have multiple people hitting it.
@alyssais Just to be clear, the VM system is running on a machine which uses NixOS on Apple Silicon, but with a custom kernel? Are you building the Asahi Mesa or the standard nixpkgs one? The VM uses the standard NixOS kernel?
Host machine uses NixOS on Apple Silicon with the aforementioned custom kernel, which is based on an older version of the Asahi kernel. VM uses the standard NixOS kernel with some config modifications, and standard mesa. Asahi mesa is not involved in the system at all — I just use simpledrm on the host.
Edit: I'm also using a fairly standard NixOS kernel config on the host, as opposed to the custom one from nixos-apple-silicon.
puck thinks this might be a patchelf bug which would explain it only hitting mesa and on NixOS.
so, my current working theory is that this is not patchelf, but a repeat of a previous issue; tho i'm not entirely sure why it doesn't affect x86_64, as it should have the same bug.
Since NixOS/nixpkgs#207101, strip is parallelised. This already turned out to be a problem, as strip running multiple processes simultaneously on the same file has caused issues on aarch64 before; see NixOS/nixpkgs#246147.
As I was trying to debug this, I built the same store path on a Hetzner aarch64 VPS, as well as locally using qemu-user. What I noticed was that the strtab of the natively-built mesa's dri had size 8, rather than size 0x165199 when built under qemu-user. The rest of the ELF was identical.
After a bit more digging, it turns out that mesa's dri driver is hard-linked to every single {foo}_dri.so path. This is eventually deduplicated, but only after running strip. Which means that, in effect, it was running strip concurrently over the same file, once again repeating issue 246147. In my case, this showed up as the strtab being truncated, but I could imagine this showing up differently for other people (e.g. a part of a section missing, but always section-aligned, because that's how it's written by binutils).
This issue is exacerbated by the fact that strip errors aren't printed as long as at least one file has been successfully stripped, hiding the myriad of {foo}_dri.so[.eh_frame]: invalid operation and similar errors.
I believe that moving the symlink deduping logic from postFixup to preFixup is likely to solve this issue (and, of course, fixing this properly in the strip hook); but as I don't have a real aarch64 device to test with, I leave the implementation of this suggestion to others :)
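To see the hard-linking described above on an actual output, one can list every file whose link count is greater than one. A minimal sketch with a made-up function name, using only `find -links` and `ls -i`; it assumes file names contain no newlines.

```shell
# List regular files under a directory that share an inode with at least one
# other name (link count > 1), as "<inode> <path>", sorted so that names
# pointing at the same inode end up next to each other.
hardlink_groups() {
    find "$1" -type f -links +1 -exec ls -i {} + | sort -n
}
```

Running `hardlink_groups` on the `lib/dri` directory of a Mesa drivers output before the deduplication step should show all the {foo}_dri.so names with one shared inode number, which is exactly the situation in which several strip processes end up writing to the same file.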
I've been pointed at https://github.com/void-linux/void-packages/blob/master/srcpkgs/mesa/patches/megadriver-symlinks.patch - which is likely to be a better solution here; just patching the megadriver installer to symlink, rather than hardlink.
It looks like Mesa uses hard links for a reason — to avoid installing the megadriver under its original non-driver-specific name, and I think it makes sense not to change that, so based on my current understanding I'd prefer moving our symlinkification to preFixup rather than applying Void's patch.
@dcbaker do you have any thoughts here, ooc?
While we're waiting for agreement on how to proceed to fix this, does anyone have instructions on how someone can work around this if they run into it?
I believe I ran into this last night, and my nix-fu is not strong enough to know how to force rebuilding mesa... (but I'm eager to learn!)
I think the fix here is to do something (this is evil btw) like:
nix path-info --derivation /nix/store/gsdfkhlsdfglhk-mesa
(prints some derivation path)
nix store delete --ignore-liveness STORE-PATH-DRV
nix build STORE-PATH-DRV
I can not recommend doing anything with --ignore-liveness, it is too easy to fuck up your entire system with that.
Instead, I suggest adding an overlay that makes a meaningless change to the derivation (but changes the derivation hash):
nixpkgs.overlays = [
  (final: prev: {
    mesa-asahi-edge = prev.mesa-asahi-edge.overrideAttrs (oldAttrs: {
      src = lib.cleanSource oldAttrs.src;
    });
  })
];
Wish I'd tried that approach first lol
I was attempting a few things with that flag and totally borked my install! Luckily booting a rescue image is easy with NixOS and it is fixed now. I wiped my whole nix store and started fresh, and mesa built properly this time.
Has an upstream nixpkgs fix to rearrange the mesa derivation been filed?
Can the strip hook be taught to ignore symlinks and only access unique inodes?
Some late night bash(1)ing to solve the latter: https://github.com/NixOS/nixpkgs/pull/314175
Will actually test over the weekend that it works with nixpkgs; I still haven't managed to replicate the haunting on my machine.
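As an illustration of the "only access unique inodes" idea, and not a description of what the linked PR actually does, here is a sketch that yields one path per distinct inode and skips symlinks. The function name is hypothetical, and it assumes all files live on one filesystem and have no newlines in their names.

```shell
# Print one path per distinct inode among the regular files under a
# directory. Symlinks are not matched by `-type f`, and additional hard
# links to an inode that was already seen are dropped.
unique_inode_paths() {
    find "$1" -type f -exec ls -i {} + \
        | awk '!seen[$1]++ { sub(/^[ \t]*[0-9]+[ \t]+/, ""); print }'
}
```

Feeding this list to a parallel strip run, instead of the raw file list, would mean that no two processes are handed the same underlying file.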
@lf- nix-store --repair-path is the more "canonical" way to do that, I believe?
Any 'repair' related commands only do something if the hash recorded in the nix store db does not match the path contents on disk. But in this case the path is not 'corrupted' in the nix-store sense, as the disk contents match the nix store db.
A fix for the underlying issue has been merged into nixpkgs and will take a week or two to make it to a release here. I am encouraged to wait for that instead of attempting to modify our Mesa build as the rate of occurrence does not seem too high.
I will keep track and close this issue once that release happens. Thanks very much to all again for debugging.
Huh, I could have sworn it remade the path unconditionally, but apparently not. Sorry for the noise.
I was having this issue on a fresh install with the latest ISO. I tried the older 2024-04-20 release and all is well (at least after configuring hardware.asahi and hardware.opengl). Looking forward to hearing an update on this issue.
The fixes have been merged to nixpkgs, but it'll take a while to land. Check https://nixpk.gs/pr-tracker.html?pr=314175 and https://nixpk.gs/pr-tracker.html?pr=314541 for when the fix ends up in the unstable and 24.05 channels respectively.
unstable is there already
This fix is in unstable and also the latest release. Stable will be another week or two, but I'm considering this fixed.
@alyssais sorry for missing this, I was on a github hiatus. The hardlinks are basically about space savings, and there was this thought back in the day that a distro might want to update a single driver if that meant that they wouldn't have to update the entire mesa package (something that in reality only Debian could do).
@dcbaker so is there any reason for them not to be symlinks to a megadriver.so?
This is now the third person describing an issue where, after a rebuild, graphics acceleration does not work. After debugging, it turns out the mesa driver has some regions zeroed out.
This was originally reported in the Lix issue tracker since the first two known cases were with Lix; however, a third case has appeared which was running CppNix.
A hypothesis is that high load/contention on the builder would increase the likelihood of this issue manifesting. When rebuilding an equivalent mesa derivation afterwards it always turned out fine. nixos-apple-silicon users will usually build the kernel and mesa at the same time. Currently the fault has not been isolated. It could be in the kernel, in the toolchain, in the mesa build system...
https://git.lix.systems/lix-project/lix/issues/248