Closed huonw closed 3 months ago
Maybe the psutil upgrade in #20760 would help paper over this, but there's probably some underlying issue... what's causing x86-64 code to end up in a `macosx_11_0_arm64` wheel?
I've made a few changes to the release metadata:
Using #20766 to try to introspect the machine:
`psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl` within `~/.cache/pants`
`2024-04-05T20:21:01-07:00` = `2024-04-06T03:21:01Z`
`2024-04-06T03:10Z`, which correlates well... `whl`
(https://github.com/pantsbuild/pants/actions/runs/8578506645/job/23512431601#step:9:377). `2024-04-05T03:26Z`
), so that doesn't seem relevant.
The logs in the last one are not particularly revealing, so maybe finding the build in question won't be helpful... but also, we don't have much to go on other than that.
oooooof
Is this a Rosetta thing somehow?
Probably, yes.
I did a bunch of tests in #20766, which observed:
- `pip install --no-cache psutil==5.9.0` reproduces the problem consistently (the `.so` objects it builds are x86-64)
- compiling a test C file (`clang test.c -o test`) builds x86-64 by default
- `arch` prints `i386`
- `arch -x86_64 python -m pip install --no-cache psutil==5.9.0 --target /tmp/psutil2` (after doing that, `file /tmp/psutil2/**/*.so` prints `... bundle x86_64`)
- `arch -x86_64 zsh -c 'python ...'`. `arch` this works as expected, and builds arm64 `.so`
objects)
So, this suggests the runners are potentially running all GitHub jobs in x86-64/Rosetta mode by default, for some reason. I couldn't really find anything online, but theories might be:
- `zsh` it'll choose the x86-64 slice (and then `zsh` invoking `python` invoking `pip` invoking `clang` will do the same)?
- `arm64` Mach-O binaries (if one can find the directory with it installed, `file bin/* | grep Mach` can confirm the architecture, I think).
- `./run.sh` script is in x86-64?
I think someone who can poke at the machine live via ssh (and/or knows more about self-hosted runners) might need to take over the investigation here.
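For anyone who does get onto the machine, the `file bin/* | grep Mach` check can also be done without `file`. The following is a minimal sketch, not something from the investigation itself: it reads the Mach-O header directly, using the magic numbers and CPU type values from Apple's `<mach/machine.h>`. The function name `macho_archs` is made up for illustration, and Java class files share the `0xCAFEBABE` magic, so treat results for non-binaries with suspicion.

```python
import struct
import sys
from typing import List

# CPU type values from Apple's <mach/machine.h>.
CPU_TYPES = {
    0x00000007: "i386",
    0x01000007: "x86_64",
    0x0000000C: "arm",
    0x0100000C: "arm64",
}

MH_MAGIC = 0xFEEDFACE      # thin 32-bit Mach-O
MH_MAGIC_64 = 0xFEEDFACF   # thin 64-bit Mach-O
FAT_MAGIC = 0xCAFEBABE     # universal binary, 20-byte arch entries
FAT_MAGIC_64 = 0xCAFEBABF  # universal binary, 32-byte arch entries


def macho_archs(data: bytes) -> List[str]:
    """Return the architectures named in a Mach-O header, or [] if not Mach-O."""
    if len(data) < 8:
        return []
    # Universal ("fat") headers are always stored big-endian.
    magic_be, count = struct.unpack(">II", data[:8])
    if magic_be in (FAT_MAGIC, FAT_MAGIC_64):
        entry_size = 20 if magic_be == FAT_MAGIC else 32
        archs = []
        for index in range(count):
            start = 8 + index * entry_size
            if start + 4 > len(data):
                break  # truncated input, or not really a Mach-O file
            (cputype,) = struct.unpack(">I", data[start:start + 4])
            archs.append(CPU_TYPES.get(cputype, hex(cputype)))
        return archs
    # Thin headers for x86-64 and arm64 are stored little-endian.
    magic_le, cputype = struct.unpack("<II", data[:8])
    if magic_le in (MH_MAGIC, MH_MAGIC_64):
        return [CPU_TYPES.get(cputype, hex(cputype))]
    return []


if __name__ == "__main__":
    # Usage: python macho_archs.py path/to/binary [more paths...]
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(path, macho_archs(handle.read(4096)))
```

Run against the runner's binaries or a built `.so`, a thin x86-64 file would report `['x86_64']` and a universal one would list each slice.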
Key unanswered question: what changed that broke this?
I think #20760 might help this specific instance, but is only papering over the problem: we'll still be running in i386 mode and so any future wheel builds will have the same problem (and there's no guarantee that our testing is correct/doing what we expect)!
Should we run all CI commands under `arch`, to force the arch to arm64? We do set `ARCHFLAGS` in those jobs, maybe something has changed on github's side that overrides that? What happens if we run an older build that did work now? If it fails it means changes on github's side, if it passes, it means we broke something on our end.
> Should we run all CI commands under `arch`, to force the arch to arm64?
I guess that would work, but I'd be a bit concerned. We'd have to remember to add it everywhere (including in other repos, like `scie-pants`) or risk random cache inconsistencies if we miss a place...
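One way to reduce the "miss a place" risk would be a fail-fast guard at the start of each job, rather than relying on every command being wrapped. This is only a sketch of the idea: the script name `check_arch.py` and the function `arch_mismatch` are hypothetical, and it assumes the job's Python runs in the same mode as the rest of the job.

```python
import platform
import sys


def arch_mismatch(expected: str, actual: str) -> str:
    """Return an error message when the architectures differ, else ''."""
    if actual == expected:
        return ""
    return (
        f"expected this job to run as {expected}, "
        f"but the interpreter reports {actual}"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_arch.py arm64
    # platform.machine() reflects the architecture this Python process runs
    # as, so an x86-64 process under Rosetta reports "x86_64".
    problem = arch_mismatch(sys.argv[1], platform.machine())
    if problem:
        sys.exit(problem)  # a non-zero exit fails the CI step
```

A guard like this would not fix anything by itself, but it would turn a silently wrong wheel into a visibly failed job.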
> We do set `ARCHFLAGS` in those jobs
Ah cool, I hadn't noticed that. That does seem to do something locally and in CI ✅
I notice that `scie-pants` doesn't set this in its CI https://github.com/search?q=repo%3Apantsbuild%2Fscie-pants%20ARCHFLAGS&type=code. (It looks like these are the only two repos that use this runner, in the `pants` repo: https://github.com/search?q=org%3Apantsbuild%20macOS-11-ARM64&type=code)
So that suggests a theory to me:
However, I cannot find the specific job that seeded the cache, so I'm not 100% sure. A possibility to validate this would be:
This will be destructive to any state that might be related if we need it for further debugging... but I think this is fine
> What happens if we run an older build that did work now?
I did a test build of the 2.19.1 code in #20766: https://github.com/pantsbuild/pants/actions/runs/8623880391?pr=20766
The PEX artefact attached there is broken, and different to the real 2.19.1 code:
> If it fails it means changes on github's side, if it passes, it means we broke something on our end.
I think a failure could also be an out-of-band change to the machine's persistent state like OS upgrades or any other activities a human might do, or... following the theory above, seeding the cache with incorrect settings.
> A possibility to validate this would be
I've done this.
The artifacts are still broken ❌
I theorise that the `ARCHFLAGS: -arch arm64` settings in this repo aren't doing what we expect / what they used to.
I'll forge ahead with validating whether upgrading to a `psutil` version with wheels unblocks releases (#20775), and apply that to the other release branches too (#20773, #20774).
Taking a different tack, I configured my laptop (M1, running macOS 14.4) as a self-hosted runner in a private repo just now, and ran the workflow from https://github.com/pantsbuild/pants/pull/20766 on it, which has various tests like `pip install psutil` and `arch` and `clang` invocations. My laptop gives me behaviour closer to what I'd hope, running ARM64 by default:
| command | behaviour on laptop | behaviour on pants CI |
|---|---|---|
| `arch` | arm64 ✅ | x86_64 ❌ |
| `pip install psutil==5.9.0 --no-cache` | `.so` files are arm64 ✅ | x86_64 ❌ |
| `pip install ...` with `ARCHFLAGS=-arch arm64` | `.so` files are arm64 ✅ | arm64 ✅ |
| `clang ...` | arm64 output ✅ | x86_64 output ❌ |
| actions runner version | 2.315.0 ✅ | 2.315.0 ✅ |
So, maybe we could resolve this issue (and the need to reliably use `arch` or `ARCHFLAGS` everywhere) by working out how to get the self-hosted runner to run jobs as arm64 by default.
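To tell "running as x86-64 under Rosetta" apart from "running natively" on the runner, macOS exposes the `sysctl.proc_translated` value (1 when translated, 0 when native, and an error on systems that don't know the name). The sketch below shells out to `sysctl` for it. The helper names are made up for illustration, and it reports the status of the spawned `sysctl` process, which is used here as a proxy for the calling process on the assumption that child processes inherit the parent's mode.

```python
import platform
import subprocess
from typing import Optional


def parse_proc_translated(output: str) -> Optional[bool]:
    """Map the sysctl output to True (translated), False (native) or None."""
    value = output.strip()
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def running_under_rosetta() -> Optional[bool]:
    """Best-effort Rosetta check; None means the answer is not available."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        return None  # no sysctl binary on this system
    if result.returncode != 0:
        return None  # the sysctl name does not exist here
    return parse_proc_translated(result.stdout)


if __name__ == "__main__":
    print("machine:", platform.machine())
    print("translated by Rosetta:", running_under_rosetta())
```

Running this as an early step in the workflow on both the laptop runner and the CI runner would add one more row to the comparison.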
Ah, further note, the `.deps/pantsbuild.pants-...-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so` file (compiled from `dummy.c`, and, importantly, not cached in the same way as the psutil wheel) is currently `x86_64`, but used to be `arm64`, and it changed between these two releases:
https://github.com/pantsbuild/pants/compare/release_2.17.0.dev5...release_2.17.0a0
These releases involved significant changes to the infrastructure/release process, so it seems plausible that they may've regressed the `ARCHFLAGS` handling.
I've put out 2.19.3rc0, 2.20.0rc4 and 2.21.0.dev6 with the psutil wheel workaround, that all seem to work on my arm64 macOS machine.
I think we've resolved the acute issue, which was caused by a specific dependency that has native code but no wheel: the runner settings meant it was being built incorrectly. The resolution: update to `psutil==5.9.8`, which does have a wheel for arm64 macOS, and cherry-pick that to all live branches.
However, the underlying bug is still here. I've filed https://github.com/pantsbuild/pants/issues/20790 for that.
Describe the bug
Running any of the versions published recently on arm64 macOS hits an error attempting to import `psutil`. For instance, `PANTS_VERSION=2.19.2 pants`:
ImportError: dlopen(/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so, 0x0002): tried: '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (no such file), '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))
Workaround: use an earlier release. NB. for 2.19.2, 2.19.2rc0 has essentially the same code (only a documentation difference).
Checking the release PEXes confirms that they contain x86-64 objects, for instance, for 2.19.2:
NB. this is likely related to https://github.com/pantsbuild/pants/issues/20759 and potentially https://github.com/pantsbuild/pants/pull/20756.
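The PEX check described above can be approximated with a short script. This is a sketch under assumptions rather than the command originally used: it assumes the release PEX can be opened as a zip archive with the `.so` files as direct members, it only recognises thin 64-bit Mach-O headers (not universal binaries), and the function names are hypothetical.

```python
import struct
import zipfile
from typing import Dict

MH_MAGIC_64 = 0xFEEDFACF  # thin 64-bit Mach-O magic
CPU_TYPES = {0x01000007: "x86_64", 0x0100000C: "arm64"}


def thin_macho_arch(header: bytes) -> str:
    """Name the architecture of a thin 64-bit Mach-O header, else 'unknown'."""
    if len(header) < 8:
        return "unknown"
    magic, cputype = struct.unpack("<II", header[:8])
    if magic != MH_MAGIC_64:
        return "unknown"
    return CPU_TYPES.get(cputype, hex(cputype))


def shared_object_archs(archive_path) -> Dict[str, str]:
    """Map each .so member of a zip archive to its architecture."""
    archs = {}
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if name.endswith(".so"):
                with archive.open(name) as member:
                    # Only the first 8 bytes are needed: magic plus CPU type.
                    archs[name] = thin_macho_arch(member.read(8))
    return archs
```

Calling `shared_object_archs("path/to/release.pex")` (a placeholder path) would return a dictionary keyed by member name, so any `x86_64` value in an arm64 release stands out.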
Pants version
OS macOS
Additional info
We'll presumably want to do a fast-follow 2.19.3 stable release once we fix this, and this blocks 2.20.0 stable too.