pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

`ImportError: ... incompatible architecture` on arm64 macOS for Pants 2.19.2, 2.20.0rc3, 2.21.0.dev5 #20765

Closed huonw closed 3 months ago

huonw commented 3 months ago

Describe the bug

Running any of the versions published recently on arm64 macOS hits an error attempting to import psutil. For instance PANTS_VERSION=2.19.2 pants:

ImportError: dlopen(/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so, 0x0002): tried: '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (no such file), '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))

Workaround: use an earlier release. NB. for 2.19.2, 2.19.2rc0 has essentially the same code (only a documentation difference).

```
$ PANTS_VERSION=2.19.2 pants
Bootstrapping Pants 2.19.2
Installing pantsbuild.pants==2.19.2 into a virtual environment at /Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2
New virtual environment successfully created at /Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2.
Traceback (most recent call last):
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/bin/pants", line 8, in <module>
    entry_point = importlib.import_module(modname)
  File "/Users/huon/Library/Caches/nce/bf0cd90204a2cc6da48cae1e4b32f48c9f7031fbe1238c5972104ccb0155d368/cpython-3.9.18+20240107-aarch64-apple-darwin-install_only.tar.gz/python/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/pants/bin/pants_loader.py", line 18, in <module>
    from pants.bin.pants_runner import PantsRunner
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/pants/bin/pants_runner.py", line 14, in <module>
    from pants.base.exception_sink import ExceptionSink
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/pants/base/exception_sink.py", line 16, in <module>
    import psutil
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/__init__.py", line 123, in <module>
    from . import _psosx as _psplatform
  File "/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psosx.py", line 14, in <module>
    from . import _psutil_osx as cext
ImportError: dlopen(/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so, 0x0002): tried: '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (no such file), '/Users/huon/Library/Caches/nce/f46d2c12132bad9c27e0dd509186c190371abfe1add7e5ca42245466f35bea81/bindings/venvs/2.19.2/lib/python3.9/site-packages/psutil/_psutil_osx.cpython-39-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e' or 'arm64'))
```

Checking the release PEXes confirms that they contain x86-64 objects, for instance, for 2.19.2:

cd $(mktemp -d)

curl -LO https://github.com/pantsbuild/pants/releases/download/release_2.19.2/pants.2.19.2-cp39-darwin_arm64.pex

unzip -q pants.2.19.2-cp39-darwin_arm64.pex

file  .deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_{osx,posix}.cpython-39-darwin.so
.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_osx.cpython-39-darwin.so:   Mach-O 64-bit bundle x86_64
.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_posix.cpython-39-darwin.so: Mach-O 64-bit bundle x86_64
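
The manual `file` check above could be generalised to every native library in an unpacked PEX. The following is only a sketch: `report_wrong_arch` is a hypothetical helper name, not something in the Pants repo, and it assumes single-architecture Mach-O files whose `file` output ends with the architecture name, as in the output shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pants): reads `file` output on stdin and
# prints every Mach-O line that does not end with the expected architecture.
# Returns non-zero if any mismatch is found.
# Assumption: single-architecture files; each slice of a universal binary is
# listed on its own line by `file`, so the non-matching slice would be flagged.
report_wrong_arch() {
  local expected="$1" mismatches
  mismatches=$(grep 'Mach-O' | grep -v -E "[[:space:]]${expected}\$" || true)
  if [ -n "$mismatches" ]; then
    printf '%s\n' "$mismatches"
    return 1
  fi
}

# Example usage inside an unzipped PEX (not executed here):
#   find .deps -name '*.so' -exec file {} + | report_wrong_arch arm64
```

Run against the 2.19.2 PEX contents listed above, a check like this would be expected to print the two `psutil` lines, since both report `x86_64`.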

NB. this is likely related to https://github.com/pantsbuild/pants/issues/20759 and potentially https://github.com/pantsbuild/pants/pull/20756.

Pants version

OS macOS

Additional info

We'll presumably want to do a fast-follow 2.19.3 stable release once we fix this, and this blocks 2.20.0 stable too.

huonw commented 3 months ago

Maybe the psutil upgrade in #20760 would help paper over this, but there's probably some underlying issue... what's causing x86-64 code to end up in a macos_11_0_arm64 wheel?

huonw commented 3 months ago

I've made a few changes to the release metadata:

huonw commented 3 months ago

Using #20766 to try to introspect the machine:

huonw commented 3 months ago

The logs in the last one are not particularly revealing, so maybe finding the build in question won't be helpful... but also, we don't have much to go on other than that.

benjyw commented 3 months ago

oooooof

Is this a rosetta thing somehow?

huonw commented 3 months ago

Probably, yes.

I did a bunch of tests in #20766, which observed:

So, this suggests the runners are potentially running all GitHub jobs in x86-64/Rosetta mode by default, for some reason. I couldn't really find anything online, but theories might be:

  1. the self-hosted runner infrastructure itself is x86-64, so when it invokes a universal binary like zsh it'll choose the x86-64 slice (and then zsh invoking python invoking pip invoking clang will do the same)?
    1. Maybe updating/reinstalling would help? Installing the runner infrastructure on my local M1 Mac just now got various arm64 Mach-O binaries (if one can find the directory with it installed, file bin/* | grep Mach can confirm the architecture, I think)
  2. there's some configuration for the runner that manages this?
  3. the process that executes the ./run.sh script is in x86-64?
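
One way to test the Rosetta theories from inside a job is to ask macOS whether the current process is being translated. This is a minimal sketch, not something the workflow is known to do: `translation_status` is a hypothetical helper, and it relies on the `sysctl.proc_translated` key, which reports `1` for a process running under Rosetta and `0` for a native process on Apple Silicon (the key is absent on Intel Macs).

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn the value of `sysctl.proc_translated` into a label.
translation_status() {
  case "$1" in
    1) echo "translated (Rosetta)" ;;
    0) echo "native" ;;
    *) echo "unknown" ;;  # e.g. key missing on Intel Macs or non-macOS hosts
  esac
}

# Example usage in a CI step (not executed here):
#   translation_status "$(sysctl -n sysctl.proc_translated 2>/dev/null)"
#   uname -m   # x86_64 under Rosetta, arm64 when running natively
```

Because child processes inherit the translated state by default, running this at the very start of a job step would indicate whether the runner itself launched the step under Rosetta.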

I think someone who can poke at the machine live via ssh (and/or knows more about self-hosted runners) might need to take over the investigation here.

Key unanswered question: what changed that broke this?

I think #20760 might help this specific instance, but is only papering over the problem: we'll still be running in i386 mode and so any future wheel builds will have the same problem (and there's no guarantee that our testing is correct/doing what we expect)!

benjyw commented 3 months ago

Should we run all CI commands under arch, to force the arch to arm64? We do set ARCHFLAGS in those jobs, maybe something has changed on github's side that overrides that? What happens if we run an older build that did work now? If it fails it means changes on github's side, if it passes, it means we broke something on our end.

huonw commented 3 months ago

> Should we run all CI commands under arch, to force the arch to arm64?

I guess that would work, but I'd be a bit concerned. We'd have to remember to add it everywhere (including in other repos, like scie-pants) or risk random cache inconsistencies if we miss a place...
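
For reference, a sketch of what "running commands under `arch`" could look like if it were centralised in one wrapper rather than repeated at each call site. `run_as_arm64` is a hypothetical name, and the Apple Silicon detection via `hw.optional.arm64` is an assumption about how one might gate it. The concern above still applies: every entry point would have to go through the wrapper.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: on Apple Silicon macOS, run the given command with the
# arm64 slice selected via `arch -arm64`; elsewhere, run it unchanged.
# Note: `arch` launches executables, so shell functions and builtins cannot be
# wrapped this way on macOS.
run_as_arm64() {
  if [ "$(uname -s)" = "Darwin" ] \
    && [ "$(sysctl -n hw.optional.arm64 2>/dev/null)" = "1" ]; then
    arch -arm64 "$@"
  else
    "$@"
  fi
}

# Example usage (not executed here):
#   run_as_arm64 python -m pip install --no-cache psutil==5.9.0 --target /tmp/psutil
```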

> We do set ARCHFLAGS in those jobs

Ah cool, I hadn't noticed that. That does seem to do something locally and in CI

Locally:

```shell
ARCHFLAGS='-arch x86_64' python -m pip install --no-cache psutil==5.9.0 --target /tmp/psutil3
# ...
file /tmp/psutil3/psutil/*.so
# /tmp/psutil3/psutil/_psutil_osx.cpython-310-darwin.so: Mach-O 64-bit bundle x86_64
# /tmp/psutil3/psutil/_psutil_posix.cpython-310-darwin.so: Mach-O 64-bit bundle x86_64

ARCHFLAGS='-arch arm64' python -m pip install --no-cache psutil==5.9.0 --target /tmp/psutil4
# ...
file /tmp/psutil4/psutil/*.so
# /tmp/psutil4/psutil/_psutil_osx.cpython-310-darwin.so: Mach-O 64-bit bundle arm64
# /tmp/psutil4/psutil/_psutil_posix.cpython-310-darwin.so: Mach-O 64-bit bundle arm64
```

I notice that scie-pants doesn't set this in its CI https://github.com/search?q=repo%3Apantsbuild%2Fscie-pants%20ARCHFLAGS&type=code. (It looks like these are the only two repos that use this runner, in the pants repo: https://github.com/search?q=org%3Apantsbuild%20macOS-11-ARM64&type=code)

So that suggests a theory to me:

However, I cannot find the specific job that seeded the cache, so I'm not 100% sure. A possibility to validate this would be:

  1. find a quiescent time when there are no other jobs running (especially not scie-pants ones)
  2. clear the cache
  3. do a build with known-good flags to re-seed the cache

This will destroy any related state that we might need for further debugging... but I think this is fine.

> What happens if we run an older build that did work now?

I did a test build of the 2.19.1 code in #20766: https://github.com/pantsbuild/pants/actions/runs/8623880391?pr=20766

The PEX artefact attached there is broken, and different to the real 2.19.1 code:

```shell
# confirmation that the adhoc build is 2.19.1:
cat /tmp/pants-adhoc-build-20766/.deps/pantsbuild.pants-2.19.1-cp39-cp39-macosx_11_0_arm64.whl/pants/_version/VERSION
# 2.19.1

file /tmp/pants-adhoc-build-20766/.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/*.so
# .deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_osx.cpython-39-darwin.so: Mach-O 64-bit bundle x86_64
# .deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_posix.cpython-39-darwin.so: Mach-O 64-bit bundle x86_64

file /tmp/pants-2.19.1/.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/*.so
# /tmp/pants-2.19.1/.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_osx.cpython-39-darwin.so: Mach-O 64-bit bundle arm64
# /tmp/pants-2.19.1/.deps/psutil-5.9.0-cp39-cp39-macosx_11_0_arm64.whl/psutil/_psutil_posix.cpython-39-darwin.so: Mach-O 64-bit bundle arm64
```

> If it fails it means changes on github's side, if it passes, it means we broke something on our end.

I think a failure could also be an out-of-band change to the machine's persistent state like OS upgrades or any other activities a human might do, or... following the theory above, seeding the cache with incorrect settings.

huonw commented 3 months ago

> A possibility to validate this would be

I've done this.

  1. cleared cache at 2024-04-10T00:42Z (https://github.com/pantsbuild/pants/actions/runs/8624215879/job/23638827121)
  2. retriggered the ARM64 macOS "build wheels" job for 2.21.0.dev5 to republish the PEX for that release (https://github.com/pantsbuild/pants/actions/runs/8582404805/job/23638846957)
    1. that downloaded the sdist and built a wheel for `psutil` at 2024-04-10T00:42:32Z, within the job with `ARCHFLAGS: -arch arm64` set: https://github.com/pantsbuild/pants/actions/runs/8582404805/job/23638846957#step:9:216 ✅
    2. that job failed because the release wheel and PEX already exist for this platform
  3. delete the macOS arm64 wheel and PEX from https://github.com/pantsbuild/pants/releases/tag/release_2.21.0.dev5
  4. retry the job: https://github.com/pantsbuild/pants/actions/runs/8582404805/job/23639138015
    1. that used the `psutil` wheel from cache: https://github.com/pantsbuild/pants/actions/runs/8582404805/job/23639138015#step:9:378
  5. that succeeded and uploaded new artifacts

The artifacts are still broken ❌

I theorise that the ARCHFLAGS: -arch arm64 settings in this repo aren't doing what we expect / what they used to.

I'll forge ahead with validating whether upgrading to a psutil version with wheels unblocks releases (#20775), and apply that to the other release branches too (#20773, #20774).


Taking a different tack, I configured my laptop (M1, running macOS 14.4) as a self-hosted runner in a private repo just now, and ran the workflow from https://github.com/pantsbuild/pants/pull/20766 on it, which has various tests like pip install psutil and arch and clang invocations. My laptop gives me behaviour closer to what I'd hope, running ARM64 by default:

| command | behaviour on laptop | behaviour on pants CI |
| --- | --- | --- |
| `arch` | arm64 | x86_64 |
| `pip install psutil==5.9.0 --no-cache` | .so files are arm64 | x86_64 |
| `pip install ...` with `ARCHFLAGS=-arch arm64` | .so files are arm64 | arm64 |
| `clang ...` | arm64 output ✅ | x86_64 output ❌ |
| actions runner version | 2.315.0 ✅ | 2.315.0 ✅ |

So, maybe we could resolve this issue (and reliably avoid the need to use arch or ARCHFLAGS everywhere) by working out how to get the self-hosted runner to run jobs as arm64 by default.

huonw commented 3 months ago

Ah, further note, the .deps/pantsbuild.pants-...-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so file (compiled from dummy.c, and, importantly, not cached in the same way as the psutil wheel) is currently x86_64, but used to be arm64, and it changed between these two releases:

https://github.com/pantsbuild/pants/compare/release_2.17.0.dev5...release_2.17.0a0

These releases involved significant changes to the infrastructure/release process, so it seems plausible that they may've regressed the ARCHFLAGS handling.

```shell
#!/usr/bin/env bash
function check() {
  version="$1"
  echo "checking version $version"
  download=$(mktemp -d)
  curl -L https://github.com/pantsbuild/pants/releases/download/release_$version/pants.$version-cp39-darwin_arm64.pex -o $download/pex
  path=".deps/pantsbuild.pants-$version-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so"
  unzip $download/pex $path -d $download
  file $download/$path
}
check 2.17.0.dev5
check 2.17.0a0
```

```
./check-pants-arch.sh
checking version 2.17.0.dev5
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 18.1M  100 18.1M    0     0  6733k      0  0:00:02  0:00:02 --:--:-- 7954k
Archive:  /var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.fvsML1lLG4/pex
  inflating: /var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.fvsML1lLG4/.deps/pantsbuild.pants-2.17.0.dev5-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so
/var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.fvsML1lLG4/.deps/pantsbuild.pants-2.17.0.dev5-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so: Mach-O 64-bit bundle arm64
checking version 2.17.0a0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 18.9M  100 18.9M    0     0  7086k      0  0:00:02  0:00:02 --:--:-- 12.4M
Archive:  /var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.7NOMfJOBoH/pex
  inflating: /var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.7NOMfJOBoH/.deps/pantsbuild.pants-2.17.0a0-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so
/var/folders/sv/vd266m4d4lvctgs2wpnhjs9w0000gn/T/tmp.7NOMfJOBoH/.deps/pantsbuild.pants-2.17.0a0-cp39-cp39-macosx_11_0_arm64.whl/native_engine.cpython-39-darwin.so: Mach-O 64-bit bundle x86_64
```

huonw commented 3 months ago

I've put out 2.19.3rc0, 2.20.0rc4 and 2.21.0.dev6 with the psutil wheel workaround, and they all seem to work on my arm64 macOS machine.

huonw commented 3 months ago

I think we've resolved the acute issue: psutil was a dependency with native code but no prebuilt wheel for arm64 macOS, and the runner settings meant it was being built incorrectly. The resolution: update to psutil==5.9.8, which does have a wheel for arm64 macOS, and cherry-pick that to all live branches.
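
Since the workaround depends on a prebuilt wheel existing, one could check for that up front instead of discovering a source build after the fact. The sketch below is an assumption-laden illustration, not part of the Pants release process: `has_arm64_macos_wheel` and the `PIP_CMD` override are hypothetical names, and the target is hard-coded to CPython 3.9 on `macosx_11_0_arm64` to match the wheels discussed in this issue. It uses `pip download` with `--only-binary=:all:`, which refuses to fall back to an sdist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if pip can download a prebuilt wheel for
# the requirement targeting CPython 3.9 on arm64 macOS, never an sdist.
# PIP_CMD is an assumed override hook; it defaults to `python -m pip`.
has_arm64_macos_wheel() {
  local requirement="$1" dest rc=0
  dest=$(mktemp -d)
  # Intentionally unquoted so that "python -m pip" splits into words.
  ${PIP_CMD:-python -m pip} download \
    --only-binary=:all: --no-deps \
    --platform macosx_11_0_arm64 \
    --python-version 3.9 --implementation cp \
    --dest "$dest" "$requirement" >/dev/null 2>&1 || rc=$?
  rm -rf "$dest"
  return "$rc"
}

# Example usage (needs network access, so not executed here):
#   has_arm64_macos_wheel psutil==5.9.8 && echo "wheel available"
```

A check like this only covers the "no wheel, so built from source" path; it does not address the underlying runner architecture problem.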

However, the underlying bug is still here. I've filed https://github.com/pantsbuild/pants/issues/20790 for that.