pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

macOS nightly wheel builds failing since 2024-11-19 #7019

Open swolchok opened 11 hours ago

swolchok commented 11 hours ago

πŸ› Describe the bug

Status page: https://github.com/pytorch/executorch/actions/workflows/build-wheels-m1.yml

Note that the Python 3.9 build always fails, so even though the runs show red, the remaining builds were successful through 2024-11-18.

Linking is failing with: ld: invalid use of ADRP in '_init_f32_vcopysign_config' to '_xnn_f32_vcopysign_ukernel__neon_u8'.

Versions

N/A

swolchok commented 11 hours ago

Inspection of PRs landed between the last good build and first bad build suggested the following:

Trial revert of #6837 in #7013 still failed the job; trialing revert of the other two PRs together

swolchok commented 11 hours ago

trial revert of #6522 in https://github.com/pytorch/executorch/pull/7020 did not fix the job

swolchok commented 10 hours ago

trial revert of #6892 in #7021 did not fix the job.

I am also not able to repro this locally, and I've inspected git diff 8526d0a2d798658b6a6e3a42ec935b8093f355ef..04f6fcd4b3920eaf1be9905d12b449f301f89ca7 without finding anything else suspicious, so I wonder if the runners broke somehow.
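For anyone who wants to re-check that range, here is a minimal sketch that lists the commits between the two hashes quoted above. It assumes a local executorch checkout with `git` on the PATH. The helper names and the optional `paths` filter are hypothetical and not part of any executorch tooling.

```python
import subprocess

# Hashes copied from the git diff range quoted above.
RANGE_START = "8526d0a2d798658b6a6e3a42ec935b8093f355ef"
RANGE_END = "04f6fcd4b3920eaf1be9905d12b449f301f89ca7"


def log_command(start, end, paths=()):
    """Build a `git log` command listing commits reachable from end but not start."""
    cmd = ["git", "log", "--oneline", "--no-merges", f"{start}..{end}"]
    if paths:
        # Optional: restrict the listing to commits touching these paths.
        cmd += ["--", *paths]
    return cmd


def commits_between(start, end, paths=(), repo="."):
    """Run the command in `repo` and return one summary line per commit."""
    result = subprocess.run(
        log_command(start, end, paths),
        cwd=repo,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in commits_between(RANGE_START, RANGE_END):
        print(line)
```

Passing a directory in `paths` narrows the listing to commits that touch it, which can help when checking whether anything build-related changed in the range.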

swolchok commented 10 hours ago

> I wonder if the runners broke somehow

I reran the last good workflow run; builds succeeded (there were some failures due to an unrelated issue).

larryliu0820 commented 9 hours ago

Found a failure with the same error message in a different job (test-llama-runner-mac): https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342737621?pr=7010

swolchok commented 9 hours ago

> Found a failure with the same error message in a different job (test-llama-runner-mac): https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342737621?pr=7010

that job is green on trunk runs though! https://hud.pytorch.org/hud/pytorch/executorch/main/1?per_page=50&name_filter=llama-runner-mac%20(fp32%2C%20mps

kimishpatel commented 9 hours ago

I'm late to this, so I'm not sure my comments will help, but was there any change related to an xnnpack upgrade? I ask because the failure is in xnnpack code.

swolchok commented 9 hours ago

@larryliu0820 suggested maybe the runner toolchain changed.

It looks like we're using macos-m1-stable runners for test-llama-runner-mac: https://github.com/pytorch/executorch/blob/main/.github/workflows/trunk.yml#L236

I'm not sure what runner the wheel build uses.

I don't know a whole lot about this runner type, but I see that 1) it seems to be in-house: https://github.com/pytorch/pytorch/issues/127490 2) I don't see recent activity in https://github.com/pytorch-labs/pytorch-gha-infra/ suggesting that there was a recent update

swolchok commented 9 hours ago

> any change related to xnnpack upgrade

As I mentioned above, I inspected all the commits (there aren't many) in the range of commit hashes flagged in the nightly builds.

larryliu0820 commented 9 hours ago

An example of trunk job passing:

https://github.com/pytorch/executorch/actions/runs/11962683652/job/33351640398

An example of PR job failing:

https://github.com/pytorch/executorch/actions/runs/11959891658/job/33342745520?pr=7010

I don't see any obvious difference between these two in terms of environment setup.

@huydhn anything obvious to you?
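One way to make the comparison of the two jobs less manual is to download both raw logs and diff only the lines that mention the toolchain. The sketch below is a rough illustration and not an existing script. The keyword list, the function names, and the assumption that each log line starts with an ISO timestamp are all guesses that would need adjusting to the real logs.

```python
import difflib
import re

# Assumed keywords that tend to appear on toolchain/environment lines.
TOOLCHAIN_PATTERN = re.compile(r"(xcode|clang|cmake|sdk|macos|python)", re.IGNORECASE)
# Assumed log prefix such as "2024-11-21T10:00:00.0000000Z ".
TIMESTAMP_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z\s+")


def toolchain_lines(log_text):
    """Strip timestamps and keep only lines that mention a toolchain keyword."""
    kept = []
    for line in log_text.splitlines():
        line = TIMESTAMP_PREFIX.sub("", line).strip()
        if TOOLCHAIN_PATTERN.search(line):
            kept.append(line)
    return kept


def diff_toolchains(passing_log, failing_log):
    """Return '-'/'+' lines for toolchain lines that differ between the two logs."""
    diff = difflib.unified_diff(
        toolchain_lines(passing_log),
        toolchain_lines(failing_log),
        fromfile="passing",
        tofile="failing",
        lineterm="",
        n=0,
    )
    # Drop the file headers and hunk markers; keep only changed lines.
    return [
        d for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]
```

Calling `diff_toolchains` with the text of the passing trunk log and the failing PR log returns an empty list when the filtered lines match. Any entries it does return are candidates for a runner or toolchain difference, not proof of one.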