ocaml / opam

opam is a source-based package manager. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow.
https://opam.ocaml.org

`OpamDownload` assertion failure is causing opam-repo-ci builds to fail on arm32-ocaml-4.14 #5971

Open shonfeder opened 1 month ago

shonfeder commented 1 month ago

First noticed (afaik) at https://github.com/ocaml/opam-repository/pull/25905#issuecomment-2119010020

The error we're seeing in CI is

/home/opam: (run (network host)
                 (shell "opam init --reinit --config .opamrc-sandbox -ni"))
Fatal error:
File "src/repository/opamDownload.ml", line 140, characters 2-8: Assertion failed
"/usr/bin/linux32" "/bin/sh" "-c" "opam init --reinit --config .opamrc-sandbox -ni" failed with exit status 99

which can be seen in, e.g., this CI log

The failing assertion is at

https://github.com/ocaml/opam/blob/391333d35bcdc8b55df709b876b8bafcf75f3452/src/repository/opamDownload.ml#L140

kit-ty-kate commented 1 month ago

Is it reproducible, or does it only happen from time to time?

dbuenzli commented 1 month ago

FWIW it also happened on the cmdliner release here.

shonfeder commented 1 month ago

It's reproducible. E.g., every Jane Street package looks to be suffering the same fate currently: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b0fb4f8c144e4e78cd6de1972fc3453a2024d8a8

rjbou commented 1 month ago

It seems to happen only on arm32 and FreeBSD images. If the failure occurs at the repository reloading stage, it shouldn't go through that code, since in the image the repository is defined as a directory (file:///home/opam/opam-repository). Is it possible to extract a backtrace and some logs (-vv | --debug)?
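
For anyone with access to an affected image, one way to collect that information might look like the sketch below. The helper name `run_with_backtrace` and the log file name are hypothetical; `OCAMLRUNPARAM=b` is the standard OCaml runtime setting that asks for a backtrace on an uncaught exception, and whether the opam binary in the image actually prints a useful one depends on how it was built.

```shell
# Hypothetical helper: run a command with OCaml backtraces enabled and
# capture stdout and stderr in a log file. Returns the command's exit status.
run_with_backtrace() {
  log_file=$1
  shift
  OCAMLRUNPARAM=b "$@" >"$log_file" 2>&1
}

# Example invocation (assumes opam is on PATH; not run here):
#   run_with_backtrace opam-debug.log \
#     opam init --reinit --config .opamrc-sandbox -ni --debug
```

Replacing `--debug` with `-vv` in the example gives the other log level mentioned above.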

shonfeder commented 1 month ago

I'll see about reproducing this next week. I also realized I didn't take container caching into account when I claimed it is reproducible: all of the CI jobs I've looked at so far are pulling that step from the cache.

kit-ty-kate commented 1 month ago

Trying to debug this without access to those machines has so far not produced any results. I've opened https://github.com/ocaml/opam/pull/5975 to at least show a more decent error message, which would help debug this further. My instinct tells me it is due to a file that is somehow removed on those arm machines, but I'm still baffled as to why only arm (arm32 and arm64) machines are affected.

kit-ty-kate commented 1 month ago

The failure came from the fact that the image got broken somewhere and the $HOME directory was no longer readable, writeable or owned by the proper user.
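
A quick sanity check for this kind of breakage could be sketched as follows. The function name `check_home` is hypothetical and this is not what opam itself does; it only tests the three conditions named above for the current user, using the shell's `test` operators (`-O` is available in bash and dash but is not strictly POSIX).

```shell
# Hypothetical check: is the directory readable, writable, and owned by
# the user running the script? Defaults to $HOME when no argument is given.
check_home() {
  dir=${1:-$HOME}
  rc=0
  if [ ! -d "$dir" ]; then
    echo "$dir: not a directory" >&2
    return 1
  fi
  [ -r "$dir" ] || { echo "$dir: not readable" >&2; rc=1; }
  [ -w "$dir" ] || { echo "$dir: not writable" >&2; rc=1; }
  [ -O "$dir" ] || { echo "$dir: not owned by the current user" >&2; rc=1; }
  return "$rc"
}
```

Running `check_home` as the `opam` user inside the image before `opam init` would report each failed condition on stderr and return a non-zero status.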

The error message should be fixed though. I'm planning to open a more lightweight version of https://github.com/ocaml/opam/pull/5975 very soon to catch that sooner and display a better error message. I've removed this issue from the 2.2 board as it is no longer urgent.