Open shonfeder opened 1 month ago
Is it reproducible, or does it only happen from time to time?
It's reproducible. E.g., every Jane Street package looks to be suffering the same fate currently: https://opam.ci.ocaml.org/github/ocaml/opam-repository/commit/b0fb4f8c144e4e78cd6de1972fc3453a2024d8a8
It seems to happen only on arm32 & freebsd images.
If it fails at the repository reloading stage, it shouldn't go through that code, because in the image the repository is defined as a directory (file:///home/opam/opam-repository).
Is it possible to extract a backtrace and some logs (-vv | --debug)?
I'll see about reproducing this next week. I also realized I didn't take the container caching into account when I claimed it is reproducible, and all of the CI jobs I've looked at so far are pulling that step from the cache.
Trying to debug this without access to those machines has so far not produced any results. I've opened https://github.com/ocaml/opam/pull/5975 to at least show a more decent error message, which would help debug this further. My instinct tells me it is due to a file that is somehow removed on those arm machines, but I'm still baffled as to why only arm (arm32 and arm64) machines are affected.
The failure came from the fact that the image got broken somewhere and the $HOME directory was no longer readable, writeable or owned by the proper user.
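For anyone who wants to rule out the same cause on their own image, here is a minimal sketch of a sanity check for the three conditions mentioned above (readable, writeable, owned by the expected user). It assumes a POSIX system; `check_home` is a hypothetical helper written for illustration and is not part of opam or the CI tooling.

```python
import os
import stat


def check_home(path, uid=None):
    """Return a list of human-readable problems found with `path`.

    Hypothetical helper: an empty list means the directory exists, is
    owned by `uid` (default: the current user), and is readable and
    writable by the current process.
    """
    uid = os.getuid() if uid is None else uid
    try:
        st = os.stat(path)
    except OSError as exc:
        # Missing path, or a parent directory we cannot traverse.
        return [f"cannot stat {path}: {exc.strerror}"]

    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append(f"{path} is not a directory")
    if st.st_uid != uid:
        problems.append(f"{path} is owned by uid {st.st_uid}, expected {uid}")
    # Listing a directory needs read permission, entering it needs execute.
    if not os.access(path, os.R_OK | os.X_OK):
        problems.append(f"{path} is not readable")
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable")
    return problems


if __name__ == "__main__":
    # Check the current user's home directory and print any problem found.
    for problem in check_home(os.path.expanduser("~")):
        print(problem)
```

Running something like this as the image's build user, before any opam command, should point at a broken $HOME directly instead of surfacing later as an unrelated assertion failure.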
The error message should be fixed though. I'm planning to open a more lightweight version of https://github.com/ocaml/opam/pull/5975 very soon to catch that sooner and display a better error message. I've removed this issue from the 2.2 board as it is no longer urgent.
First noticed (afaik) at https://github.com/ocaml/opam-repository/pull/25905#issuecomment-2119010020
The error we're seeing in CI is
which can be seen in, e.g., this CI log
The failing assertion is at
https://github.com/ocaml/opam/blob/391333d35bcdc8b55df709b876b8bafcf75f3452/src/repository/opamDownload.ml#L140