ocaml / opam

opam is a source-based package manager. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow.
https://opam.ocaml.org

install commands run in parallel in multiple switches produce a weird failure #5328

Open chetmurthy opened 1 year ago

chetmurthy commented 1 year ago

opam version: 2.1.2
Linux: Ubuntu 22.04.1 LTS

I'm trying to run opam install commands in multiple switches in parallel, and I got this rather cryptic error, which I don't know how to interpret. Here's what I'm doing:

  1. create a number of empty switches, using opam switch create -y --empty $v (where $v is "4.02.0" or "4.14.0", for example). I'm creating 27 switches in total (between 4.02.0 and 4.13.1).
  2. then for each such value of $v, run
    OPAMYES=true ./inopam $v opam install --yes ocaml-base-compiler.$v &

    where inopam is described here: https://discuss.ocaml.org/t/newbie-tip-using-multiple-projects-switches-from-command-line-and-emacs/8303?u=chet_murthy It sets the relevant environment variables for a switch, and then runs a command.

inopam is just a little wrapper around opam exec that does --set-root and --set-switch.
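
Concretely, the whole run was along these lines (a sketch; the version list is abbreviated here, and the real run used 27 versions):

#!/bin/sh
# Reproduction sketch; the real run covered 4.02.0 through 4.13.1.
versions="4.02.0 4.07.0 4.13.1"
# 1. create the empty switches, sequentially
for v in $versions; do
  opam switch create -y --empty "$v"
done
# 2. populate them all in parallel; inopam "$v" cmd... is roughly
#    opam exec --switch "$v" --set-switch --set-root -- cmd...
for v in $versions; do
  OPAMYES=true ./inopam "$v" opam install --yes "ocaml-base-compiler.$v" &
done
wait  # wait for every background install to finish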

So when I run this "install command" in each switch, it seems like they should all be completely independent of each other, and since the switches have all been created, I don't understand why there should be a global config.lock. My intent is of course to move on to installing massive numbers of packages in all these switches, in order to do CI testing across all of them.

What is even weirder, if I look at that switch (4.13.1), it turns out that the package that supposedly failed was actually installed just fine.

So it seems like there's some error going on, but I don't know what it is.

[ERROR] Another process has locked /home/chet/Hack/Opam-2.1.2/GENERIC/config.lock and non blocking mode enabled

#=== ERROR while installing ocaml-base-compiler.4.13.1 ========================#
OpamStd.OpamSys.Exit(15)

<><> Error report <><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
┌─ The following actions failed
│ ∗ install ocaml-base-compiler 4.13.1
└─ 
┌─ The following changes have been performed (the rest was aborted)
│ ∗ install base-bigarray         base
│ ∗ install base-threads          base
│ ∗ install base-unix             base
│ ∗ install ocaml-options-vanilla 1
└─ 

<><> ocaml-base-compiler.4.13.1 troubleshooting <><><><><><><><><><><><><><><><>
=> A failure in the middle of the build may be caused by build parallelism
      (enabled by default).
      Please file a bug report at https://github.com/ocaml/opam-repository/issues
=> You can try installing again including --jobs=1
      to force a sequential build instead.

The former state can be restored with:
    /home/chet/Hack/Opam-2.1.2/bin/opam switch import "/home/chet/Hack/Opam-2.1.2/GENERIC/4.13.1/.opam-switch/backup/state-20221026055847.export"
kit-ty-kate commented 1 year ago

Note for ourselves:

rjbou commented 1 year ago

I wasn't able to reproduce. How did you parallelise your switch population?

chetmurthy commented 1 year ago

I used a shell script with a for-do-done loop in which the line above

OPAMYES=true ./inopam $v opam install --yes ocaml-base-compiler.$v &

was executed. I suspect that there's only a small window in which this lock is held. Also, I did 27 switches in parallel on a 16-thread AMD Ryzen machine -- so somewhat fast -- and that probably increased the chance of hitting the problem.

chetmurthy commented 1 year ago

Hello, I found another instance of the same problem. I've been using this "inopam" command to run package installs on a large number of switches in parallel (to test out package changes before submitting to opam CI). The command I've been using is

opam install --verbose --working-dir --with-test .

and when I ran this with "inopam", I got the following output:

Another process has locked /home/chet/Hack/Opam-2.1.2/GENERIC/config.lock, waiting (C-c to abort)... lock acquired.
Cleaning up switch 4.07.0

for one switch, out of the 21 switches I was running in parallel.

Since this is a mere package install, it would seem there is definitely a global lock in opam that is taken even during plain package installation. If there were a way to make opam commands wait until the lock is free (instead of just failing), that would be useful.
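
(A coarse workaround sketch, not an opam feature: wrapping each invocation in flock(1) makes it block on an exclusive file lock instead of failing; the lock-file path here is an assumption.)

# flock waits until the lock is free, then runs the command
flock /tmp/opam-serialize.lock ./inopam "$v" opam install --yes "ocaml-base-compiler.$v"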

rjbou commented 1 year ago

Can you share the outputs (err_* files) of this command parallelised?

./inopam $switch opam <your command> --yes --debug-level 2 2>&1 | grep LOCK > err_$switch
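
For example, parallelised over every switch (a sketch; the $versions list is an assumption carried over from the reproduction above):

for switch in $versions; do
  ./inopam "$switch" opam install --yes "ocaml-base-compiler.$switch" \
    --debug-level 2 2>&1 | grep LOCK > "err_$switch" &
done
wait
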
chetmurthy commented 1 year ago

If it happens again (pretty likely, since it's happened every time I've rebuilt those switches) I will definitely capture all the data. Can you be precise about what you'd like me to capture? The more precise you can be, the more data I'll capture.

chetmurthy commented 1 year ago

OK, here's another bit of what might be the same problem. I ran (using "inopam", on 21 switches in parallel) the command

opam clean -y -s

and got the output

Another process has locked /home/chet/Hack/Opam-2.1.2/GENERIC/config.lock, waiting (C-c to abort)... lock acquired.

more than a few times. So it would seem that opam clean -s also takes a global lock (which it ought not to, right? Switch-specific cleanup shouldn't incur a global lock, right?)
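
For reference, the parallel run was along these lines (a sketch; $versions as before):

for v in $versions; do
  ./inopam "$v" opam clean -y -s &
done
wait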

chetmurthy commented 1 year ago

Also: I'm re-running those commands right now, and will do so a bunch of times, in hopes of triggering the bug. It's .... not very frequent, it would seem.

rjbou commented 1 year ago

I'm trying to figure out what locks the state, as it shouldn't be locked. In fact, you can even remove the grep LOCK and just keep --debug-level 2. I'm wondering if it's an underlying opam call made during the package build/install that is blocking, but it shouldn't be...

chetmurthy commented 1 year ago

If you could send me a patch that logs a stack-trace every time a lock is taken, I could send you back the output of running opam with that patch applied.

Oops, I mean, every time it tries to take a lock, and somebody else already has it.

rjbou commented 1 year ago

The idea of generating the err_ files is to see which opam command, if any, tries to take a global lock (you can take a look at the stderr output of opam install xxx --debug-level 2). It's not the blocked process that needs to be looked at, but the random other one that takes a global lock in between. In fact, you can add this to your inopam script to automate it:

if [ "$1" = "opam" ] ; then
  cmd="$@ --debug-level 2 2> /tmp/debugopam/err_$switch"
else
  cmd="$@"
fi
OPAMDEBUGSECTIONS="GSTATE SYSTEM" # to retrieve only a subset of debug infos
exec opam exec --switch "$switch" --set-switch --set-root -- "$cmd"
# clean debug files as everything succeed ; or remove them by hand to not automatise an rm
if [ $? -eq 0 ]; then
  rm /tmp/debugopam/err_*
fi

But if you prefer a patch to opam (less disruption), it is also possible.

For opam clean -y -s it's normal that it results in a lock, as opam takes a write lock on the global state. We could take the lock more precisely according to the arguments (e.g. no global-state lock if it is only switch cleaning).

chetmurthy commented 1 year ago
  1. your script is fine -- I'll use that from now on.
  2. Sure, I can understand that when opam clean cleans up global state, it should take a global lock; but -s means it's cleaning up switch-specific state, right? So it should not take a global lock?
  3. The usage that drove creating this issue was installing packages in a switch, so that was definitely not modifying global state. So in that case, it would have sufficed to have a stack trace of the process that was attempting to take the lock (and finding it already held), right?
chetmurthy commented 1 year ago

continuing with #3: So this would mean that if we instrumented all locations where a global lock is taken, then we could figure out afterwards which of those locations were legitimate. E.g., we could print out the stack-trace, and the Sys.argv, and use that to decide whether a global lock should be taken for the command in question. I mean, as a database guy, I get that we shouldn't take locks at fine granularity: if a lock is needed, it should be taken around the largest scope possible, b/c that makes debugging simple.

So it seems like having a patch to print out the stack-trace (and Sys.argv) might still be useful.

rjbou commented 1 year ago
  1. In principle I agree, but in the code a global write lock is always taken for opam clean. I'll open a new issue to refine the lock-taking for opam clean.

  2. I confirm that for install a global lock should not be taken (unless there is a hard root upgrade, but that is not your case; otherwise it would fail in all cases).

  3. > So in that case, it would have sufficed to have a stack trace of the process that was attempting to take the lock (and finding it already held), right?

No, because we first need to know who wrongly took the global write lock. An opam install that takes a read (or no) global lock is normal; it is blocked only because another process took the write lock when it shouldn't have.

> So this would mean that if we instrumented all locations where a global lock is taken, then we could figure out afterwards which such locations were legitimate. E.g., we could print out the stack-trace, and the Sys.argv, and use that to decide whether a global lock should be taken for the command in question.

Yes, if we do it for all processes, we can figure out which one takes a write lock when it shouldn't. That's essentially what the script does: it just gets the information from outside (with extra arguments and debugging info) instead of internally by patching. But yes, patching would give more fine-grained information.