leanprover / lean4

Lean 4 programming language and theorem prover
https://lean-lang.org
Apache License 2.0
4.66k stars 418 forks source link

Race condition in `lean_save_module_data` with concurrent `lean` processes #5084

Open mtoohey31 opened 2 months ago

mtoohey31 commented 2 months ago

Prerequisites

Please put an X between the brackets as you perform the following steps:

Description

When there are multiple concurrent lean processes and both are running lean_save_module_data for the same file simultaneously, because the name of the tempfile is always the same, the rename operation here can fail if another process has already renamed the file, such that there is no longer a file at olean_tmp_fn.

Context

I initially ran into this issue when using the language server. I'm using it via Helix so when I modify a dependency of one of the open files, I restart the language server using the :lsp-restart command. This seems to lead to a situation where multiple lean processes are being invoked for the same file simultaneously (I don't know the details of why this happens), which often causes the error above to occur.

Steps to Reproduce

  1. Clone https://github.com/mtoohey31/lean-race-condition-repro
  2. Run lake build Repro to ensure it builds successfully with just a single instance.
  3. Run lake clean.
  4. Run lake build Repro & lake build Repro (the single & is intentional cause we need at least two concurrent instances to cause the issue). This may be somewhat scheduler/platform dependent since it's a concurrency issue (though it occurs quite consistently for me). If it doesn't reproduce on the first try on your machine, repeat step 3 before retrying 4 to ensure the files actually get rebuilt. (For some reason the issue doesn't reproduce if I fully delete .lake before running this step, so I'd recommend sticking to lake clean.)

Expected behavior: Both lake processes should succeed without error.

Actual behavior: One process succeeds, but the other fails with the following output:

✖ [2/5] Building Repro.World
trace: .> LEAN_PATH=././.lake/build/lib LD_LIBRARY_PATH= .../.elan/toolchains/nightly/bin/lean ././././Repro/World.lean -R ./././. -o ././.lake/build/lib/Repro/World.olean -i ././.lake/build/lib/Repro/World.ilean -c ././.lake/build/ir/Repro/World.c --json
info: stderr:
failed to write '././.lake/build/lib/Repro/World.olean': 2 No such file or directory
error: Lean exited with code 1
Some required builds logged failures:
- Repro.World
error: build failed

Versions

"4.12.0-nightly-2024-08-17"

$ cat /etc/lsb-release
DISTRIB_CODENAME=vicuna
DISTRIB_DESCRIPTION="NixOS 24.11 (Vicuna)"
DISTRIB_ID=nixos
DISTRIB_RELEASE="24.11"
LSB_VERSION="24.11 (Vicuna)"
$ uname -r
6.6.44

Additional Information

I think in order to fix this (and avoid other related issues) we should make the following two changes to how the tempfile is chosen/opened:

  1. We should add a random component to the filename.
  2. We should open it with some method that sets the O_EXCL flag to ensure no other process uses the same file. We can include retry logic to decrease failures in the unlikely case of the random components matching.

I'd be happy to implement this fix if maintainers agree that this is the right solution.

Kha commented 2 months ago

I think this should be considered a Lake issue, it applies to any other build step output just as well.

mtoohey31 commented 2 months ago

I think the particular error I'm running into is because of the use of the tempfile and the rename operation. The .olean files appear to be the only case where this specific approach is used, but I agree that we should probably look at how other outputs are saved too, cause it doesn't seem like there's anything enforcing exclusive access with those either.

In terms of considering this a Lake issue, are you thinking we should solve this via locking or something in Lake, instead of by modifying how Lean saves stuff?

Kha commented 2 months ago

Yes, this needs locking in Lake. However, in the interest of defense in depth, we may want to upgrade to #5125 here. Please consider opening a new issue for general Lake output race conditions.