Open mtoohey31 opened 2 months ago
I think this should be considered a Lake issue, it applies to any other build step output just as well.
I think the particular error I'm running into is because of the use of the tempfile and the rename
operation. The .olean
files appear to be the only case where this specific approach is used, but I agree that we should probably look at how other outputs are saved too, cause it doesn't seem like there's anything enforcing exclusive access with those either.
In terms of considering this a Lake issue, are you thinking we should solve this via locking or something in Lake, instead of by modifying how Lean saves stuff?
Yes, this needs locking in Lake. However, in the interest of defense in depth, we may want to upgrade to #5125 here. Please consider opening a new issue for general Lake output race conditions.
Prerequisites
Please put an X between the brackets as you perform the following steps:
Description
When there are multiple concurrent
lean
processes and both are runninglean_save_module_data
for the same file simultaneously, because the name of the tempfile is always the same, the rename operation here can fail if another process has already renamed the file, such that there is no longer a file atolean_tmp_fn
.Context
I initially ran into this issue when using the language server. I'm using it via Helix so when I modify a dependency of one of the open files, I restart the language server using the
:lsp-restart
command. This seems to lead to a situation where multiplelean
processes are being invoked for the same file simultaneously (I don't know the details of why this happens), which often causes the error above to occur.Steps to Reproduce
https://github.com/mtoohey31/lean-race-condition-repro
lake build Repro
to ensure it builds successfully with just a single instance.lake clean
.lake build Repro & lake build Repro
(the single&
is intentional cause we need at least two concurrent instances to cause the issue). This may be somewhat scheduler/platform dependent since it's a concurrency issue (though it occurs quite consistently for me). If it doesn't reproduce on the first try on your machine, repeat step 3 before retrying 4 to ensure the files actually get rebuilt. (For some reason the issue doesn't reproduce if I fully delete.lake
before running this step, so I'd recommend sticking tolake clean
.)Expected behavior: Both lake processes should succeed without error.
Actual behavior: One process succeeds, but the other fails with the following output:
Versions
"4.12.0-nightly-2024-08-17"
Additional Information
I think in order to fix this (and avoid other related issues) we should make the following two changes to how the tempfile is chosen/opened:
O_EXCL
flag to ensure no other process uses the same file. We can include retry logic to decrease failures in the unlikely case of the random components matching.I'd be happy to implement this fix if maintainers agree that this is the right solution.