Open rlepigre opened 1 year ago
FWIW, dune docs don't promise support for NFS, but Emilio and @rgrinberg wrote it should work, and this has been stable for months.
cc @emillon
Based on:
rmdir(/home/.../.cache/dune/db/temp/dune_..._artifacts): Directory not empty
I'd say that this comes from here:
https://github.com/ocaml/dune/blob/3.11.1/src/dune_cache/local.ml#L196
and that removing the temporary directory fails when exiting this block, because a file couldn't be removed. I'll check what we can do about that.
Not the prettiest of tests, but I could reproduce by making an `unlink` call fail:
$ export DUNE_CACHE_ROOT=`pwd`/cache
$ export DUNE_CACHE=enabled
$ cat > dune-project << EOF
> (lang dune 1.0)
> EOF
$ cat > dune << EOF
> (executable
> (name e))
> EOF
$ touch e.ml
$ strace -e unlink -o x.strace dune build
$ unlink_id=$(<x.strace head -n -1| tail -n +2|cat -n|grep _artifacts|head -n1|cut -f1)
$ dune clean; rm -rf $DUNE_CACHE_ROOT
$ strace -e fault=unlink:when="$unlink_id" -o x.strace dune build 2>&1| sed -e 's/dune_.*_artifacts/dune_*_artifacts/'
Error:
rmdir($TESTCASE_ROOT/cache/temp/dune_*_artifacts): Directory not empty
-> required by _build/default/.dune/configurator
I think that we should emit a warning in these cases instead of failing:
- if no `unlink` failed, expect that `rmdir` should not fail
- if an `unlink` failed, turn `rmdir` failure into a warning
Cool, thanks for looking into this!
> I think that we should emit a warning in these cases instead of failing:
> - if no `unlink` failed, expect that `rmdir` should not fail
> - if an `unlink` failed, turn `rmdir` failure into a warning
Sounds like this would fix our problem, assuming there is a way to silence such warnings.
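To make the proposed policy concrete, here is a rough shell sketch of it. This is not Dune's actual code (which is OCaml); the function name `remove_temp_dir`, the `RM_CMD` override (there only to simulate a failing `unlink`), and the `QUIET_CLEANUP_WARNINGS` switch are all assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not Dune's implementation: remove a temporary
# directory, downgrading an rmdir failure to a warning when an earlier
# unlink failed.
# RM_CMD is an invented override, used only to simulate a failing unlink.
# QUIET_CLEANUP_WARNINGS is an invented switch to silence the warning.
remove_temp_dir() {
  dir=$1
  unlink_failed=0
  # Note: this glob skips dotfiles, which is good enough for a sketch.
  for f in "$dir"/*; do
    [ -e "$f" ] || continue   # empty directory: the glob did not match
    if ! ${RM_CMD:-rm -f} "$f" 2>/dev/null; then
      unlink_failed=1
    fi
  done
  if rmdir "$dir" 2>/dev/null; then
    return 0
  fi
  if [ "$unlink_failed" -eq 1 ]; then
    # An unlink failed, so a non-empty directory is expected: warn only.
    [ -n "${QUIET_CLEANUP_WARNINGS:-}" ] ||
      echo "Warning: could not remove $dir (an unlink failed)" >&2
    return 0
  fi
  # No unlink failed, so rmdir should have succeeded: keep this an error.
  echo "Error: rmdir($dir): Directory not empty" >&2
  return 1
}
```

With this shape, leftover files only produce a warning when a prior `unlink` already failed, and an unexplained `rmdir` failure still surfaces as an error.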
Would it make any sense to retry the failing `unlink`? Or perhaps try to collect some information as to why the `unlink` failed, and add it to the warning?
I see that as a first step. Crashing is definitely a problem. Emitting warnings and leaving files behind is not ideal, but it's better, and it will give us a sense of how often unlinking fails, as well as what kind of files. This will be useful if we decide to make NFS a completely supported target by adding a retry mechanism or something like that.
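For reference, a retry mechanism along those lines could look roughly like the shell sketch below. It is only an illustration under assumptions: the function name `unlink_with_retry`, the attempt count, and the delay are invented here, and nothing like this exists in Dune as far as this thread shows. It uses `rm -f --` so that a file that has already disappeared counts as removed, and it keeps the last error message so the warning can say why the `unlink` failed.

```shell
#!/bin/sh
# Hypothetical helper: retry a failing unlink a few times, and report the
# last error message if every attempt fails.
# Usage: unlink_with_retry FILE [ATTEMPTS] [DELAY_SECONDS]
unlink_with_retry() {
  file=$1
  attempts=${2:-3}
  delay=${3:-1}
  i=1
  err=
  while [ "$i" -le "$attempts" ]; do
    # Capture stderr so the reason for the failure can go in the warning.
    if err=$(rm -f -- "$file" 2>&1); then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "Warning: unlink($file) failed after $attempts attempts: $err" >&2
  return 1
}
```

A caller would treat a non-zero return as "leave the file behind and warn", which matches the first step described above while also recording what kind of files fail to unlink.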
We are getting tons of errors of the form
in our CI, which stores the dune cache on an NFS partition shared among several servers.
This setup used to work fine with dune 3.7.0, but now fails on version 3.11.1.
We managed to unblock ourselves by applying a workaround given in https://github.com/ocaml/dune/issues/8228, which consists in defining:
The problem seems to be the same, but not on Windows.
This could also potentially be related to https://github.com/ocaml/dune/issues/7917.