ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License

Error "rmdir(/home/.../.cache/dune/db/temp/dune_..._artifacts): Directory not empty" on an NFS setup #9071

Open rlepigre opened 1 year ago

rlepigre commented 1 year ago

We are getting tons of errors of the form

Error:
rmdir(/home/.../.cache/dune/db/temp/dune_..._artifacts): Directory not empty
-> required by _build/default/...
-> required by alias ...
-> required by alias ...

in our CI, which stores the dune cache on an NFS partition shared among several servers.

This setup used to work fine with dune 3.7.0, but now fails on version 3.11.1.

We managed to unblock ourselves by applying a workaround given in https://github.com/ocaml/dune/issues/8228, which consists of setting the following environment variable:

DUNE_CONFIG__BACKGROUND_DIGESTS=disabled

The problem seems to be the same as in that issue, except that we are not on Windows.

This could also potentially be related to https://github.com/ocaml/dune/issues/7917.

Blaisorblade commented 1 year ago

FWIW, dune docs don't promise support for NFS, but Emilio and @rgrinberg wrote it should work, and this has been stable for months.

rgrinberg commented 1 year ago

cc @emillon

emillon commented 1 year ago

Based on:

rmdir(/home/.../.cache/dune/db/temp/dune_..._artifacts): Directory not empty

I'd say that this comes from here:

https://github.com/ocaml/dune/blob/3.11.1/src/dune_cache/local.ml#L196

and that removing the temporary directory fails when exiting this block, because a file couldn't be removed. I'll check what we can do about that.

emillon commented 1 year ago

Not the prettiest of tests but I could reproduce by making an unlink call fail:

  $ export DUNE_CACHE_ROOT=`pwd`/cache
  $ export DUNE_CACHE=enabled

  $ cat > dune-project << EOF
  > (lang dune 1.0)
  > EOF

  $ cat > dune << EOF
  > (executable
  >  (name e))
  > EOF

  $ touch e.ml

  $ strace -e unlink -o x.strace dune build
  $ unlink_id=$(<x.strace head -n -1| tail -n +2|cat -n|grep _artifacts|head -n1|cut -f1)
  $ dune clean; rm -rf $DUNE_CACHE_ROOT
  $ strace -e fault=unlink:when="$unlink_id" -o x.strace dune build 2>&1| sed -e 's/dune_.*_artifacts/dune_*_artifacts/'
  Error:
  rmdir($TESTCASE_ROOT/cache/temp/dune_*_artifacts): Directory not empty
  -> required by _build/default/.dune/configurator

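I think that we should emit a warning in these cases instead of failing:

  • if no unlink failed, expect that rmdir should not fail
  • if an unlink failed, turn rmdir failure into a warning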

rlepigre commented 1 year ago

Cool, thanks for looking into this!

> I think that we should emit a warning in these cases instead of failing:
>
>   • if no unlink failed, expect that rmdir should not fail
>   • if an unlink failed, turn rmdir failure into a warning

Sounds like this would fix our problem, assuming there is a way to silence such warnings.

Would it make any sense to retry the failing unlink? Or perhaps try to collect some information as to why the unlink failed, and add it to the warning?

emillon commented 12 months ago

I see that as a first step. Crashing is definitely a problem. Emitting warnings and leaving files behind is not ideal, but it's better, and it will give us a sense of how often unlinking fails, as well as what kind of files. This will be useful if we decide to make NFS a completely supported target by adding a retry mechanism or something like that.
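
For readers following along, the policy discussed in this thread (warn when an rmdir failure follows a failed unlink, keep it an error otherwise) can be illustrated with a minimal shell sketch. This is not dune's code: the function name `cleanup_temp_dir` and the `UNLINK` hook are hypothetical, introduced only so that a failing unlink can be simulated without strace. The sketch also ignores dotfiles and performs no retry.

```shell
#!/bin/sh
# Illustration only: a sketch of the policy proposed in this thread,
# NOT dune's actual implementation.
# UNLINK is a hypothetical hook: the command used to remove one entry.
UNLINK=${UNLINK:-rm -f --}

# cleanup_temp_dir DIR
# Removes the (non-hidden) entries of DIR, then DIR itself.
# Returns 0 if DIR was removed, or if DIR was left behind after a failed
# unlink (downgraded to a warning).
# Returns 1 if rmdir failed although no unlink failed.
cleanup_temp_dir() {
  dir=$1
  unlink_failed=0
  for entry in "$dir"/*; do
    # An unmatched glob stays a literal pattern; skip it.
    [ -e "$entry" ] || [ -L "$entry" ] || continue
    if ! $UNLINK "$entry" 2>/dev/null; then
      unlink_failed=1
      echo "Warning: could not remove $entry" >&2
    fi
  done
  if rmdir -- "$dir" 2>/dev/null; then
    return 0
  fi
  if [ "$unlink_failed" -eq 1 ]; then
    # An unlink failed earlier, so a non-empty directory is expected.
    echo "Warning: leaving $dir behind" >&2
    return 0
  fi
  # No unlink failed, so rmdir was expected to succeed.
  echo "Error: rmdir($dir) failed" >&2
  return 1
}
```

Calling `cleanup_temp_dir "$tmp"` with `UNLINK` pointed at a command that fails for one file plays the same role as the strace fault injection in the reproduction above: the directory is left in place and only warnings are printed.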