ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License

Cache doesn't work on `opam install` #6162

Open Leonidas-from-XIV opened 2 years ago

Leonidas-from-XIV commented 2 years ago

Expected Behavior

Enabling the dune cache should speed up (re)building packages when (re)installing them via opam.

Actual Behavior

The cache folders are created, but no files are getting cached.

Reproduction

  1. `export DUNE_CACHE=enabled`
  2. `export XDG_CACHE_HOME=$(pwd)/dune-cache`
  3. `opam install some_packages that_use_dune`
  4. `tree $XDG_CACHE_HOME`
  5. Note that the folders were created, but they contain no files.
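
For reference, the same reproduction as a single shell session; the package name below is only a placeholder, any dune-based package should do:

```sh
# Minimal reproduction sketch; "yojson" stands in for any package built with dune.
export DUNE_CACHE=enabled
export XDG_CACHE_HOME="$(pwd)/dune-cache"

opam install yojson          # builds inside the opam sandbox
tree "$XDG_CACHE_HOME"       # cache directories exist, but no files were stored
```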

As @emillon and I debugged, this is due to the opam sandbox bind-mounting the cache folder inside the sandbox: hardlinks cannot cross the bind mount, so even though the cache is on the same file system, the cache files can't be hardlinked.

A possible workaround is `export DUNE_CACHE_STORAGE_MODE=copy`, which does populate the cache with files, but it would be useful to detect this case and handle it automatically.
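
Concretely, the workaround looks something like this (again with a placeholder package name):

```sh
# Force dune to copy files into the cache instead of hardlinking them.
export DUNE_CACHE=enabled
export DUNE_CACHE_STORAGE_MODE=copy
export XDG_CACHE_HOME="$(pwd)/dune-cache"

opam install yojson
tree "$XDG_CACHE_HOME"       # the cache is now populated with files
```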


emillon commented 2 years ago

cc @dra27 (the TL;DR version of this issue is that the opam sandbox prevents hardlinking the build to the dune cache)

rgrinberg commented 2 years ago

What's the error that you got? It would be nice to improve it to make the user aware of this possibility.

Leonidas-from-XIV commented 2 years ago

There is no error. It just creates the directories in the cache folder but doesn't populate the cache (thus running `opam install` again won't be able to use the cache either, since it never gets filled).

rgrinberg commented 2 years ago

Okay, but then dune silently fails to populate the cache? That seems like a serious issue.

Leonidas-from-XIV commented 2 years ago

I assume the cache implementation defaults to hardlinking, and if that fails it does not fall back to copying; instead it does nothing.

That can make sense sometimes, since silently falling back to copying can create a lot of extra I/O. On the other hand, hardlinking is essentially an optimization, so if it doesn't work I would expect a fallback to copying, which works fine in the opam sandbox.

Blaisorblade commented 2 years ago

@Leonidas-from-XIV Is `export DUNE_CACHE_STORAGE_MODE=copy` an effective workaround? That can be set in the dune config, but the environment variable has the advantage that it could be enabled from the sandboxing scripts! (EDIT: "workaround" = the lack of fallback is still an issue.)
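
For reference, the same setting in the dune config file (a sketch assuming dune 3.x config syntax, typically `~/.config/dune/config`) would look roughly like:

```
(lang dune 3.0)
; enable the shared cache and store entries by copying rather than hardlinking
(cache enabled)
(cache-storage-mode copy)
```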

Leonidas-from-XIV commented 2 years ago

Yes, it is a workaround; I mention it in the original message.

The scripts could set that environment variable, but AFAIK the scripts are part of the user's config, so people would need to update to a newer opam that ships the new scripts and also replace their existing sandboxing scripts with the new ones.

rgrinberg commented 2 years ago

> but AFAIK the scripts are part of the user's config, so people would need to update to a newer opam that ships the new scripts and also replace their existing sandboxing scripts with the new ones.

Isn't it possible to set the environment variable from opam somehow? I.e. have the dune package export it in some way.

rgrinberg commented 2 years ago

> I assume the cache implementation defaults to hardlinking, and if that fails it does not fall back to copying; instead it does nothing.

IMO that's pretty bad behavior. If the user asked for the cache to be enabled, we should complain loudly that the cache isn't functioning. @snowleopard what do you think?

emillon commented 2 years ago

> Isn't it possible to set the environment variable from opam somehow? I.e. have the dune package export it in some way.

It's possible using `setenv:`, but that's frowned upon. If we go that route, I think the preferred way is to update the sandbox scripts.
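
For illustration only, such a `setenv:` stanza in dune's opam file could look roughly like the following (a sketch of the discouraged approach, not a recommendation):

```
# illustrative only: would force the copy storage mode for every user of the package
setenv: [DUNE_CACHE_STORAGE_MODE = "copy"]
```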

And I agree that failing silently is not a good failure mode; some alternatives are:

- degrade (loudly?) to a more portable but less efficient technique like copying

Leonidas-from-XIV commented 2 years ago

Degrading sounds good, but I doubt we can do that loudly, at least not in the context of `opam install`, since opam captures all stdout/stderr, logs it away, and only displays it on build failure. It might still be worthwhile to show an error instead of silently doing something other than what the user wanted.

As a slightly different proposal, maybe `DUNE_CACHE_STORAGE_MODE` could be a priority list of strategies to try. `DUNE_CACHE_STORAGE_MODE=hardlink,copy` would silently fall back to copy (and be the default, so as not to break existing users' expectations), whereas `DUNE_CACHE_STORAGE_MODE=hardlink` would fail with a suitable error message (which would then also be displayed by opam).
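
Under this proposal (not current dune behaviour), usage would look something like:

```sh
# Proposed, not yet implemented: try hardlinking, silently fall back to copying.
export DUNE_CACHE_STORAGE_MODE=hardlink,copy

# Proposed: insist on hardlinking and fail with an error if it is unavailable.
export DUNE_CACHE_STORAGE_MODE=hardlink
```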

snowleopard commented 2 years ago

> IMO that's pretty bad behavior. If the user asked for the cache to be enabled, we should complain loudly that the cache isn't functioning. @snowleopard what do you think?

@rgrinberg I tend to view the cache as "just" an optimisation, and it's not obvious that failing hard is the right thing to do if this optimisation is not effective for some reason. Would a user prefer a slower but successful build or a hard failure that needs to be understood, investigated and fixed before getting the desired build results? I'm guessing that a large number of users would prefer a slower build but it's hard to be certain without conducting a survey.

Furthermore, I think users who care most about such optimisations are likely to be power users responsible for building large code bases. We might expect them to be able to debug issues around caches and other optimisations. Of course, it would be nice to give them better tools for that. Perhaps, failing hard on cache issues could be a configuration setting.

> degrade (loudly?) to a more portable but less efficient technique like copying

@emillon If we agree to think about the cache as an optimisation, then switching to a degraded mode sounds fine to me. Informing the user about that sounds good too.

Blaisorblade commented 2 years ago

If degrading solves all issues, that's great — the waste of disk space seems a much smaller concern. But just in case it doesn't:

> Furthermore, I think users who care most about such optimisations are likely to be power users responsible for building large code bases ... Would a user prefer a slower but successful build or a hard failure that needs to be understood, investigated and fixed before getting the desired build results? I'm guessing that a large number of users would prefer a slower build but it's hard to be certain without conducting a survey.

> Perhaps, failing hard on cache issues could be a configuration setting.

Yes please — and I'd default it to on. The failure message should just suggest to turn it off if desired. The reason is that at least Coq beginners are affected, and the failure is silent enough that power users don't even know upgrading dune is supposed to be fast, instead of taking hours (https://github.com/coq/coq/issues/16325 counts, IMHO — cc @RalfJung).

Generally, you don't need to be an expert to opam install a huge Coq library, especially not a CLI expert. In practice, is this exclusive to Coq users?