ipetkov / crane

A Nix library for building cargo projects. Never build twice thanks to incremental artifact caching.
https://crane.dev

symlinking of crates causes issues with macos sandbox #482

Open j-baker opened 11 months ago

j-baker commented 11 months ago

Hi!

I build on macOS with the Nix sandbox enabled. This is because I run a macOS build worker which pushes into a company-shared cache, and I want to isolate builds so as to make it as hard as possible for one malicious user to poison the cache.

The macOS sandbox definition that Nix uses contains all of a build's input store paths, and it has a relatively low maximum size; empirically the limit falls somewhere in the region of 500-800 Nix store paths.

I have a project with around 700 crate dependencies. Because the cargo vendoring process symlinks each crate, any cargo build derivation ends up depending on at least as many store paths as there are dependency crates, which effectively caps the number of crates one can depend on.

I'm wondering if this project would consider copying crates instead of symlinking them? I'd be happy to make an MR with the change.

The downside would be slightly greater disk usage; the benefit would be that bigger projects could be built on macOS with sandboxing!

ipetkov commented 11 months ago

Hi @j-baker thanks for the report!

As of today, downloading and unpacking crates happens in two separate derivations, which is undoubtedly contributing to the increase in the total derivation count. Sadly we cannot fold these into a single step because unpacking the tarball would result in a different hash (i.e. not the one in Cargo.lock), so we wouldn't be able to specify it up front.

One thing we could do is change vendorCargoDeps to take some kind of deferUnpack argument which would only download the tarballs from the registry and build up some kind of manifest mapping each source to its crate name/version. Then we could do the expansion in configureCargoVendoredDepsHook, if present!
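
To make that concrete, here is a very rough sketch of the manifest idea (the names and shapes here are hypothetical, not crane's actual API): a single derivation holds only the fetched tarballs plus a manifest, and extraction is deferred to the hook.

    # Hypothetical sketch; `crateTarballs` is assumed to be a list of
    # { tarball, name, version } attrsets, one per dependency.
    { pkgs, crateTarballs }:
    pkgs.runCommand "cargo-deps-manifest" { } ''
      mkdir -p $out
      ${pkgs.lib.concatMapStringsSep "\n" (c:
        # Record which tarball belongs to which crate; the configure hook
        # could later `tar -xzf` each entry into the build's vendor dir.
        "echo '${c.tarball} ${c.name}-${c.version}' >> $out/manifest"
      ) crateTarballs}
    ''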


Possible workaround ideas in the meantime:

j-baker commented 11 months ago

Hi, thanks for the reply. I realised I was a little unclear in my previous message. Here is my understanding.

In step 1, Crane downloads crates. Each crate gets its own derivation, which looks approximately like:

input: [] + binary dependencies
output: dep1
transitive dependencies: []
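
Illustratively (not crane's exact code), the download is a fixed-output fetch keyed by the checksum already recorded in Cargo.lock, so it has no crate store paths as inputs of its own, just the fetcher binaries:

    # Sketch of step 1: one fixed-output derivation per crate. The URL
    # shape is crates.io's download endpoint; the hash comes straight
    # from Cargo.lock, so Nix can verify the output up front.
    dep1 = pkgs.fetchurl {
      url = "https://crates.io/api/v1/crates/<name>/<version>/download";
      sha256 = "<checksum from Cargo.lock>";
    };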

In step 2, Crane extracts these crates:

input: [ dep1 ] + binary dependencies (tar etc).
output: dep1_extracted
transitive dependencies: [] (we have extracted the tar; there are no references to Nix store paths in the output).
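
As a sketch of step 2 (again hypothetical code): the only input is the tarball, and since the extracted files contain no /nix/store strings, Nix's reference scanner finds nothing to retain:

    # Crate tarballs have a single top-level <name>-<version>/ directory,
    # hence --strip-components=1.
    dep1Extracted = pkgs.runCommand "dep1-extracted" { } ''
      mkdir -p $out
      tar -xzf ${dep1} -C $out --strip-components=1
    '';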

In step 3, Crane 'vendors' these crates per registry.

input: [dep1, dep2, ..., depN]
output: [ln -s dep1 crates/dep1, ln -s dep2 crates/dep2, ..., ln -s depN crates/depN] (as registry1)
transitive dependencies: [dep1, dep2, ..., depN] (we have symlinks to those paths, and Nix picks up on this).
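
A sketch of step 3, showing why the references stick: each symlink target is literally a /nix/store path written into the output, and Nix scans outputs for exactly such strings:

    # Hypothetical per-registry vendoring via symlinks.
    registry1 = pkgs.runCommand "registry1" { } ''
      mkdir -p $out
      # The symlink target embeds dep1Extracted's store path, so it
      # becomes a runtime reference of registry1.
      ln -s ${dep1Extracted} $out/dep1
      # ...one symlink per crate in the registry...
    '';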

In step 4, Crane assembles the combined cargo vendor dir from the per-registry outputs:

input: [registry1, registry2, ..., registryN]
output: [ditto step 3]
transitive dependencies: [registry1, registry2, ..., registryN, dep1, dep2, ..., depN, etc.]
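
A sketch of step 4: symlinking the registries propagates both the registry paths and, through their references, every crate path into the closure that the final cargo build must be sandboxed against:

    # Hypothetical final vendor dir (crane also generates cargo config
    # pointing builds at these directories, omitted here).
    cargoVendorDir = pkgs.runCommand "cargo-vendor-dir" { } ''
      mkdir -p $out
      ln -s ${registry1} $out/registry1
      # ...one symlink per registry...
    '';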

And from this point on, actual cargo commands run.

When the sandbox for each build is constructed, it is granted access to all transitive dependencies, as these are the totality of what might be depended on. On Linux this corresponds to bind-mounting the paths into the sandbox.

I don't believe the two-phase download-and-extract contributes to the problem, because no transitive dependency is passed on.

The problem I believe I'm facing is that, with enough input crates from a registry, the number of store paths the output depends on becomes far too large for macOS.

One brute force 'fix' to this problem is https://github.com/j-baker/crane/commit/2087e8b37c2f85a6b626c178046cbc7ae9dd94ba. It is not cost-free: it converts the symlinking of directories into a directory traversal, but it is a one-liner, so it has that going for it. This only works because, while my total sandbox size is too large, the sandbox size contributed by any single registry is not over the limit for me, right now.
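
Paraphrasing the shape of that change (this is not the literal diff): in the per-registry step, copy with dereference instead of symlinking, so the registry output carries no references to the individual crate paths:

    # Hypothetical: step 3 with copying. `cp -rL` walks and copies every
    # crate directory (the traversal cost mentioned above), but the
    # output then contains no /nix/store strings for Nix to retain.
    registry1 = pkgs.runCommand "registry1" { } ''
      mkdir -p $out
      cp -rL ${dep1Extracted} $out/dep1
    '';

The per-registry derivation still lists all of its crates as inputs, which is why this only helps while each individual registry stays under the sandbox limit.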

There are many levels of sophistication one could apply to reduce the likelihood of hitting this problem without adding runtime cost; however, many of them would likely lead to unnecessary complexity.

My sense, however, is that one workable solution (probably a few lines of code on top of what currently exists) is:

  1. Partition crates into some fixed number of buckets (e.g. 256). Using a hash function for the partition would ensure that e.g. updating a single crate only changes a single bucket, but this is not a hard requirement - the main point is to bound the number of store paths any single derivation references.
  2. Extract each bucket's crates together into a shared output directory (extracting, not symlinking), and symlink from there.

This would, in effect, combine the extraction and vendoring steps, and it would reduce the number of inputs to any one derivation.
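
A rough sketch of what that could look like (all names hypothetical; the bucket id is simply the first two hex characters of a hash of the crate name, giving 256 stable buckets):

    { pkgs, crates }:  # crates: list of { tarball, name, version }
    let
      inherit (pkgs) lib;
      # Stable partition: updating one crate only changes its own bucket.
      bucketOf = c: builtins.substring 0 2 (builtins.hashString "sha256" c.name);
      buckets = builtins.groupBy bucketOf crates;
      # One derivation per bucket, extracting (not symlinking) its crates.
      mkBucket = id: cs: pkgs.runCommand "crates-bucket-${id}" { } ''
        mkdir -p $out
        ${lib.concatMapStringsSep "\n" (c: ''
          mkdir -p $out/${c.name}-${c.version}
          tar -xzf ${c.tarball} -C $out/${c.name}-${c.version} --strip-components=1
        '') cs}
      '';
      bucketDrvs = lib.mapAttrs mkBucket buckets;
    in
    # The vendor dir now symlinks into at most 256 bucket paths instead
    # of one path per crate, bounding the closure the sandbox must hold.
    pkgs.runCommand "vendored-crates" { } ''
      mkdir -p $out
      ${lib.concatStringsSep "\n" (lib.mapAttrsToList (id: cs:
        lib.concatMapStringsSep "\n" (c:
          "ln -s ${bucketDrvs.${id}}/${c.name}-${c.version} $out/${c.name}-${c.version}"
        ) cs
      ) buckets)}
    ''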