nix-community / crate2nix

rebuild only changed crates in CI with crate2nix and nix
https://nix-community.github.io/crate2nix/
Apache License 2.0
382 stars 89 forks source link

IFD is crazy slow due to copying the entire tree #347

Open Ten0 opened 6 months ago

Ten0 commented 6 months ago

The src directory for IFD is not filtered: https://github.com/nix-community/crate2nix/blob/cf034861fdc4e091fc7c5f01d6c022dc46686cf1/tools.nix#L50

This asks to copy the entire workspace tree to the nix store, performing no filtering. On large repositories containing build artifacts, this is impractical: it may sometimes represent hundreds of gigabytes, and causes the build to freeze forever while it's trying to copy those hundreds of gigabytes to the store, when all one really needs is all the Cargo.tomls and Cargo.lock.

Fixing

EDIT: see updated solution below instead

It looks like this may be fixed by using cleanSourceWith:

            src = lib.cleanSourceWith {
              inherit src;
              filter = path: type: (
                let baseName = baseNameOf (toString path);
                in
                (
                  (type == "directory" && baseName != "target" && baseName != "node_modules")
                  || (baseName == "Cargo.toml" || baseName == "Cargo.lock")
                ) && (lib.cleanSourceFilter path type) # + other basic filters from nixpkgs
              );
            };

Although this still has the annoying property that the derivation hash will depend on a bunch of unnecessary empty directories (the hash shouldn't depend on any empty directory either), it makes it run reasonably, so it would still be a clear improvement.

Alternatively lib.fileset may be an option.

Also cleaning up empty directories

This post seems to suggest that lib.fileset may by default already be filtering empty directories.

Alternatively

According to this post, adding another layer to further recursively clean unnecessary empty directories:

fullyClean = stdenvNoCC.mkDerivation {
    inherit (mostlyClean) name;
    src = mostlyClean;
    phases = ["unpackPhase" "buildPhase" "installPhase"];
    buildPhase = "find . -type d -empty -exec rmdir {} \+";
    installPhase = "cp -pr --reflink=auto -- . $out";
    # This is such a quick operation that queing a remote builder isn't worth
    # the effort; `preferLocalBuild' will cause this to be run locally.
    preferLocalBuild = true;
  };

may allow preserving the source hash regardless of whether new empty directories are created, but I'm not sure what prevents this from being source addressed so maybe I misunderstand something. Maybe that + another layer of IFD re-copying the cleaned source would fix it but that seems incredibly dirty.

The post also presents another option that does not have this issue, but which is obviously bad "because nested directories at depth N are redundantly processed N - 1 times.".

It looks like doing the regular directory traversal that filterSource does, but only creating a directory once we're about to put an actual file inside, is something that should already exist in Nix though... I can't be the first looking for this, so it must be in lib.fileset...

Ten0 commented 6 months ago

After some testing, since lib.fileset does remove empty directories, the following works:

            src = lib.fileset.toSource {
              root = src;
              fileset = lib.fileset.fromSource (lib.cleanSourceWith {
                inherit src;
                filter = path: type: (
                  let baseName = baseNameOf (toString path);
                  in
                  (
                    (type == "directory" && baseName != "target" && baseName != "node_modules")
                    || (baseName == "Cargo.toml" || baseName == "Cargo.lock" || baseName == "lib.rs" || baseName == "main.rs")
                  ) && (lib.cleanSourceFilter path type) # + other basic filters
                );
              });
            };

It's still not completely ideal because we actually don't care about the contents of the lib.rs and main.rs files, we only care about them being there or not. (They are used by cargo metadata to determine crates automatically when there are no [[lib]] or [[bin]] targets). Another point to solve may be how bench/tests targets are inferred, so we may need to copy a bit more of the tree here as well. (Maybe we want to copy any .rs file that's in a tests folder but that on the other hand would be too much...)

Pandapip1 commented 5 months ago

Potential solution: two additional parameters, hasLibRs and hasMainRs. Then, pass in only the Cargo.lock.