rust-lang / cargo

The Rust package manager
https://doc.rust-lang.org/cargo
Apache License 2.0

Redefine `CARGO_TARGET_DIR` to be only an artifacts directory #14125

Open kornelski opened 4 months ago

kornelski commented 4 months ago

Problem

There are a couple of issues with the CARGO_TARGET_DIR that are seemingly in conflict with each other:

  1. Multiple locations of target dirs complicate excluding them from backups and full-disk search, cleanup of the temp files, moving temp files to dedicated partitions, out of slow network drives or container mounts, etc. Users don't like that the target dir is huge, and multiple instances of it add up to a lot of disk space. Users would prefer a central location to ease management of the temp files, and also to dedupe/reuse dependencies across many projects.

  2. People (and tools) are relying on a relative ./target directory being present to copy or run built files out of there. Additionally, users may not want to configure a shared CARGO_TARGET_DIR due to risk of file name conflicts between projects.

However, the dilemma between 1 and 2 exists only because Cargo uses CARGO_TARGET_DIR for two different roles:

  1. A cache for all intermediate build products (a place where crates.io crates are built, where compiler-private temp files are) which aren't project-specific, and/or files that users don't need to access directly.
  2. A location for user-facing final build products (artifacts) that users expect to be there and need to access.

Proposed Solution

So to satisfy both uses, I suggest changing the thinking about what the role of CARGO_TARGET_DIR should be. Instead of thinking about where to put the same huge all-purpose mixed CARGO_TARGET_DIR, think about how to deduplicate and slim down CARGO_TARGET_DIR, and move everything non-user-facing out of it.

Instead of merging or sharding the CARGO_TARGET_DIR as-is with all of its current content, and adding --artifact-dir as a separate place where final products are copied to, make CARGO_TARGET_DIR be the artifact dir (without copying).

As long as the CARGO_TARGET_DIR dir is the place for all of the build files, of all crates including all the crates.io and local builds, with all the caches, all the temp junk, then this is going to be a problematic large directory that needs to be managed. But if the purpose of the ./target dir was changed to be only for user-facing files (files that users can name, and would access via ./target path themselves), then this directory would be relatively small, with a good reason to stay workspace-relative.

What isn't an intermediate build product? (and should stay in ./target)

So generally files that users build intentionally and may want to access directly (run themselves, or package up for distribution), and files that users may need to configure their IDE and debugger to find inside the project.

Crates in [patch.crates-io] with a path are a gray area, and might also have their artifacts included in the ./target dir (but in some way that avoids clobbering workspaces' files).

What isn't a final build product, and doesn't belong in ./target:

All of these should be built in some other shared build cache dir (one that is not inside CARGO_TARGET_DIR), configurable by a new option/env var.

Registry dependencies would get unique paths derived from rustc version + package IDs + enabled features (so that different crates using different features don't invalidate each others' caches all the time). This would enable sharing built crates.io dependencies across all projects for the same local user, without also causing local workspaces to clobber each others' CARGO_TARGET_DIR/profile/product paths. Temp directories for local projects would need some hashed paths in the shared build/temp dir too.
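The path derivation described above can be sketched as follows. This is only an illustration, not how Cargo computes its own hashes: the function name `cache_dir_for`, the cache root, and the sample rustc version and package ID strings are all hypothetical, and the standard library's `DefaultHasher` stands in for whatever stable hash a real implementation would need.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::path::PathBuf;

/// Hypothetical helper: derive a shared-cache directory for one registry
/// dependency from rustc version + package ID + enabled features.
fn cache_dir_for(
    cache_root: &str,
    rustc_version: &str,
    package_id: &str,
    features: &[&str],
) -> PathBuf {
    // Sort and deduplicate so that feature order does not change the key.
    let mut features: Vec<&str> = features.to_vec();
    features.sort_unstable();
    features.dedup();

    let mut hasher = DefaultHasher::new();
    rustc_version.hash(&mut hasher);
    package_id.hash(&mut hasher);
    features.hash(&mut hasher);

    // Illustration only: DefaultHasher output is not guaranteed to be stable
    // across Rust releases, so a real cache would need a fixed hash function.
    PathBuf::from(cache_root).join(format!("{:016x}", hasher.finish()))
}

fn main() {
    // Sample inputs; the concrete values are placeholders.
    let with_derive = cache_dir_for(
        "/tmp/cargo-build-cache",
        "rustc 1.80.0",
        "serde@1.0.204",
        &["derive", "std"],
    );
    let without_derive = cache_dir_for(
        "/tmp/cargo-build-cache",
        "rustc 1.80.0",
        "serde@1.0.204",
        &["std"],
    );
    println!("{}", with_derive.display());
    println!("{}", without_derive.display());
}
```

Two projects that enable the same features on the same crate and compiler land in the same directory and can share the build, while a project with a different feature set gets its own directory instead of invalidating the other's cache.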

Advantages

Notes

No response

kornelski commented 4 months ago

In terms of incompatibilities, only a few come to my mind:

epage commented 4 months ago

@poliorcetics from https://github.com/rust-lang/rfcs/issues/3664#issuecomment-2183534143

Yes, the issue should be moved to cargo I think.

I'm not convinced at all this won't break backwards compatibility in some way.

It makes ./target contain only workspace-unique files, which makes it justified for every workspace to have one.

And I don't want one in any cargo project while still keeping isolation, which is entirely different from what you are proposing.

It enables moving registry deps to a shared build directory, without side effect of local projects overwriting each others' files. Sharing of dependencies matches users' expectation that the same dependencies shouldn't be redundantly rebuilt for each local project.

Once again, the RFC I wrote and the original issue I drew inspiration from do not ask for that; they ask for the opposite: I and many others want separate target dirs for every project.

There are probably as many reasons as there are users for it, but common ones are different sets of features amongst projects, sharing of build caches for specific projects, CI builds wanting to separate projects for security, or one project pinning 1.2.3 in a dep B of a dep A and the other project pinning 1.2.4: A can have the same version for both but its dependencies won't, and cargo is not made to handle the case at the moment.

You are fundamentally solving a different issue, one that the RFC I posted is not trying to solve.

epage commented 4 months ago

Overall, I see this as a solution alternative to #6790 and had recommended we have that conversation there (or on internals).

What isn't an intermediate build product? (and should stay in ./target)

This is likely going to be the most difficult topic to work through and we'll need to make sure we get wide input on this from #6790 users and others.

Registry dependencies would get unique paths derived from rustc version + package IDs + enabled features (so that different crates using different features don't invalidate each others' caches all the time). This would enable sharing built crates.io dependencies across all projects for the same local user, without also causing local workspaces to clobber each others' CARGO_TARGET_DIR/profile/product paths. Temp directories for local projects would need some hashed paths in the shared build/temp dir too.

imo this is out of scope for this proposal (see #5931) and we should keep this focused so as not to get distracted.

epage commented 4 months ago

I see this as a re-framing of the problem, addressing #6790 and rust-lang/rfcs#3371

Instead of us defining a new artifact-dir, we say target-dir is the artifact directory and move everything else out into a "working directory".

Potential names

Cargo script would default its target-dir as its working-dir

This would need input from

epage commented 4 months ago

This would need an audit of ways we publicly treat the target dir as a working dir, like exposing CARGO_TARGET_TMPDIR

ensc commented 3 months ago

Sharing sources over NFS adds another incompatibility when the ability to move ./target out of the sources is eliminated.

Every modern build system allows keeping sources and build results separate (and users and tools do not have problems with it). I do not think that cargo should go backwards and enforce a fixed ./target directory.

kornelski commented 3 months ago

I'm not suggesting to force it to always be ./target. The CARGO_TARGET_DIR can continue to move this directory elsewhere. The main point is to reduce severity of problems that the current high-churn high-volume content of this dir causes.
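For tools that currently hardcode ./target, a minimal sketch of looking the directory up instead might be the following. It is deliberately simplified: the helper name `resolve_target_dir` is hypothetical, it only considers the CARGO_TARGET_DIR value passed in, and it ignores `--target-dir` and the `build.target-dir` config key. It also assumes a relative value is resolved against the current directory.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical, simplified lookup of the target directory.
/// Only the CARGO_TARGET_DIR value is considered; command-line flags and
/// config files are ignored in this sketch.
fn resolve_target_dir(env_value: Option<&str>, cwd: &Path, workspace_root: &Path) -> PathBuf {
    match env_value {
        // `join` returns an absolute value unchanged and resolves a
        // relative one against `cwd` (assumed behaviour for the env var).
        Some(dir) => cwd.join(dir),
        // Default location when nothing is configured.
        None => workspace_root.join("target"),
    }
}

fn main() {
    let env_value = std::env::var("CARGO_TARGET_DIR").ok();
    let cwd = std::env::current_dir().expect("current directory is readable");
    // Assumption: the tool is run from the workspace root.
    let target = resolve_target_dir(env_value.as_deref(), &cwd, &cwd);
    println!("{}", target.display());
}
```

In practice, reading the `target_directory` field from `cargo metadata --format-version 1` is likely more robust, since it reflects whatever Cargo itself resolved.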

epage commented 2 months ago

We talked about this in today's Cargo team meeting.

Our care abouts include

While we acknowledged the potential for user confusion with CARGO_TARGET_TMPDIR, we were fine with it not being associated with target-dir

The general shape of what we proposed in the meeting is...

Shiny future

target-work-dir: Home of intermediate artifacts

target-artifact-dir: Home of final artifacts

Legacy target-dir

Other

Path to Shiny Future

target-work-dir

In theory, we could trivially do this by

Initial default is "{workspace-root}/target"

Template supports

Steps

  1. Implement
  2. Call for testing
  3. Stabilize with opt-in to new location
  4. Call for testing
  5. Switch to opt-out

Notes:

target-artifact-dir

Assumption: target-work-dir takes some pressure off of target-artifact-dir

Defaulted to final location ("{workspace-root}/target/{legacy-platform}/{profile}")

Template supports

Needs all of the details in the tracking issue to be finalized.
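To make the templated defaults above concrete, here is a rough sketch of placeholder expansion. The function `expand_template` and the sample values are hypothetical. The placeholder names are taken from the two defaults quoted above, and their exact meaning (in particular `{legacy-platform}`) is an assumption, not something the plan has finalized.

```rust
use std::collections::HashMap;

/// Replace each `{name}` placeholder with its value from `vars`.
/// Returns an error for an unknown or unterminated placeholder.
fn expand_template(template: &str, vars: &HashMap<&str, &str>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(open) = rest.find('{') {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        let close = after
            .find('}')
            .ok_or_else(|| format!("unterminated placeholder in {template:?}"))?;
        let name = &after[..close];
        let value = vars
            .get(name)
            .ok_or_else(|| format!("unknown placeholder {{{name}}}"))?;
        out.push_str(value);
        rest = &after[close + 1..];
    }
    // Copy whatever literal text follows the last placeholder.
    out.push_str(rest);
    Ok(out)
}

fn main() {
    // Sample values only; how Cargo would fill these in is not specified here.
    let mut vars = HashMap::new();
    vars.insert("workspace-root", "/home/user/project");
    vars.insert("legacy-platform", "x86_64-unknown-linux-gnu");
    vars.insert("profile", "debug");

    let work_dir = expand_template("{workspace-root}/target", &vars).unwrap();
    let artifact_dir =
        expand_template("{workspace-root}/target/{legacy-platform}/{profile}", &vars).unwrap();
    println!("target-work-dir     = {work_dir}");
    println!("target-artifact-dir = {artifact_dir}");
}
```

Rejecting unknown placeholders, rather than passing them through, would let new template variables be added later without silently changing the meaning of existing configs.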

Contingencies

If target-work-dir takes more than N time (1 year?) to stabilize, then we re-evaluate approving rust-lang/rfcs#3371. This is to try to balance the needs of the people who want something like rust-lang/rfcs#3371 now vs (1) the long-term inapplicability of that RFC and (2) the lack of stable "blessed" workflow for users (telling users to use solution X for several months and then telling them that is no longer "right" and they need to use solution Y).

Alternatives

epage commented 2 months ago

Something we overlooked in the above analysis is other "artifacts". In particular, I'm thinking of cargo package which places files in $CARGO_TARGET_DIR/package. I'm assuming at least the .crate file location is part of our stable API. We'd need to decide about the files laid out on disk next to it.

Ways of solving this

clarfonthey commented 3 weeks ago

Just commenting here as I'm dealing with my own issues regarding the target dir, but personally, while it's nice for target to contain final build products only by default, I still will want all of these build products out of the target directory for the sake of excluding them from backups and snapshotting.

For some context, I use ZFS snapshots as a form of fast local backups; not long-term backups in case of hardware or extreme software failure, but decent short-term backups in case I accidentally delete a file or mess up an update. However, I explicitly go out of my way to exclude as many things as possible from auto-snapshotting that qualify as "cache" because they can very quickly clog up my disk if I'm not careful.

(Also: since snapshotting is a filesystem-level feature, I can't just say "don't save files of this type in snapshots" since snapshotting works by instantly freezing the state of the FS into a snapshot, and doesn't copy files over like a long-term backup would.)

For example, today I just deleted 200 GiB of snapshots of target directories. Not the current target directories, but past versions of them from previous snapshots. Snapshots are good for incremental stuff like code because they're copy-on-write, but the contents of a binary are effectively random to any snapshotting tool and they'll end up being fully duplicated every time they're snapshotted, and that means you can end up with several times that amount of data in snapshots until everything eventually gets old enough to be deleted. The "effectively random" part also applies especially to the final products, since while crates that don't change won't change in their compiled artifacts, the final linked products definitely will.

So, as far as I'm concerned, moving the final build products back into the workspace without also having the option to keep them out effectively un-solves the problem that moving the target directory was meant to solve. After all, the final build products, modulo LTO (which isn't really going to happen for debug builds) will effectively be the same size as all the intermediate products, so, that means that about half the disk usage will not be saved. (I'm extremely approximating here; the point is that it's a considerable amount of the disk usage, even if it's not half. Even 10% of the size is still a lot when you consider that these are being multiplied across several snapshots.)

And note that yes, other languages like Node and Python also have this exact same problem, but I don't think that other languages' inability to solve this problem forgives Rust not solving it. Also, even though node_modules can be massive, hundreds of GiB of binary artifacts is pretty hard to beat.

I love the idea of keeping intermediate products deduplicated and in one place. I just don't want that to obscure the goal of having the final products also somewhere else too.

kornelski commented 3 weeks ago

After all, the final build products, modulo LTO (which isn't really going to happen for debug builds) will effectively be the same size as all the intermediate products

No, the intermediate products are usually many many times larger. Not just double, they can be 1000× larger! On the project I'm currently working on, a clean debug build of a 20MB executable creates 2300MB of junk in target/. After working on it for a while, it grows to 13GB of temp data for a 20MB result.

There are often many duplicate copies of libstd and other dependencies in each .rlib file. There is a lot of duplication across code units. There are plenty of completely unused objects included in the dependencies, which are stripped even without LTO (rust relies on the --as-needed flag). There are also often separate copies for build dependencies, builds with cfg(test), and the incremental build cache.
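By the figures quoted above, 2300MB of intermediate output for a 20MB executable is a ratio of 115×, and 13GB is about 650× (taking 1GB as 1000MB). To check the ratio on another project, a small sketch like the following sums the size of the subdirectories that Cargo currently uses for intermediate output under a profile directory. The `target/debug` path is an assumption about the default layout, and `dir_size` is a hypothetical helper.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Total size in bytes of all regular files under `dir`.
/// Symlinks are not followed.
fn dir_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        // DirEntry::metadata does not traverse symlinks.
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Assumed default location of a debug build; adjust as needed.
    let profile_dir = Path::new("target/debug");
    // Subdirectories holding intermediate output in the current layout.
    for sub in ["deps", "build", "incremental", ".fingerprint"] {
        let path = profile_dir.join(sub);
        if path.is_dir() {
            println!("{:>14} bytes  {}", dir_size(&path)?, path.display());
        }
    }
    Ok(())
}
```

Comparing these totals with the size of the final executable gives a per-project version of the ratio discussed here.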

epage commented 3 weeks ago

@clarfonthey the plan calls for both target-work-dir and target-artifact-dir to be templated so you can move their content out. It does not call out templating of target-dir as it calls for phasing that out. If we wanted to templatize it as a convenient way of setting both of the above, we'd likely want to wait for the above so we set the precedent for what people are generally expected to work with, rather than shifting expectations around on the user.

mathstuf commented 3 weeks ago

I still will want all of these build products out of the target directory for the sake of excluding them from backups and snapshotting.

FWIW, I've had decent results with target being a symlink to somewhere that is not subject to backups/snapshotting. cargo clean will remove the symlink, but everything else I've used is largely fine. Note that this puts intermediate and final artifacts into the same bucket.

nazar-pc commented 2 weeks ago

I'm in the same boat as @clarfonthey, but with BTRFS snapshots, which I create every 15 minutes and then stream to longer-term storage. My debug builds are easily 2.5G and I clean up target that is 300-700G basically every week.

I created ~/.cache/cargo/{git,registry,target} for this reason, where ~/.cache is in a separate subvolume that is not subject to snapshotting/backups. ~/.cargo/{git,registry} are symlinks now (still hoping cargo starts respecting XDG one day) because they also grow to substantial sizes (currently 1.08M files and 24.1G together).

The proposed separation (especially templating for both new options) should work nicely for such a use case; setting CARGO_TARGET_DIR in .profile has major consequences for build times when jumping between projects.

Excited!