pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0

Move Pants pids and workdir folders outside repos to avoid inhibiting some git-stash usage scenarios #21371

Open stormfish-sci opened 2 months ago

stormfish-sci commented 2 months ago

Is your feature request related to a problem? Please describe.

By default, pants stores metadata inside the repository in folders named .pants.d and .pids, and perhaps in other places. This inhibits the use of certain git-stash features, namely "git stash push -a", because that command triggers a git clean that removes all git-ignored and untracked files.

Steps to reproduce:

Issues observed: This will likely result in conflicts and an inability to restore stashes fully. Also, pants may lose track of pantsd instances since the pid files are removed.

Describe the solution you'd like

In situations where a developer needs to stash all content, including untracked and git-ignored files, it would be good to not have git stash undermine pants through destruction of .pids and related pants data. Similarly, it would be nice to keep large and complex pants data in git-ignored files out of the stash.

The use of git-clean can also be problematic in pants repos for the same reason.

The settings in pants.toml, such as pants_subprocessdir, pants_workdir, and pants_physical_workdir_base, might provide a solution. However, manually setting workdir and other settings for each repo seems to require care to prevent conflicts if the same pants.toml is copied between repos or when two instances of the same repo are checked out on the same system.

It may be worthwhile to consider a different approach for storing pants metadata such as using a centralized folder, maybe ~/.pants/, in which pants could create and store its metadata in a hierarchy that automatically de-conflicts data from multiple pants repos (e.g. sha1 hash of repo root directory). This might look like this:

repo 1: /home/johnd/git/repo_alpha

repo 2: /home/johnd/git/repo_bravo

In this manner pants could easily find its metadata in something like the following structure:

repo_alpha metadata found here:

/home/johnd/.pants/repos/3498441313ee0ea7efd65877bfe338e5483ff698/pids
/home/johnd/.pants/repos/3498441313ee0ea7efd65877bfe338e5483ff698/workdir

repo_bravo metadata found here:

/home/johnd/.pants/repos/9adb745930592c03aa79acd5f1c7eb960b423fc3/pids
/home/johnd/.pants/repos/9adb745930592c03aa79acd5f1c7eb960b423fc3/workdir
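To make the proposed layout concrete, here is a minimal Python sketch of how a per-repo directory could be derived from the repo root. It is only an illustration of the idea above, not how Pants works today: the helper name `metadata_dirs` and the `~/.pants` base are hypothetical, and hashing the path string as given (without resolving symlinks) is an assumption.

```python
import hashlib
from pathlib import Path


def metadata_dirs(repo_root: str, base: Path) -> dict[str, Path]:
    """Map a repo root to per-repo metadata directories under `base`.

    Hypothetical helper: hashes the repo root path string as given, so two
    checkouts at different locations get different directories.
    """
    digest = hashlib.sha1(repo_root.encode("utf-8")).hexdigest()
    repo_dir = base / "repos" / digest
    return {"pids": repo_dir / "pids", "workdir": repo_dir / "workdir"}


# Hypothetical central location taken from the proposal above.
base = Path.home() / ".pants"
alpha = metadata_dirs("/home/johnd/git/repo_alpha", base)
bravo = metadata_dirs("/home/johnd/git/repo_bravo", base)
print(alpha["pids"])
print(bravo["workdir"])
```

Because the key is the checkout path rather than anything inside pants.toml, copying the same pants.toml between repos would not by itself cause two checkouts to share a directory.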

As an aside, storing dist build data (pex, etc) in this structure might also be useful as a git stash needing to store all the git-ignore'd build data could contribute to large stashes.

Describe alternatives you've considered

As stated above, manually configuring pants.toml settings to use folders outside the repo could work. I fear that this would make pants fragile if, again, multiple repos use copies of the same pants.toml file or the same repo is checked out twice (or uses git worktrees). The different pants runs would try to use the same pids and workdir folders and potentially conflict. This potential issue would be mitigated by the sha1 or similar approach described above.

Additional context

I could be wrong and perhaps the caching of pants metadata in stashes is desirable, so I would value insights from others who may have encountered this situation.

benjyw commented 2 months ago

pants_physical_workdir_base is probably a decent solution here. Contra your fear above, the workdirs are differentiated between different local checkouts, even of the same repo, so there should be no conflicts. But there are two issues:

  1. It is implemented as a symlink from an in-repo (presumably gitignored) location, and that symlink will get cleaned. However I think it will get automatically recreated on next run, so maybe that is fine.
  2. More significantly, pants_subprocessdir is not handled this way, and it would need to be made to. I don't think that is super hard, if you want to tackle it.

There is something to be said about keeping this data outside the git tree in the first place. For one thing, you don't have to modify your .gitignore to accommodate the tool. I can't think off hand of any negative consequences, but there may be some subtle ones. For one thing Pants would need to have write access to that location, but I suppose since you can configure it, that is not a big deal.

benjyw commented 2 months ago

So basically I think moving all this outside the git worktree entirely makes sense, but I would raise this on #development in slack to see if anyone can think of any downsides or challenges.

cburroughs commented 2 months ago

Could you elaborate a bit on the git stash push -a workflow?

I'm not suggesting there are hard and fast rules here; Pants, for example, places the cache under XDG_CACHE_HOME. But I'm not sure I've used any tool that is so scrupulous as to avoid writing any likely-to-conflict files to the checked out directory.

stormfish-sci commented 2 months ago

> Could you elaborate a bit on the git stash push -a workflow?
>
> * Maven and other JVM tools put files in the repo under `target`. Rust/cargo have a similar convention.
> * Python tools usually place a virtualenv in the repo, such as `.venv`, and the `__pycache__` directories would also be in the repo.
> * Node has `node_modules`.
> * When last I used C, object files were next to the .c files.
>
> I'm not suggesting there are hard and fast rules here; Pants, for example, places the cache under XDG_CACHE_HOME. But I'm not sure I've used any tool that is so scrupulous as to avoid writing any likely-to-conflict files to the checked out directory.

Great question. In our case we have Unity3D applications in our monorepo alongside our python code. We haven't yet tried to integrate them with pants, but that is not important here: we run python code via pants in the same monorepo while doing Unity dev, and everything lives together quite well.

So on to the workflow. Probably akin to the systems you referenced, Unity3D builds very large cache files of pre-compiled code, intermediate data, and library files that need to be excluded from the repo. We had a few painful incidents where developers switched branches to fix a bug or help out another developer, and when they did so, their cache data was negatively impacted (or corrupted) by the differences in the code from the other branch.

Enter "git stash --all"

Git stash with the --all flag will make a copy of every single untracked and git-ignored file in your repository, can take a descriptive name, and then enables you to pull it all back out whenever you need it. This means that developers can save their entire working environment, including all the critical non-repository data, so that they have a sterile working tree before switching to another dev branch. When they are done, they can stash away the peripheral data needed to work in that branch in a new stash, so they can switch back and forth as needed.

Here's the magic: when the developer returns to their original branch and applies the stash, every single untracked and git-ignored file that was part of their workspace is returned to precisely the state it was in before they left for the other branch!

It's been a game-changer for us. The one downside has been a few hiccups with pants due to its files getting caught up in the machinery.

gruzewski commented 1 month ago

The current behaviour also makes working with git worktree tricky as each directory spins up a new daemon.

514.281MB       pantsd  [/home/user/worktrees/feature-1]
540.883MB       pantsd  [/home/user/worktrees/feature-2]
3019.58MB       pantsd  [/home/user/worktrees/feature-3]

I am posting here as another data point.

stormfish-sci commented 1 month ago

This is a great use case. Thanks for highlighting it.

The .git files in each worktree would be able to guide pants back to the main repository and in turn to the correct working directory for pantsd, and other pants artifacts.
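As a rough illustration of that lookup, here is a minimal Python sketch. It relies on the fact that in a linked worktree `.git` is a plain file holding a single `gitdir: <path>` line; the function name `main_repo_from_gitfile` is hypothetical, and the sketch assumes the recorded path is absolute and ends in `.git/worktrees/<name>`.

```python
from pathlib import Path
from typing import Optional


def main_repo_from_gitfile(gitfile_text: str) -> Optional[Path]:
    """Return the main repository root named by a linked worktree's .git file.

    Hypothetical helper; returns None when the text does not have the
    expected "gitdir: <main repo>/.git/worktrees/<name>" shape.
    """
    line = gitfile_text.strip()
    prefix = "gitdir:"
    if not line.startswith(prefix):
        return None
    gitdir = Path(line[len(prefix):].strip())
    # Expected shape: <main repo>/.git/worktrees/<worktree-name>
    if gitdir.parent.name != "worktrees" or gitdir.parent.parent.name != ".git":
        return None
    return gitdir.parent.parent.parent


example = "gitdir: /path/to/main/repository/.git/worktrees/worktree-name\n"
print(main_repo_from_gitfile(example))
```

Submodules also use a `.git` file with a `gitdir:` line, but it points under `.git/modules/`, so the shape check above deliberately returns None for them.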

outterback commented 1 month ago

The .git file in the worktree contains a path of the form

/path/to/main/repository/.git/worktrees/worktree-name

which might require some parsing. Another alternative is to use the `git worktree list` command.

It lists details of each worktree, with the main worktree listed first, and gives you output of the form

/path/to/main/repository                                 git_hash [branchname]
/path/to/main/repository.worktrees/potentially/long/name git_hash [branchname]

which might be easier to parse.
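For scripting, `git worktree list --porcelain` prints one `key value` line per attribute with a blank line between worktrees, which avoids splitting the column layout above. A minimal Python sketch follows; the function name `parse_worktree_list` is hypothetical, and the sample text uses placeholder hashes and branch names rather than real output.

```python
def parse_worktree_list(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` output into one dict per worktree."""
    worktrees: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            # A blank line ends the current worktree record.
            if current:
                worktrees.append(current)
                current = {}
            continue
        # Split on the first space only, so paths containing spaces survive.
        key, _, value = line.partition(" ")
        current[key] = value
    if current:
        worktrees.append(current)
    return worktrees


# Placeholder sample; real output would come from running the git command.
sample = (
    "worktree /path/to/main/repository\n"
    "HEAD 1111111111111111111111111111111111111111\n"
    "branch refs/heads/main\n"
    "\n"
    "worktree /path/to/main/repository.worktrees/potentially/long/name\n"
    "HEAD 2222222222222222222222222222222222222222\n"
    "branch refs/heads/feature\n"
)
entries = parse_worktree_list(sample)
# The main worktree is listed first.
print(entries[0]["worktree"])
```

In practice the input could be obtained with `subprocess.run(["git", "worktree", "list", "--porcelain"], capture_output=True, text=True, check=True).stdout`.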

Other alternatives with the git CLI here: https://stackoverflow.com/a/68754000

cburroughs commented 1 month ago

> The current behaviour also makes working with git worktree tricky as each directory spins up a new daemon.

Each worktree might use a different version of Pants or otherwise different settings, and having a separate daemon per worktree allows concurrent use. Those are all reasonable use cases today that would need a path forward.