ndmitchell / shake

Shake build system
http://shakebuild.com

Automatic cleaning of obsolete build products #273

Open ggreif opened 9 years ago

ggreif commented 9 years ago

When Shake (as of v0.15) executes a recipe, it places the wanted build product in some location. Sometimes source files move, though, so certain build products become obsolete over time. It would be nice to have an automatic way of cleaning up known obsolete products (i.e. the difference between the wanted set recorded in the old database and the wanted set of the current run's database). The cleanup could be done on another thread. This would be a real win for developers of huge build trees (with hours of CPU time worth of build products), who would otherwise have to radically clean the build products to get rid of the obsolete ones. Tup (anecdotally) seems to have such a feature.

ndmitchell commented 9 years ago

Have you seen http://neilmitchell.blogspot.co.uk/2015/04/cleaning-stale-files-with-shake.html - is that roughly what you were thinking of? What else were you looking for?

Tup does have such a feature, but I consider it somewhat of a misfeature by default: often outputs that are no longer strictly live are still useful to keep around. Examples include files the build system used to generate, or files it only generates when certain options are passed. For instance, you always build x86 and sometimes build 64-bit versions; when you don't build the 64-bit versions, you don't want them deleted just because they are stale. That said, it's certainly useful to have, if carefully controlled.

ggreif commented 9 years ago

@ndmitchell Awesome, this is almost what I need. I had completely forgotten about that article! (Btw. you refer to a shakePrune function; did you mean shakeArgsPrune?)

So here is what I am after: a pruning function of type

pruneBeforeAfter :: Maybe [FilePath] -> [FilePath] -> IO ()

The first argument would be non-Nothing when shake encounters a loadable old build-state database. Just liveBefore would represent the union of build products contributing to all of the previous run's want targets. The second argument's file list would indicate the same for the current run.
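To make the intended semantics concrete, here is a minimal sketch of what such a (hypothetical) pruneBeforeAfter could do, assuming both lists hold file paths relative to the same root:

```haskell
import qualified Data.Set as Set
import Control.Monad (forM_, when)
import System.Directory (doesFileExist, removeFile)

-- Hypothetical sketch: delete every file that was live in the previous
-- run but is no longer live in the current one.
pruneBeforeAfter :: Maybe [FilePath] -> [FilePath] -> IO ()
pruneBeforeAfter Nothing _ = return ()  -- no old database, nothing to prune
pruneBeforeAfter (Just before) after = do
    let stale = Set.fromList before `Set.difference` Set.fromList after
    forM_ (Set.toList stale) $ \f -> do
        exists <- doesFileExist f
        when exists $ removeFile f
```

The Set difference makes the "old wanteds minus current wanteds" reading explicit; the doesFileExist guard just avoids failing on files already gone.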

ndmitchell commented 9 years ago

The docs for Shake need reorganising, with fundamental articles like that one promoted to the website...

I did indeed mean shakeArgsPrune - I've updated the page.

So currently --live gives you all live entries. Maybe what you want is --complete, which would give you all entries, including those that weren't built this time; after building, the first argument would be --complete and the second --live?

ggreif commented 9 years ago

Yeah, I think that is pretty much what I want. If --complete gives me all build products ever produced by shake, even in former runs, then the difference complete \\ live is the set of files that can be routinely cleaned up, without hunting through all the folders where stale build products might be retained. (In our case there might be thousands of build dirs scattered around.) Of course, if --live only gives me the files needed for building the current target (as opposed to all potentially valid build targets), then the above logic would be flawed. Time to play around with --live I guess...

ndmitchell commented 9 years ago

--live exists. --complete is easy to add. However, --live only gives you things needed for building the current target, since if you were to enter any of the old targets, they may well still build. Typically in most Shake systems a build with no arguments will do all the right wants so that everything rebuilds, and then you can treat --live as all possible live targets.
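For reference, Shake's --live flag writes the live-file list to a file (it corresponds to the shakeLiveFiles field of ShakeOptions). A rough sketch of using that list to clean up afterwards, assuming a full build has just been run with --live=live.txt and that all build outputs live in a flat out/ directory (both assumptions for brevity):

```haskell
import qualified Data.Set as Set
import Control.Monad (filterM, forM_)
import System.Directory (removeFile, listDirectory, doesFileExist)
import System.FilePath ((</>))

-- Sketch: after `shake --live=live.txt` has completed a full build,
-- delete everything under out/ that is not recorded as live.
cleanStale :: IO ()
cleanStale = do
    live <- Set.fromList . lines <$> readFile "live.txt"
    files <- map ("out" </>) <$> listDirectory "out"
    stale <- filterM doesFileExist [f | f <- files, f `Set.notMember` live]
    forM_ stale removeFile
```

As the comment above notes, this is only safe if the no-argument build really wants everything, so that live.txt covers all intentional outputs.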

pthiemer commented 9 years ago

As @ggreif mentioned tup here, I'll add my 2cents:

Viewing tup's tracking system as a misfeature most likely results from the expectation that you can do different builds with the same build graph (DAG) by exchanging external variables (e.g. CC). I would argue the better approach to using such a feature is to reflect all possible build results within the DAG.

To keep up with the 32/64-bit example: that would require providing different intermediate and final dependency files per compiler configuration. The easiest solution would be to place the build products into compiler-specific subdirectories. If one wants to build only the 32-bit version, a phony target can be used to collect only the wanted executables of the 32-bit build. Without the phony target, both versions will be built. With this setup, there is no need to reconfigure the DAG in order to switch between build targets.
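The layout described above can be sketched in Shake roughly like this (paths, the phony name "32bit", and the gcc flags are all illustrative, not from the thread):

```haskell
import Development.Shake
import Development.Shake.FilePath

-- Sketch: one output tree per architecture, plus a phony target that
-- narrows the build to the 32-bit outputs only.
main :: IO ()
main = shakeArgs shakeOptions $ do
    want ["out/x86/main.exe", "out/x64/main.exe"]  -- default: build both

    phony "32bit" $ need ["out/x86/main.exe"]      -- restrict to one arch

    "out/*/main.exe" %> \out -> do
        let arch = takeFileName (takeDirectory out)  -- "x86" or "x64"
        cs <- getDirectoryFiles "src" ["//*.c"]
        let os = ["out" </> arch </> c -<.> "o" | c <- cs]
        need os
        cmd "gcc -o" [out] os

    "out/*//*.o" %> \out -> do
        let src = "src" </> dropDirectory1 (dropDirectory1 out) -<.> "c"
        need [src]
        cmd "gcc -c" [src] "-o" [out]
```

Because both architectures appear as distinct paths in the DAG, switching between "everything" and "32-bit only" is just a matter of which target you ask for, with no graph reconfiguration.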

That said, it's definitely hard to get every possibly relevant parameter reflected within the DAG, and that's the real problem with tup.

However, if you require a feature similar to tup's, I don't think complete \\ live will do. What you would need instead is a --buildable parameter which provides a list of all files that can still be built with the current DAG (regardless of whether they were built in a previous run). Once you have such a feature, you can use complete \\ buildable to clean up everything that can no longer be built.

Mathnerd314 commented 8 years ago

So, bringing this up again: The feature of tup is that, if (for example) a C file is renamed, the old .o file will be deleted.

In Shake, we might start with the following dependency graph:

buildAndLink -> [GetDirFiles "src//*.c",[Build foo, Build bar],Link main.exe]
Build foo -> [src/foo.c, lots of headers, out/foo.o]
Build bar -> [src/bar.c, lots of headers, out/bar.o]
Link main.exe -> [GetDirFiles "out//*.o",[out/foo.o,out/bar.o]]

If we renamed bar to baz, we would want to get:

buildAndLink -> [GetDirFiles "src//*.c",[Build foo, Build baz],Link main] -- changed!
Build foo -> [src/foo.c, lots of headers, out/foo.o]
Build bar -> [src/bar.c, lots of headers, out/bar.o] -- stale!
Build baz -> [src/baz.c, lots of headers, out/baz.o]
Link main -> [GetDirFiles "out//*.o",[out/foo.o,out/baz.o],main.exe] -- tricky!

The tricky part is that Shake needs to delete bar.o before linking, so that the GetDirFiles call in Link doesn't add a stale object file. How does it know to do this? Because bar.o was generated by Build bar, and buildAndLink lost its dependency on Build bar, and that was the only reference to Build bar.

When do we know this? Well, if we are careful in matching the stored result to the current execution, we will know as soon as buildAndLink calls need [Build foo, Build baz] instead of need [Build foo, Build bar]. So, we can't prune before building, as in #432, but we can prune during building, whenever dependencies are called; and at least for this example it is just early enough that we do not have to worry about botched linking.

It's true that, as @ndmitchell said, we might not want this pruning; e.g. if we read a config file and from that determine whether to call BuildAndLink X86 or BuildAndLink X64, then modifying our config file will prune one of those. But a new needNoPrune method would suffice. Shake's existing liveness / pruning feature can coexist with this (it will be useful to delete unused config variants).

The main differences in Shake's internals are that generated files would have to be tracked better (to determine that Build bar generated out/bar.o, without rerunning it), and that Shake would have to maintain a back-reference mapping (to determine that nothing besides buildAndLink referenced Build bar). It would also be good to add a compact command that renumbers the Ids so they are sequential, now that Shake can remove keys completely.

ndmitchell commented 8 years ago

I think the compact command is a good idea anyway - I was intending to call it something like gc.

One note with the example - you have:

Link main.exe -> [GetDirFiles "out//*.o",[out/foo.o,out/bar.o]]

However, calling GetDirFiles "out//*.o" is a lint violation, and getDirectoryFiles explicitly says:

As a consequence of being tracked, if the contents change during the build (e.g. you are generating .c files in this directory) then the build will not reach a stable point, which is an error - detected by running with --lint. You should only call this function returning source files.

The recommendation above would be for the linking to list all .c files, not all .o files.
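That recommendation looks roughly like the following sketch (file locations and the compiler invocation are assumptions for illustration):

```haskell
import Development.Shake
import Development.Shake.FilePath

-- Sketch: the link rule enumerates source files (which are stable over
-- the build) and derives the object list from them, instead of globbing
-- out/ for .o files that the build itself generates.
linkRule :: Rules ()
linkRule =
    "main.exe" %> \out -> do
        cs <- getDirectoryFiles "src" ["//*.c"]
        let os = ["out" </> c -<.> "o" | c <- cs]
        need os
        cmd "gcc -o" [out] os
```

Since src/ only changes between runs, not during them, getDirectoryFiles over the sources is lint-safe, and a renamed .c file automatically drops its old .o from the link line.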

Given this particular pattern is a violation, I wonder if some combination of build, lint, delete dead files, lint would also pick up the violations?

Mathnerd314 commented 8 years ago

In my branch I changed that piece of documentation to:

As a consequence of being tracked, it is an error (detected by running with @--lint@) if the contents change during the run after this function is called (e.g. you are generating @.c@ files in this directory), since the build does not reach a stable point. You should only call this function after generating all relevant files.

I think my version is more correct, since lint will indeed not give an error if a file is added to the directory before getDirFiles is called.

From the paper, the three requirements of Shake rules are:

  1. If an IO action makes use of some IO state, then the rule must depend on that IO state.
  2. If an IO action makes use of some IO state that is modified by the build system, then the rule must depend on that IO state before performing the IO action.
  3. After some IO state becomes a dependency it must not change for the rest of the build run.

The action / state / rule distinction is a little unclear in this example, but it seems the Link rule is valid.

I wonder if some combination of build, lint, delete dead files, lint would also pick up the violations?

Deleting files after the build without an intervening rebuild would indeed make lint fail, but I am not sure why you would expect lint to succeed after such a deletion. Lint will error if any tracked file is deleted...

ndmitchell commented 8 years ago

Your version is correct - but only if people have meaningful control over when a rule is started, and thus when getDirFiles is called. I'm not sure people do have that level of control, since when checking if rebuilds are necessary we are essentially "speculatively" running rules that haven't yet been required, but that we think might be required.

Mathnerd314 commented 8 years ago

when checking if rebuilds are necessary we are essentially "speculatively" running rules that haven't yet been required, but that we think might be required.

True. But this is why Depends has two layers of list; it is only partially speculative. If we assume that building the rule will use the same dependencies in the same order for the same inputs, then the speculation is justified, as each dependency will be called regardless, up to the first changed dependency.

Do people have meaningful control over when a rule is started, and thus when getDirFiles is called?

I think they do, when they can control what rules call said rule. In particular, a pattern of the form need [a] >> need [b] should ensure that b is only called after a has been run, if no other rule calls b.
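The need [a] >> need [b] pattern mentioned above might look like this in a rule (target names are placeholders):

```haskell
import Development.Shake

-- Sketch: within a rule, sequential `need`s impose an order; the rule
-- for "b" cannot start before "a" is up to date, provided no other
-- rule needs "b" concurrently.
ordered :: Rules ()
ordered =
    "result" %> \out -> do
        need ["a"]   -- "a" is brought up to date first
        need ["b"]   -- the rule for "b" starts only after this point
        writeFile' out "done"
```

This is exactly the control over rule start times being discussed: the second need acts as a sequencing point within the rule body.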