ccache or other solutions. Also I think that this should be a command line flag instead, so that it doesn't require changes to the build definitions.
It can't (or at least should not) be a command line flag, as then hashing would apply to all rules. Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates. For that distinction, it has to be part of the build description. In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
Responding to the single-threaded argument: Yes, it adds instructions to the single-thread loop. In reality that only matters if the single thread gets more work than it can process (i.e. more rules finish than the one thread can handle (depslog + hashlog + ...)). Only then does hashing hurt. Otherwise the single-thread loop waits for jobs to finish anyway. We never saw a busy ninja with hashing, even in -j1000 experiments. (And for fast-finishing rules, hashing isn't interesting for saving time anyway.)
Also consider: hashing with murmur hash is quite fast, and even large source files take only a few milliseconds to hash. In addition, the hashing happens right after the source file (and its dependencies) have been read by the compiler, so they are usually served from the filesystem cache. As hashing happens during the build (in parallel to executing rules), the overall build time is usually not measurably affected.
Lastly, the implementation in #929 is opt-in and comes at no cost for people not using the feature (besides the if-statement).
Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates.
I would say that hashing of inputs to the linker is especially desired, as it can often result in the linking being skipped completely (e.g. for formatting changes or when comments are changed). As the compilation of object files finishes piece by piece, the hash calculation can happen while the build is running (as you pointed out).
If it's really too slow (e.g. with big static libraries), we could think about implementing hashes only for pure inputs and not intermediate files. That would solve the "switching Git branches causes full rebuilds" case at least.
In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.
In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.
But then there would need to be a way to switch from non-hashing to hashing, which means that the current state of the files would need to be hashed so that the next rebuild can use the hashes (which probably don't exist or are out of date if you didn't pass the flag before).
I would guess that hashing still uses timestamps first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. `hash_input = 1` on selected rules is very convenient. My branch still contains a bug where files are `stat`-ed too often, O(n^2) instead of O(n). The bug is related to phony edges.
One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.
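To make the phony handling concrete, here is a rough sketch in Python of the idea described above (hypothetical data structures, not ninja's internals): expand phony edges recursively down to the real files and memoize the result so each node is visited only once, which avoids the O(n^2) re-stat behaviour mentioned.

```python
# Sketch only: expand phony edges recursively to find the real files to
# hash, memoizing so each node is visited once (keeps it O(n)).
import hashlib

def collect_real_inputs(node, edges_by_output, memo=None):
    """edges_by_output maps an output path to (rule_name, [input paths])."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    edge = edges_by_output.get(node)
    if edge is None or edge[0] != "phony":
        files = {node}          # a real file (or the output of a real rule)
    else:
        files = set()
        for dep in edge[1]:     # phony edge: recurse into its inputs
            files |= collect_real_inputs(dep, edges_by_output, memo)
    memo[node] = files
    return files

def hash_inputs(node, edges_by_output):
    h = hashlib.sha256()
    for path in sorted(collect_real_inputs(node, edges_by_output)):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```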
Another thought is to make hashing a first-class feature of ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?
Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.
Meanwhile I have concocted a solution for my use case of switching branches in Chromium (which is particularly painful); the small Go program and script I use is here: https://github.com/bromite/mtool
Feel free to adapt it to your use cases; if it works for Chromium I bet it will work for smaller build projects as well (and it takes negligible time for my runs). The only downside is that if you use the script I published there as-is, it will litter `.mtool` files all around in each git repository parent directory, but that is nothing a global `gitignore` cannot cure.
Interesting to note that I use the `git ls-files --stage` output for the hashing needs; it is possible (but less efficient) to ask git to hash non-indexed files as well, if your build depends on those.
Feature-wise, one would expect that to implement the feature being discussed here, ninja could do the same thing internally (without relying on git) and with similar performance results.
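For illustration, a minimal Python sketch of the git-based approach described above (this is not mtool's actual implementation; the snapshot file name is made up): ask git for the blob hashes of indexed files and compare them against a snapshot taken at the last build.

```python
# Sketch: use the blob hashes git already maintains to detect changed files.
import json, subprocess

def indexed_hashes(repo="."):
    out = subprocess.run(["git", "ls-files", "--stage"],
                         cwd=repo, capture_output=True, text=True,
                         check=True).stdout
    hashes = {}
    for line in out.splitlines():
        meta, path = line.split("\t", 1)   # "<mode> <blob-sha> <stage>\t<path>"
        hashes[path] = meta.split()[1]
    return hashes

def changed_since_snapshot(snapshot_file="last_build_hashes.json", repo="."):
    try:
        with open(snapshot_file) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    new = indexed_hashes(repo)
    changed = [p for p, h in new.items() if old.get(p) != h]
    with open(snapshot_file, "w") as f:
        json.dump(new, f)
    return changed
```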
After a quick look, those ninja patches don't seem to hash the compiler command line in addition to the input files. Am I missing anything?
The command line is already hashed by ninja and stored in the build log.
I would guess that hashing still uses timestamps first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
That would be very wrong. Comparing hashes can be elided if timestamps don't match; the build system can assume that the dependency was modified, and if it wasn't really modified (only touched) then the build will be suboptimal but correct. However, if the timestamps match, it is still possible that the dependency was modified and its timestamp was forcibly reset (EDIT: or that the target was touched and thus made newer than its dependencies). Without double-checking by comparing hashes, this would result in an incorrect build.
I guess Ninja already assumes everything is up to date and skips all the work when timestamps match. So in your example this would be an incorrect build anyway.
I'm not aware (probably due to my ignorance) of any tool that would produce a different output and would keep the previous timestamp. Why would one do that?
IMHO, skipping the hash check when the timestamps match is a very valid optimization.
Hash check would simply avoid some "false-dirty" rebuilds, while keeping the existing semantics already provided by Ninja.
The problem is not false-dirty rebuilds, it's false-clean non-rebuilds. A git checkout touches everything it overwrites. It can make a target newer than a dependency (yes, people commit generated code for various valid reasons). A hash check would prevent a false-clean non-rebuild in this case.
@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.
Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?
@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.
How does hash depend on where the file is?
Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?
I understand that the main benefit of using timestamps is in avoiding the need to maintain a separate database that keeps track of "previously known" version signatures. If you are willing to forego that benefit, why shouldn't those signatures be hashes?
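To make that concrete, here is a small sketch of what such a signature database could look like if the signatures were content hashes instead of timestamps (the file names and helper names are made up for illustration, not ninja's build log format):

```python
# Sketch: keep a database of "previously known" per-input signatures,
# where a signature is a content hash rather than a timestamp.
import hashlib, json, os

def content_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_rebuild(output, inputs, db_path=".sigdb.json"):
    try:
        with open(db_path) as f:
            known = json.load(f)
    except FileNotFoundError:
        known = {}
    return (not os.path.exists(output)
            or any(known.get(p) != content_hash(p) for p in inputs))

def record_signatures(inputs, db_path=".sigdb.json"):
    with open(db_path, "w") as f:
        json.dump({p: content_hash(p) for p in inputs}, f)
```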
A strategy we employ in large distributed content pipelines (not using ninja, but it's all similar concepts) uses hash caches keyed off a triplet like `(timestamp; file size; inode)`.
The chance that, in a typical and non-malicious environment, a file changes without one of those three changing is incredibly low... and ninja has no business trying to protect itself against malicious manipulation IMO, since users can just manipulate build files and ninja caches at that point; if they want to break things, they will.
I will note that the timestamp used in a hash cache should be considered out of date if it's at all different, not just newer-than. Otherwise rolling back a change to a character in a source file would go undetected.
For those possible cases where the triplet isn't reliable enough to detect file mutation (I've personally never encountered such a case, but of course I haven't encountered every build environment that exists!), a `--rehash` flag could force a rehash of every file considered by the build. I've always added those kinds of things to content pipeline tools, though I've never needed them outside of testing/verification of the hash cache itself.
We've in the past considered extending those triplets with additional information, e.g. other file attributes or permissions, as well as adding options to use underlying hash metadata on some esoteric file systems or some SCM backends, but the basic triplet has proven reliable for us so far on all systems we've cared about.
For content pipelines, the hash caching is essential since the files are often very large and there are often very many of them (far more than there are source code files, generally), and content hashing scales with total file size, of course. I've never measured what the difference would be in source code builds since I've generally just relied on standard ninja/make/msbuild behavior for such things. I imagine the results would be less impactful than with content pipelines, but probably still worth it for large projects like Chromium or the typical AAA game's code base.
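A minimal sketch of the triplet-keyed hash cache described above (the class and names are made up for illustration, not ninja code): a file is rehashed only when its (mtime, size, inode) triplet differs at all from the cached one, which also covers the "rolled back" case mentioned earlier.

```python
# Sketch: rehash a file only when its (mtime, size, inode) triplet changed.
import hashlib, os

class HashCache:
    def __init__(self):
        self._cache = {}          # path -> (triplet, content hash)

    def hash_of(self, path):
        st = os.stat(path)
        triplet = (st.st_mtime_ns, st.st_size, st.st_ino)
        cached = self._cache.get(path)
        if cached and cached[0] == triplet:
            return cached[1]      # triplet unchanged: trust the cached hash
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        self._cache[path] = (triplet, digest)
        return digest
```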
@seanmiddleditch the problem I find with anything timestamp-based is that a branch switch forces unnecessary rebuilds. Just touching a file marks the target dirty. Having proper hashing, at least as an opt-in strategy, could save many people lots of unnecessary build time, IMHO.
@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.
(edit) I mean the timestamps in what I proposed are just an acceleration; status quo ninja today has the problems you describe, of course.
@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.
It's 2022. Version control systems that destroy all assumptions about timestamps are a reality now. THEY are the common case.
Just to make it very clear: I am not arguing against hash-based builds. :) I'm just offering that adding a cache in front of the hashes helps maintain very fast local builds. It's a "best of both worlds" approach: hashes when the VCS stomps all over timestamps, and timestamps when files are completely untouched. Win-win.
I agree with that! In my wacky scripts I do something similar, i.e. check the timestamp first, and if it's different check the hashes. So indeed win-win. It's pretty much always faster to hash a file unnecessarily than to run an unnecessary build, IMHO.
I recommend that anyone working on this feature take the approach of providing an escape hatch, in the form of an arbitrary command, with variable substitutions provided via the typical ninja substitution rules.
I have a project which I originally wanted to use Ninja for, but decided not to because the "is this target in need of rebuild?" logic was too hard-baked.
Specifically, video encoding of a large collection of video files (e.g. for a Plex server), where the rebuild can take >4+ hours per "build", and being able to resume from a partial build is important.
My solution was to have the "build" store information in Linux filesystem extended attributes: specifically the md5sum and sha256sum of the file, as two separate attributes, and then also the md5sum and sha256sum of the command-line arguments used for the builder (HandBrake's CLI).
Because of this, I don't care at all about timestamps of the file. They're irrelevant to me. But I do care about being able to control the logic for checking if the file needs to be rebuilt myself.
Simply being able to have ninja hash the source and destination files isn't good enough: Ninja would put those hashes in its log, and not in the files themselves, making it very easy to find myself needing to spend a month or more re-encoding files that are already encoded correctly if my log gets corrupted or lost.
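As a rough, hedged sketch of that extended-attribute approach (Linux-only; the attribute names and helpers below are made up, and it stores only a single SHA-256 per item rather than the MD5/SHA-256 pair described above): the hashes live on the output file itself, so the up-to-date decision survives losing any separate log.

```python
# Sketch: store the source hash and command-line hash as xattrs on the output.
import hashlib, os, subprocess

ATTR_CONTENT = "user.build.src_sha256"
ATTR_CMDLINE = "user.build.cmd_sha256"

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).hexdigest().encode()

def up_to_date(output, source, cmdline):
    if not os.path.exists(output):
        return False
    try:
        stored_src = os.getxattr(output, ATTR_CONTENT)
        stored_cmd = os.getxattr(output, ATTR_CMDLINE)
    except OSError:            # attributes missing: treat as dirty
        return False
    with open(source, "rb") as f:
        return (stored_src == _sha256(f.read())
                and stored_cmd == _sha256(" ".join(cmdline).encode()))

def build(output, source, cmdline):
    if up_to_date(output, source, cmdline):
        return                  # nothing to do, regardless of timestamps
    subprocess.run(cmdline, check=True)
    with open(source, "rb") as f:
        os.setxattr(output, ATTR_CONTENT, _sha256(f.read()))
    os.setxattr(output, ATTR_CMDLINE, _sha256(" ".join(cmdline).encode()))
```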
Edit:
Of course, it's certainly practical to have some behaviors built in. E.g. default the "needs rebuild?" check to timestamps, and then also supply "hash" as a baked-in strategy for the sake of build speed.
I'm having a case where the build machine periodically resets its time to 00:00, which wreaks havoc with ninja + cmake (it seems to make ninja call cmake's configure in a loop); it would be great to have support for hash-based rather than time-based checks to overcome this.
While this might fix this particular fallout of that root problem, I suspect that you really need to get that fixed because this is not the only thing on a modern system that cares about the system clock being…useful.
It really can't be fixed; the system is synced to an external PTP clock which behaves like this.
I would doubt that such behavior is useful for a "Precision Time Protocol" device. I highly recommend you solve the root cause because this issue isn't moving anywhere near fast enough for your problem (cf. basically nowhere beyond ideation so far).
this issue isn't moving anywhere near fast enough for your problem
Don't worry, according to this PTP clock, there'll be plenty of time remaining to fix it.
(sorry, couldn't help it)
I highly recommend you solve the root cause
Not much possibility of that: those are proprietary audio devices which act as a master clock, synchronizing themselves independently of real-world time; since the Linux driver is in-kernel, this sets the whole machine to the same clock.
Sad, I want to know the final solution...
If someone wants to build a system based on (potentially) inotifywait that ties in as an easy-to-use plugin and, for a given target, reduces the output of `-t commands` to just the set of commands necessary to recompile the steps that depend on files written to since the last build in the directory, I will personally mail you 5 USD as a bounty. 10 USD and a can of room-temp/warm Chicago beer if you make it so that it "just works", where the next run of a target with the plugin enabled only runs the commands related to the changes.
No idea if this solves all the hashing nonsense or whatever Bazel is calling it nowadays (hermetic ... synonym "esoteric" ... at least they are up front about it (-; ). Anyway, I think inotifywait is a good approach, at least in some simple cases. I'll try it once I get some time; if it gets to that point I'll post the results here and the bounty will be gone.
I'm also interested in this.
As noted in other issues, using timestamps to determine whether to rebuild something can be problematic. While timestamps are convenient and relatively fast, it is often desirable to key build decisions on some characteristic intrinsic to the file itself, such as a hash of its contents.
I use git a lot and it is annoying that simply changing branches triggers rebuilds. Ideally, I would be able to switch from my current work branch to some other branch, not touch any files and then switch back to the original branch and not have to rebuild anything at all. As far as I know that is not possible if the build system uses timestamps. Using file hashes would solve this particular issue.
I understand that using file hashes or other such intrinsic file characteristics could slow down Ninja, so using them should be an option.