ccache or other solutions. Also I think that this should be a command line flag instead, so that it doesn't require changes to the build definitions.
It can't (or at least should not) be a command line flag, as then hashing would apply to all rules. Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates. For that distinction, it has to be part of the build description. In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
Responding to the single-threaded argument: Yes, it adds instructions to the single-thread loop. In reality that only matters if the single thread gets more work than it can process (i.e. more rules finish than the one thread can handle (depslog + hashlog + ...)). Only then does hashing hurt. Otherwise the single-thread loop waits for jobs to finish anyway. We never saw a busy ninja with hashing, even in -j1000 experiments. (And for fast-finishing rules, hashing isn't interesting for saving time anyway.)
Also consider: hashing with murmur hash is quite fast, and even large source files take only a few milliseconds to hash. In addition, the hashing happens right after the source file (and its dependencies) have been read by the compiler, so they are usually served from the filesystem cache. As hashing happens during the build (in parallel to executing rules), the overall build time is usually not measurably affected.
Lastly, the implementation in #929 is opt-in and comes at no cost for people not using the feature (besides the if-statement).
Hashing e.g. all inputs of link rules is expensive and therefore not desired, while source files and their known dependencies are good candidates.
I would say that hashing of inputs to the linker is especially desired, as it can often result in the linking being skipped completely (e.g. for formatting changes or when comments are changed). As the compilation of object files finishes piece by piece, the hash calculation can happen while the build is running (as you pointed out).
If it's really too slow (e.g. with big static libraries), we could think about implementing hashes only for pure inputs and not intermediate files. That would solve the "switching Git branches causes full rebuilds" case at least.
In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.
In addition, to make use of the feature you should use the flag consistently all the time, not just sometimes.
I would say that is an advantage: If I'm working on a single branch and want to iterate fast, I would not use hashes. If I'm comparing different feature branches, I would use hashes.
But then there would need to be a way to switch from non-hashing to hashing, which means that the current state of the files would need to be hashed so that the next rebuild can use the hashes (which probably don't exist or are out of date if you didn't pass the flag before).
I would guess that hashing still uses timestamps first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
I've also used hashing at work, with great success. It is based on #929, but with a bunch of patches, as can be seen in https://github.com/moroten/ninja/commits/hashed. `hash_input = 1` on selected rules is very convenient. My branch still contains a bug where files are `stat`-ed too often, O(n^2) instead of O(n). The bug is related to phony edges.
One problem is how to handle phony edges. I use phony edges to group e.g. header files. Therefore, my implementation iterates recursively through phony edges. It also relates to the bug in #1021.
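To make the phony handling concrete, here is a rough sketch in Python of the idea described above (hypothetical data structures, not ninja's internals): expand phony edges recursively down to the real files and memoize the result so each node is visited only once, which avoids the O(n^2) re-stat behaviour mentioned.

```python
# Sketch only: expand phony edges recursively to find the real files to
# hash, memoizing so each node is visited once (keeps it O(n)).
import hashlib

def collect_real_inputs(node, edges_by_output, memo=None):
    """edges_by_output maps an output path to (rule_name, [input paths])."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    edge = edges_by_output.get(node)
    if edge is None or edge[0] != "phony":
        files = {node}          # a real file (or the output of a real rule)
    else:
        files = set()
        for dep in edge[1]:     # phony edge: recurse into its inputs
            files |= collect_real_inputs(dep, edges_by_output, memo)
    memo[node] = files
    return files

def hash_inputs(node, edges_by_output):
    h = hashlib.sha256()
    for path in sorted(collect_real_inputs(node, edges_by_output)):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()
```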
Another thought is to make hashing a first-class feature of ninja, i.e. move the hashes into the build log. Using SHA256 would be one step towards adding support for Bazel's remote execution API. A C++ implementation can be found at https://gitlab.com/bloomberg/recc/. Wouldn't that be pretty nice?
Unfortunately, sorting out the semantics for phony edges will probably break backward compatibility.
Meanwhile I have concocted a solution for my use case of switching branches in Chromium (which is particularly painful); the small Go program and script I use is here: https://github.com/bromite/mtool
Feel free to adapt it to your use cases; if it works for Chromium I bet it will work for smaller build projects as well (and it takes negligible time for my runs). The only downside is that if you use the script I published there as-is, it will litter `.mtool` files all around in each git repository parent directory, but that is nothing a global `gitignore` cannot cure.
Interesting to note that I use the `git ls-files --stage` output for the hashing needs; it is possible (but less efficient) to ask git to hash non-indexed files as well, if your build depends on those.
Feature-wise, one would expect that to implement the feature being discussed here, ninja could do the same thing internally (without relying on git) and with similar performance results.
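For illustration, a minimal Python sketch of the git-based approach described above (this is not mtool's actual implementation; the snapshot file name is made up): ask git for the blob hashes of indexed files and compare them against a snapshot taken at the last build.

```python
# Sketch: use the blob hashes git already maintains to detect changed files.
import json, subprocess

def indexed_hashes(repo="."):
    out = subprocess.run(["git", "ls-files", "--stage"],
                         cwd=repo, capture_output=True, text=True,
                         check=True).stdout
    hashes = {}
    for line in out.splitlines():
        meta, path = line.split("\t", 1)   # "<mode> <blob-sha> <stage>\t<path>"
        hashes[path] = meta.split()[1]
    return hashes

def changed_since_snapshot(snapshot_file="last_build_hashes.json", repo="."):
    try:
        with open(snapshot_file) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    new = indexed_hashes(repo)
    changed = [p for p, h in new.items() if old.get(p) != h]
    with open(snapshot_file, "w") as f:
        json.dump(new, f)
    return changed
```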
After a quick look, those ninja patches don't seem to hash the compiler command line in addition to the input files. Am I missing anything?
The command line is already hashed by ninja and stored in the build log.
I would guess that hashing still uses timestamps first, so that if the timestamps match, there's no need to compare the hashes. It would mean that the first few builds might unnecessarily recompile some files, but that shouldn't happen often (most of the time the timestamp heuristic is right, after all).
That would be very wrong. Comparing hashes can be elided if timestamps don't match; the build system can assume that the dependency was modified, and if it wasn't really modified (only touched) then the build will be suboptimal but correct. However, if the timestamps match, it is still possible that the dependency was modified and its timestamp was forcibly reset (EDIT: or that the target was touched and thus made newer than its dependencies). Without double-checking by comparing hashes, this would result in an incorrect build.
I guess Ninja already assumes everything is up to date and skips all the work when timestamps match. So in your example this would be an incorrect build anyway.
I'm not aware (probably due to my ignorance) of any tool that would produce a different output and would keep the previous timestamp. Why would one do that?
IMHO, skipping the hash check when the timestamps match is a very valid optimization.
Hash check would simply avoid some "false-dirty" rebuilds, while keeping the existing semantics already provided by Ninja.
The problem is not false-dirty rebuilds, it's false-clean non-rebuilds. A git checkout touches everything it overwrites. It can make a target newer than a dependency (yes, people commit generated code for various valid reasons). A hash check would prevent a false-clean non-rebuild in this case.
@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.
Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?
@rulatir I think I mostly understand what you're saying :) I guess that's one of the reasons why Bazel and other build systems based on hash checks are really against in-tree target outputs.
How does hash depend on where the file is?
Although, wouldn't this problem be solved if the build system checked whether targets are newer than the previously known time, and rebuilt if necessary?
I understand that the main benefit of using timestamps is in avoiding the need to maintain a separate database that keeps track of "previously known" version signatures. If you are willing to forego that benefit, why shouldn't those signatures be hashes?
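To make that concrete, here is a small sketch of what such a signature database could look like if the signatures were content hashes instead of timestamps (the file names and helper names are made up for illustration, not ninja's build log format):

```python
# Sketch: keep a database of "previously known" per-input signatures,
# where a signature is a content hash rather than a timestamp.
import hashlib, json, os

def content_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_rebuild(output, inputs, db_path=".sigdb.json"):
    try:
        with open(db_path) as f:
            known = json.load(f)
    except FileNotFoundError:
        known = {}
    return (not os.path.exists(output)
            or any(known.get(p) != content_hash(p) for p in inputs))

def record_signatures(inputs, db_path=".sigdb.json"):
    with open(db_path, "w") as f:
        json.dump({p: content_hash(p) for p in inputs}, f)
```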
A strategy we employ in large distributed content pipelines (not using ninja, but it's all similar concepts) uses hash caches keyed off a triplet like `(timestamp; file size; inode)`.
The chance that, in a typical and non-malicious environment, a file changes without one of those three changing is incredibly low... and ninja has no business trying to protect itself against malicious manipulation IMO, since users can just manipulate build files and ninja caches at that point; if they want to break things, they will.
I will note that the timestamp used in a hash cache should be considered out of date if it's at all different, not just newer-than. Otherwise rolling back a change to a character in a source file would go undetected.
For those possible cases where the triplet isn't reliable enough to detect file mutation (I've personally never encountered such a case, but of course I haven't encountered every build environment that exists!), a `--rehash` flag could force a rehash of every file considered by the build. I've always added those kinds of things to content pipeline tools, though I've never needed them outside of testing/verification of the hash cache itself.
We've in the past considered extending those triplets with additional information, e.g. other file attributes or permissions, as well as adding options to use underlying hash metadata on some esoteric file systems or some SCM backends, but the basic triplet has proven reliable for us so far on all systems we've cared about.
For content pipelines, the hash caching is essential since the files are often very large and there are often very many of them (far more than there are source code files, generally), and content hashing scales with total file size, of course. I've never measured what the difference would be in source code builds since I've generally just relied on standard ninja/make/msbuild behavior for such things. I imagine the results would be less impactful than with content pipelines, but probably still worth it for large projects like Chromium or the typical AAA game's code base.
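A minimal sketch of the triplet-keyed hash cache described above (the class and names are made up for illustration, not ninja code): a file is rehashed only when its (mtime, size, inode) triplet differs at all from the cached one, which also covers the "rolled back" case mentioned earlier.

```python
# Sketch: rehash a file only when its (mtime, size, inode) triplet changed.
import hashlib, os

class HashCache:
    def __init__(self):
        self._cache = {}          # path -> (triplet, content hash)

    def hash_of(self, path):
        st = os.stat(path)
        triplet = (st.st_mtime_ns, st.st_size, st.st_ino)
        cached = self._cache.get(path)
        if cached and cached[0] == triplet:
            return cached[1]      # triplet unchanged: trust the cached hash
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        self._cache[path] = (triplet, digest)
        return digest
```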
@seanmiddleditch the problem I find with anything timestamp-based is that a branch switch forces unnecessary rebuilds. Just touching a file marks the target dirty. Having proper hashing, at least as an opt-in strategy, could save many people lots of unnecessary build time, IMHO.
@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.
(edit) I mean the timestamps in what I proposed are just an acceleration; status quo ninja today has the problems you describe, of course.
@burdiyan that's the whole point of the hashes. The timestamps are just an acceleration for the common case so you aren't needlessly rehashing all your files when nothing's changed on incremental builds.
It's 2022. Version control systems that destroy all assumptions about timestamps are a reality now. THEY are the common case.
Just to make it very clear: I am not arguing against hash-based builds. :) I'm just offering that adding a cache in front of the hashes helps maintain very fast local builds. It's a "best of both worlds" approach: hashes when the VCS stomps all over timestamps, and timestamps when files are completely untouched. Win-win.
I agree with that! In my wacky scripts I do something similar, i.e. check the timestamp first, and if it's different check the hashes. So indeed win-win. It's pretty much always faster to hash a file unnecessarily than to run an unnecessary build, IMHO.
I recommend that anyone working on this feature take the approach of providing an escape hatch, in the form of an arbitrary command, with variable substitutions provided via the typical ninja substitution rules.
I have a project which I originally wanted to use Ninja for, but decided not to because the "is this target in need of rebuild?" logic was too hard-baked.
Specifically, video encoding of a large collection of video files (e.g. for a Plex server), where the rebuild can take >4+ hours per "build", and being able to resume from a partial build is important.
My solution was to have the "build" store information in Linux filesystem extended attributes: specifically the md5sum and sha256sum of the file, as two separate attributes, and then also the md5sum and sha256sum of the command-line arguments used for the builder (HandBrake's CLI).
Because of this, I don't care at all about timestamps of the file. They're irrelevant to me. But I do care about being able to control the logic for checking if the file needs to be rebuilt myself.
Simply being able to have ninja hash the source and destination files isn't good enough: Ninja would put those hashes in its log, and not in the files themselves, making it very easy to find myself needing to spend a month or more re-encoding files that are already encoded correctly if my log gets corrupted or lost.
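As a rough, hedged sketch of that extended-attribute approach (Linux-only; the attribute names and helpers below are made up, and it stores only a single SHA-256 per item rather than the MD5/SHA-256 pair described above): the hashes live on the output file itself, so the up-to-date decision survives losing any separate log.

```python
# Sketch: store the source hash and command-line hash as xattrs on the output.
import hashlib, os, subprocess

ATTR_CONTENT = "user.build.src_sha256"
ATTR_CMDLINE = "user.build.cmd_sha256"

def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).hexdigest().encode()

def up_to_date(output, source, cmdline):
    if not os.path.exists(output):
        return False
    try:
        stored_src = os.getxattr(output, ATTR_CONTENT)
        stored_cmd = os.getxattr(output, ATTR_CMDLINE)
    except OSError:            # attributes missing: treat as dirty
        return False
    with open(source, "rb") as f:
        return (stored_src == _sha256(f.read())
                and stored_cmd == _sha256(" ".join(cmdline).encode()))

def build(output, source, cmdline):
    if up_to_date(output, source, cmdline):
        return                  # nothing to do, regardless of timestamps
    subprocess.run(cmdline, check=True)
    with open(source, "rb") as f:
        os.setxattr(output, ATTR_CONTENT, _sha256(f.read()))
    os.setxattr(output, ATTR_CMDLINE, _sha256(" ".join(cmdline).encode()))
```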
Edit:
Of course, it's certainly practical to have some behaviors built in. E.g. default the "needs rebuild?" check to timestamps, and then also supply "hash" as a baked-in strategy for the sake of build speed.
I'm having a case where the build machine periodically resets its time to 00:00, which wreaks havoc with ninja + cmake (it seems to make ninja call cmake's configure in a loop); it would be great to have support for hash-based rather than time-based checks to overcome this.
While this might fix this particular fallout of that root problem, I suspect that you really need to get that fixed because this is not the only thing on a modern system that cares about the system clock being…useful.
It really can't be fixed; the system is synced to an external PTP clock which behaves like this.
I would doubt that such behavior is useful for a "Precision Time Protocol" device. I highly recommend you solve the root cause because this issue isn't moving anywhere near fast enough for your problem (cf. basically nowhere beyond ideation so far).
this issue isn't moving anywhere near fast enough for your problem
Don't worry, according to this PTP clock, there'll be plenty of time remaining to fix it.
(sorry, couldn't help it)
I highly recommend you solve the root cause
Not much possibility of that: those are proprietary audio devices which act as a master clock, synchronizing themselves independently of real-world time; since the Linux driver is in-kernel, this sets the whole machine to the same clock.
Sad, I want to know the final solution...
If someone wants to build a system based on (potentially) inotifywait that ties in as an easy-to-use plugin and, for a given target, reduces the output of `-t commands` to just the set of commands necessary to recompile the steps that depend on files written to since the last build in the directory, I will personally mail you 5 USD as a bounty. 10 USD and a can of room-temp/warm Chicago beer if you make it so that it "just works", where the next run of a target with the plugin enabled only runs the commands related to the changes.
No idea if this solves all the hashing nonsense or whatever Bazel is calling it nowadays (hermetic ... synonym "esoteric" ... at least they are up front about it (-; ). Anyway, I think inotifywait is a good approach, at least in some simple cases. I'll try it once I get some time; if it gets to that point I'll post the results here and the bounty will be gone.
I'm also interested in this.
As noted in other issues, using timestamps to determine whether to rebuild something can be problematic. While timestamps are convenient and relatively fast, it is often desirable to key build decisions on some characteristic intrinsic to the file itself, such as a hash of its contents.
I use git a lot and it is annoying that simply changing branches triggers rebuilds. Ideally, I would be able to switch from my current work branch to some other branch, not touch any files and then switch back to the original branch and not have to rebuild anything at all. As far as I know that is not possible if the build system uses timestamps. Using file hashes would solve this particular issue.
I understand that using file hashes or other such intrinsic file characteristics could slow down Ninja, so using them should be an option.