I hope this can also help make progress with this issue: https://github.com/snowleopard/hadrian/issues/217.
Ideally, I need a way to implement something like this:
-- Remember a value associated with a key and force a rebuild if the value has changed.
remember :: Int -> Int -> Action ()
remember key value = ...

checkArgsHash :: Target -> Action ()
checkArgsHash target = do
    args <- interpret target getArgs
    remember (hash target) (hash args)
At the moment we implement checkArgsHash roughly as follows:
checkArgsHash :: Target -> Action ()
checkArgsHash target = do
    _ <- askOracle $ ArgsHashKey target :: Action Int
    return ()
-- Oracle for storing per-target argument list hashes
argsHashOracle :: Rules ()
argsHashOracle = void $
    addOracle $ \(ArgsHashKey target) -> hash <$> interpret target getArgs
Storing all complete Target's in Shake's database is very inefficient, yet we have to do this currently.
Questions and comments:
- What does the Action Bool dependencies argument actually do? Why is it not just a Bool?
- Could you expose access to the changed / built values, so that rules could determine which dependencies changed?
- changedDependencies vs changedStore vs changedValue? Why not a single changed Bool? It seems like overoptimization if it's just to avoid a database write.
- Why a (new)type Value = Value BS.ByteString, instead of passing around BS.ByteString directly?
- I guess the Typeable value constraint is because Shake still caches that while running. Why not use Dynamic or Any + unsafeCoerce?
- shakeAssume = AssumeClean for that, but I'm not sure the semantics match completely.
- Could Shake normalise paths so that different spellings of the same file (fooxxx~1.dos and friends) are treated as one? Is that possible?

I'll also note that this is really close to what I have in my devel branch; I just noticed that my analyse function can go away and be part of rule execution.
Thanks for the comments!
- @snowleopard - with this approach you can eliminate the value and just store a hash. However, I don't see how you can eliminate the Target unless you can enumerate all Target values in advance, hash them, form a Int -> Target map, and then deserialize that way.
- Action Bool: the idea is that the output file might have disappeared and thus you never bother checking if your children are up to date or not, you rebuild anyway. Essentially it's giving you the ability to not bother checking your children.
- You could have Action (Maybe Dynamic) instead of Action Bool to indicate what child changed. I don't see what benefits it would give.
- Shake gets the TypeRep value in apply and then needs to look up the right builtin rule; that requires the Typeable so I can put the BuiltinRule things in a map.
- need will call normalise first, but that won't sort out short names.
- If you look at UserRule and matchUserRule, then it's just a predicate, but it opens the possibility of doing fancy optimisations for matching in the future.

Great that we seem to be converging on an answer!
the ability to not bother checking your children.
I don't think this is actually an ability. Suppose A depends on B unconditionally. If B changes, and A does not check its dependencies, then A is in an inconsistent state. And if A changes, you do not get any optimization by skipping the check for B, because A will call need B regardless, to add it as a dependency.
For conditional dependencies, all that matters is that Shake checks them in the same order as the rule did; i.e. need a >> need b means that Shake checks a and all its dependencies before checking b, because changing a dependency can only change later dependencies.
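To make the ordering point concrete, here is a minimal sketch of a rule with a conditional dependency (the file names and the rule itself are made up for illustration, using the current Shake API):

import Development.Shake
import Control.Monad (when)
import Data.List (isInfixOf)

-- Hypothetical rule: "out" always needs "config", and only needs "extra"
-- if the config asks for it. Shake has to re-check "config" (and its
-- dependencies) before re-checking "extra", mirroring the order of the
-- need calls, because a change to "config" can change whether "extra"
-- is needed at all.
outRule :: Rules ()
outRule = "out" %> \out -> do
    need ["config"]
    cfg <- liftIO $ readFile "config"
    when ("extra" `isInfixOf` cfg) $ need ["extra"]
    writeFile' out cfg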
You mentioned changing Shake to do the storedValue check after the dependencies (#427); using Bool instead of Action Bool is similar.
Could you expose access to the changed / built values
I was thinking of a version of apply that returned the changed / built values. It's not particularly difficult to implement, even in the existing framework.
Shake gets the TypeRep value in apply and then needs to look up the right builtin rule,
That explains Typeable key, but not Typeable value, unless Shake has started doing disambiguation by value type (currently it only uses the key). But that wouldn't make much sense since the previous value is a ByteString.
Oh, and there's no need for ShakeOptions as an argument, it's in the Action monad so getShakeOptions suffices. Similarly, the user rules don't need to be explicitly passed if they're stored in Global.
changedDependencies says whether to use the dependencies from this run, or the previous run
But this only makes sense if there is a previous run. How about the dependencies of the previous run are passed in, and the rule can decide to either pass them through or create a new list?
@snowleopard - with this approach you can eliminate the value and just store a hash. However, I don't see how you can eliminate the Target unless you can enumerate all Target values in advance, hash them, form a Int -> Target map, and then deserialize that way.
@ndmitchell But I think I've shown the solution already, or is it wrong?
All I need is this function to be implementable using Shake's rules:
-- Remember a value associated with a key and force a rebuild if the value has changed.
remember :: Int -> Int -> Action ()
remember key value = ...
With this function we can implement checkArgsHash without storing Target's in Shake's database or having a global Int -> Target map as follows:
checkArgsHash :: Target -> Action ()
checkArgsHash target = do
    args <- interpret target getArgs
    remember (hash target) (hash args)
There are several cases to consider:
- First run: remember forces a rebuild, because the value associated with key hash target has changed (there was no value, but now we have hash args).
- Nothing changed: remember does not force a rebuild, because the key hash target has the same associated value hash args stored from last run.
- Changed args: remember forces a rebuild, because the value associated with key hash target has changed (hash args has changed because args has changed).
- Changed target: remember forces a rebuild, because the value associated with key hash target has changed (there was no value, but now we have hash args). This is similar to the first run case, but here we also need to forget/garbage collect the old key corresponding to the old target. But I think Shake does this automatically anyway.

So, if I haven't made a mistake above, we don't need to store Target's (only their hashes) in the Shake database and we also don't need any global Int -> Target dictionary.
P.S.: I am pretty sure remember can already be implemented using stamp files, but that is a lot of overhead, of course. Ideally, I'd like to store this information in Shake's database directly (suitably wrapping Int key/value pairs into newtypes).
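For concreteness, a minimal stamp-file sketch of remember might look like the following (the stamp directory, the helper name, and the use of show for the values are all made up for illustration):

import Development.Shake
import Development.Shake.FilePath
import Control.Monad (when)
import qualified System.Directory as IO

-- Hypothetical stamp-file version of 'remember': each key gets its own stamp
-- file, which is rewritten only when the value changes, so need'ing it makes
-- the calling rule rebuild exactly when the value differs from the last run.
rememberStamp :: Int -> Int -> Action ()
rememberStamp key value = do
    let stamp = "_shake/stamps" </> show key <.> "txt"
    liftIO $ do
        IO.createDirectoryIfMissing True (takeDirectory stamp)
        exists <- IO.doesFileExist stamp
        old <- if exists
            then do s <- readFile stamp; length s `seq` return (Just s)
            else return Nothing
        when (old /= Just (show value)) $ writeFile stamp (show value)
    need [stamp]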
My suggestions:
- changedValue is kind of confusing; it may be useful to set it to True even though the actual resultValue hasn't changed; you see this in your sample File implementation where resultValue is always ().

Now, let's see if this solves #388. I'll consider rules which map (FilePath, FilePath) to file outputs, BUT if a flag in shakeOptions is set then we consider only the first FilePath.
filesToo = addBuiltinRule $ \opts ask ((x1, x2) :: (File, File)) old check -> do
    let buildSecond = shakeBuildSecond opts -- the flag which toggles if we consider x2
    -- this part looks annoying to write, would like some helper functions...
    rebuild <- case old of
        Nothing -> return True
        Just bs -> do
            let (b1, b2) = decode bs
            r1 <- getFileInfo opts x1
            if r1 /= b1
                then return True
                else do
                    if buildSecond
                        then fmap (/=b2) (getFileInfo opts x2)
                        else return False
    if not rebuild
        then BuiltinInfo
            {changedDependencies = False
            ,changedStore = False
            ,resultStore = fromJust old
            ,changedValue = False
            ,resultValue = decode $ fromJust old}
        else -- blah blah blah, I'm convinced
OK, so I am convinced this does solve the problem. Does look annoying to write these functions though.
Replying to @Mathnerd314
"the ability to not bother checking your children." And if A changes, you do not get any optimization by skipping the check for B, because A will call need B regardless.
I agree this is the same as #427. If we decide in #427 that it's reasonable to check stored value after, then this should be a Bool. Let's discuss that particular issue there.
I was thinking of a version of apply that returned the changed / built value.
What do you imagine the use case for that is? Certainly possible to do, I just can't think why it's useful - usually more fine grained rules solves the problem in a simpler way.
That explains Typeable key, but not Typeable value
The person calling apply implicitly supplied a key and value type. The Typeable value is required to check they match and thus that the coercion at the other end will work. I could use Any and unsafeCoerce, but that means an ill-typed apply would segfault.
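As a rough picture of why both constraints show up (a hypothetical simplification, not Shake's real internals): the builtin rules live in a map keyed by the key's TypeRep, and results come back dynamically typed, so the value's TypeRep is what lets the caller check the coercion.

{-# LANGUAGE ScopedTypeVariables #-}
import Data.Dynamic
import Data.Proxy
import Data.Typeable
import qualified Data.Map.Strict as Map

-- Hypothetical simplification: a builtin rule takes a dynamically typed key
-- and produces a dynamically typed value.
newtype BuiltinRuleSketch = BuiltinRuleSketch {runRule :: Dynamic -> IO Dynamic}

-- Typeable key is needed to find the right rule in the map; Typeable value is
-- needed to check the result before handing it back, so an ill-typed apply
-- fails with an error rather than an unsafeCoerce segfault.
applySketch :: forall key value . (Typeable key, Typeable value)
            => Map.Map TypeRep BuiltinRuleSketch -> key -> IO value
applySketch rules key = do
    rule <- maybe (fail "no builtin rule for this key type") return $
        Map.lookup (typeOf key) rules
    res <- runRule rule (toDyn key)
    case fromDynamic res of
        Just v  -> return v
        Nothing -> fail $ "ill-typed apply: rule produced " ++ show (dynTypeRep res)
                       ++ ", caller expected " ++ show (typeRep (Proxy :: Proxy value))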
There's no need for ShakeOptions as an argument, it's in the Action monad
The builtin rule will be pre-applied to the first two arguments, since given the Skip/Ignore/Rebuild flags in the options, it may be able to precompute some faster data structure to look up those values. Similarly, given the matches it may be able to precompute something. As a result, it's not able to run Action stuff at that point.
How about the dependencies of the previous run are passed in, and the rule can decide to either pass them through or create a new list?
That's certainly a design decision I considered, but requires a value-level representation of dependencies, which otherwise isn't necessary.
@snowleopard I still don't see how this works. You need to define an operation that given a key says if the value is still valid or not. From the hash of a target you can't compute the expected result, so you have to store the whole target.
I'd love to see some kind of prototype using stamp files. If you can do it with stamp files, I'm sure this interface will let you do it more efficiently without.
@ezyang:
Make a data structure for the input arguments.
Probably not a bad idea. Would also make it explicit that there are two staged input arguments, answering your point from 3.
I really don't understand what is going on here: https://github.com/ndmitchell/shake/blob/master/src/Development/Shake/Experiment/Rules.hs#L161
Phony rules are encoded as 0 byte storage, and are always considered to have rebuilt. It's the mirror of https://github.com/ndmitchell/shake/blob/master/src/Development/Shake/Experiment/Rules.hs#L180
Instead of having a changed Bool and then the value of type a, can't we define something of type Changed a? Then if it hasn't changed, we don't even have to provide a value, thus preventing the class of bugs where we claimed something didn't change, but we returned something different.
Certainly this opens up a class of bugs, but the idea is that this is the most efficient interface possible, and then sugaring up to some kind of Changed value is quite reasonable.
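A sketch of the kind of sugar that could sit on top (the names are illustrative, not part of the proposed API):

-- If a rule says Unchanged it simply cannot hand back a different value,
-- eliminating the "claimed unchanged but returned something new" bug class.
data Changed a = Unchanged | Changed a

-- Lowering the sugared result onto the raw (changedValue, resultValue) pair,
-- given the value from the previous run to reuse when nothing changed.
unChanged :: a -> Changed a -> (Bool, a)
unChanged old Unchanged     = (False, old)
unChanged _   (Changed new) = (True, new)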
For changedStore the motivation was that if we allow the user to pass back Unchanged we are forced to keep the value storage around until the end of the rule. With the current formulation we can throw it away, and the rule only keeps it if it needs to. Perhaps a micro-optimisation.
For changedValue the idea is to encapsulate the equalValue method of the Rules class and all the dodgy Eq instances for things like AlwaysRerunA. You produce a value, and separately you say if you should consider the children to be dirty. I guess changedValue should really be childrenDirty or similar - which (at a first approximation for some rules) happens to be if the value has changed.
OK, so I am convinced this does solve the problem. Does look annoying to write these functions though.
Yep, this is the low-level substrate which hopefully we can sugar up to something easy. The fact it's possible is the thing to dwell on :smile:.
I've implemented most of this in my branch; it uses Binary / ByteString.Lazy instead of a new Encoding class, and there are other differences, but the general sense is there. My BuiltinRule type is here, and Rules.File is here. It's only slightly longer than the original, primarily due to the addUserRule calls. And Oracle is gone! (it just uses builtin rules now)
I think I get what @snowleopard 's asking; it's possible in current Shake with a new cache primitive that uses a bimap. You would have
-- Core.hs:
newBidirectionalCacheIO :: (k -> Action v) -> IO (k -> Action v, v -> IO (Maybe k))

-- Rules/TargetChecker.hs
module TargetChecker(makeTargetChecker) where

newtype TargetHash = TargetHash ...

hash :: Target -> TargetHash
hashResult :: whatever -> HashResult

makeTargetChecker :: Rules (Target -> Action ())
makeTargetChecker = do
    (targetToHash, hashToTarget) <- liftIO $ newBidirectionalCacheIO hash
    askOracle <- addOracle $ \(TargetHash thash) -> do
        Just target <- liftIO $ hashToTarget thash -- always successful since it can only be called from below
        hashResult <$> interpret target getArgs
    return (\target -> do
        thash <- targetToHash target
        _ <- askOracle thash
        return ())
So really it belongs in a separate issue. Although, with the new rule API, having a makeInterpreter :: (Typeable e, Hashable e) => Rules (Target -> Expr e -> Action e) that similarly tracks dependencies would be possible (using Key-in-Key and the resultStore / resultValue distinction).
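For what it's worth, a rough sketch of what newBidirectionalCacheIO could look like (my own sketch, not from the branch; it assumes Hashable keys and values and ignores the race where two threads compute the same key at once):

import Development.Shake (Action, liftIO)
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as Map
import Data.IORef

-- Memoise a function and keep the reverse mapping, so a value can later be
-- traced back to the key it was computed from.
newBidirectionalCacheIO
    :: (Eq k, Hashable k, Eq v, Hashable v)
    => (k -> Action v) -> IO (k -> Action v, v -> IO (Maybe k))
newBidirectionalCacheIO f = do
    ref <- newIORef (Map.empty, Map.empty)  -- (forward, reverse) maps
    let forward k = do
            (fwd, _) <- liftIO $ readIORef ref
            case Map.lookup k fwd of
                Just v -> return v
                Nothing -> do
                    v <- f k
                    liftIO $ atomicModifyIORef' ref $ \(fw, bw) ->
                        ((Map.insert k v fw, Map.insert v k bw), ())
                    return v
        backward v = Map.lookup v . snd <$> readIORef ref
    return (forward, backward)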
a version of apply that returned the changed / built value.
What do you imagine the use case for that is?
Pluto uses it in their C compile rule. Don't look too closely at it because I think it's buggy, but the general idea is sound. The rule is formulated as many-to-many (a single rule that compiles all files), but it only recompiles the C files that changed. So a directory that has a.c b.c c.c might run a command like gcc -cvH a.c c.c, if only a and c have changed. In general, it should allow implementing complicated dependency logic like the second graph in https://github.com/ndmitchell/shake/issues/382#issuecomment-182145623.
Awesome @Mathnerd314! I'll take a look and see exactly what you've done. Any reason for going via Binary instead of Encoder? My benchmarks have shown Binary to be a bottleneck, hence the desire to switch to Encoder and avoid the performance penalty.
Any thoughts of anything that did/did not work particularly well? Any hidden traps we hadn't considered?
I used Binary because it was already there, whereas using Encoder would require writing a new class and changing more files. And Binary is advertised as 'efficient'; in theory it should optimize to something close to your pointer arithmetic. In practice, well... it's probably worth benchmarking again, because the function layout is significantly different and so GHC may find better optimizations or lose old ones. But, I expect that Shake will need to be significantly rearchitected again to support file-system watching, so I don't think it's needed yet.
In terms of changes, it was relatively simple, although tedious; your proposal is really a series of changes (API's become monomorphic, Rule class goes away, etc.) so they decomposed well. I still haven't run the testsuite, so there could be some weird serialization or type coercion failures. But most of the changes were just moving around code, rather than a particularly grand redesign (user rule matching being a mild exception). I didn't implement the whole API though; it's still tweaks here and there.
Benchmarking shows Binary to be a bottleneck. My intention is to present a Binary interface to the outside world (probably), but use Encoder and implement Encoder using Binary for user types. Binary is nowhere near my bytestring code, partly because it can't be - it starts off with lazy bytestrings, has to be able to consume a prefix (my encoders rely on the length) etc. See https://www.fpcomplete.com/blog/2016/03/efficient-binary-serialization for benchmarks - although I came to the conclusion Encoder would be necessary before that blog post. That said, it's certainly reasonable to get it all working on top of Binary first, then as a second step move to Encoder - no reason to break everything in one go.
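To illustrate the sort of split being described (a guess at the shape, not Shake's actual class): present Binary to the outside, but route the internals through a strict-ByteString Encoder, with user types falling back to their Binary instance.

{-# LANGUAGE DefaultSignatures #-}
import Data.Binary (Binary, encode, decode)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as LBS

-- A guess at the shape of the Encoder class: strict ByteStrings in and out,
-- so built-in key/value types can use direct pointer-level encoders, while
-- user types go via Binary through the defaults.
class Encoder a where
    encodeBS :: a -> BS.ByteString
    decodeBS :: BS.ByteString -> a

    default encodeBS :: Binary a => a -> BS.ByteString
    encodeBS = LBS.toStrict . encode
    default decodeBS :: Binary a => BS.ByteString -> a
    decodeBS = decode . LBS.fromStrict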
@Mathnerd314 - how do you suggest I proceed? Are the patches in your branch clean enough to move over wholesale? Should I diff our trees and move diffs across? Should I reimplement but using yours as a guide?
After looking at Binary more, I agree it's a poorly designed API. But it's more work to implement a new API, and I don't think Encoder is sufficient. So now I'm writing my own serialization API, using lenses... (on a local branch)
As far as proceeding, at this point it's probably less work to merge my branch than to reimplement it or move patches one-at-a-time. The diff is relatively readable, except for the Core / Core2 and Database/Database2 splits, which can be worked around with
git diff `git ls-tree origin/master src/Development/Shake/Core.hs` src/Development/Core2.hs
git diff `git ls-tree origin/master src/Development/Shake/Database.hs` src/Development/Database2.hs
I recommend that you go through the diff and comment any changes you don't like; I'll fix/reverse those in new commits, then you can merge it, either as a squashed commit (there's lots of half-baked commits and false starts) or with a normal git merge.
I think I get what @snowleopard 's asking; it's possible in current Shake with a new cache primitive that uses a bimap. You would have
@Mathnerd314 I have to admit I don't understand your solution, but I'll meditate on it when I'm back to fixing argsHashOracle. If you could integrate it into Oracles.ArgsHash in a PR that would be awesome :)
I was curious what the status of this was, so I looked through master:
Thanks for that summary - a good todo list and I confirm all those things are still in progress. A couple of questions:
Current plan is most definitely Database refactoring. That code is important and complex, so want to simplify it then get rid of the execute/build/check distinction. After that everything else looks reasonably straightforward.
How are my user rules different?
You have user rules as an extra argument to execute:
executeRule :: (forall a . Typeable a => Proxy a -> UserRule a) -> key -> Action value
I didn't pass user rules at all; I stored user rules as part of Action's Global structure, and added two functions (userRule & getUserRules) to retrieve them.
For exposed via Cabal, you mean FileA etc?
The general expectation of internal modules is that they are exposed, so that you can actually use them in the rare cases where this is necessary. See for example Data.ByteString.Internal. But your internal modules are still in other-modules.
You have user rules as an extra argument to execute:
Ah yes, there are lots of argument details, I copied about 90% of yours and then differ in this part. The reason is I want the builtin rules to be able to ask for all the user rules in advance, then run advanced/time-consuming optimisations on them once, and then use them lots of times. That necessitates the rules being available during builtin rule construction rather than in Action. I don't yet do any such optimisations, but the hope is that if profiling ever shows up rule matching I could compile a single finite state automaton from all rules and match them in time linear in the length of the path, rather than linear in the number of rules.
The general expectation of internal modules is that they are exposed
I appreciate this is the standard pattern in Haskell, but I find it quite distasteful - it massively increases the API, adds a lot of ambiguity around semantic versioning, and means I can't easily figure out what people are doing with Shake. Instead I prefer to be responsive and if people need internals exposing for good reasons then expose them properly and to everyone, and ideally very quickly. I'm certainly going to expose things like FileA in some form by the end, but not just everything in Internal. As an example of where exposing internals goes wrong, consider Text, which has 27 internal modules including things like Data.Text.Internal.Builder.Int.Digits which consists of a longish ByteString with no clear semantics...
As a concrete example of in-advance rules, for file rules, I sometimes generate a Set data structure if there are lots of literal paths to match. At the moment that has to be one separate Set per |%>. With in-advance rule matching I could instead make it a Map for nearby equal priority rules not under alternative.
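A sketch of the kind of in-advance compilation this enables (the UserFileRule type and compileFileRules are hypothetical, and priorities/alternatives are ignored):

import Development.Shake
import Data.List (partition, isInfixOf)
import Data.Maybe (listToMaybe)
import qualified Data.Map.Strict as Map

-- Hypothetical representation of a user file rule: a pattern plus its action.
data UserFileRule = UserFileRule
    {rulePattern :: FilePattern
    ,ruleRun     :: FilePath -> Action ()}

-- With all user rules visible up front, literal patterns (no wildcards) can be
-- merged into a single Map and dispatched by lookup, leaving only genuine
-- wildcard patterns to be matched one at a time.
compileFileRules :: [UserFileRule] -> FilePath -> Maybe (FilePath -> Action ())
compileFileRules rules = \path ->
    case Map.lookup path literalMap of
        Just run -> Just run
        Nothing  -> listToMaybe [ruleRun r | r <- wildcards, rulePattern r ?== path]
  where
    (literals, wildcards) = partition (isLiteral . rulePattern) rules
    literalMap = Map.fromList [(rulePattern r, ruleRun r) | r <- literals]
    isLiteral p = '*' `notElem` p && not ("//" `isInfixOf` p)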
Just backing up my in progress notes, which I've been leaving in an unsaved text buffer for a few weeks, but now have to reboot: https://gist.github.com/ndmitchell/904306020655edb3b56caadc286ba8ee (I'm expecting them to be unintelligible to anyone but me!)
It seems quite readable, here's my notes for comparison: https://gist.githubusercontent.com/Mathnerd314/e9e9f0d667270eaf37a9bfced53c8b58/raw/4ad976d083f102107133ae640722ae8c0bb5a343/gistfile1.txt
I switched to running storedValue in the thread pool. With -j1 it goes twice as slow for checking 1000's of filetimes. At -j4 it goes slightly faster. The bottleneck is reported less in pool and more in the waiting/rendezvous code, which might need some microoptimisation. However, such optimisation can wait - it's certainly in a ball park of fast enough. Progress continues towards the end goal.
How is this going? Still waiting so I can port my code to a newer version of Shake 8)
Progress continues - I think the addBuiltinRule now has the final signature. I think the file rules will continue to change, so that writing combinations of file rules is easier.
What's the state of this issue? I still would like to implement some kind of dirty flag for #289, but in a comment to #289, you said that such a flag is kind of blocked by this issue here.
I'm 80% through the implementation and hope to finish end of Jan.
I think the API is completely rewritten now, so this can be closed.
Various tickets have led me to the conclusion that I should redesign the Rules class. This ticket is to discuss that endeavour. Notes:
- *.exe files.

The code is all under https://github.com/ndmitchell/shake/tree/master/src/Development/Shake/Experiment. Interface.hs is the proposed API, and Rules.hs is my implementation of equivalent existing rules under that API, mostly stubs.
CC @ezyang and @Mathnerd314 whose requests prompted this redesign.