Cache package metadata parsing

snowleopard commented 6 years ago

When I run Hadrian on my machine, the first minute or so is spent on parsing Cabal files:

$ ./build.bat -j --flavour=quickest --integer-simple --verbose
hadrian-0.1.0.0: configure (exe)
hadrian-0.1.0.0: build (exe)
hadrian-0.1.0.0: copy/register
Log files have been written to: C:\msys\home\nam83\ghc\hadrian\.stack-work\logs\
| KeyValue oracle: reading 'hadrian/cfg/system.config'...
Building stage2
| Building Programs:  ghc, ghc-pkg, hsc2hs, unlit, haddock, runhaskell, hpc
| Building Libraries: binary, Cabal, ghc, ghc-boot, ghc-boot-th, ghci, hpc, mtl, parsec, template-haskell, text, transformers, array, base, bytestring, containers, deepseq, directory, filepath, ghc-compact, ghc-prim, haskeline, integer-simple, pretty, process, rts, stm, time, xhtml, Win32

| CabalFile oracle: reading 'libraries/binary/binary.cabal' (Stage: stage1)...
| CabalFile oracle: reading 'libraries/Cabal/Cabal/Cabal.cabal' (Stage: stage1)...
| CabalFile oracle: reading 'compiler/ghc.cabal' (Stage: stage1)...
...

This used to be cached in package-data.mk files substantially speeding up subsequent builds, but now this seems to be a constant cost for any build, unless I'm mistaken.

(The build soon fails on my machine due to an unrelated issue, so I can't try a zero build, but I assume there is no caching at the moment.)

@angerman @alpmestan Do you have any plans on how to cache this?

alpmestan commented 6 years ago

@snowleopard I think both of us were hoping there was caching going on here. Maybe it happens within a build but not accross builds? I'm really not sure -- the oracle system comes with caching no? So if we ask twice for the Cabal data of a package, hadrian will only hit the .cabal file once, the first time, right?

(I'm talking about these oracles: https://github.com/snowleopard/hadrian/blob/master/src/Hadrian/Oracles/TextFile.hs#L101 -- "instantiated" below in textFileOracle.)

snowleopard commented 6 years ago

@alpmestan I think the newCache combinator only caches computations within a single build.

I remember that Shake now provides some new/nicer infrastructure for storing arbitrary information in the Shake database. @ndmitchell Could you give a pointer?

alpmestan commented 6 years ago

@snowleopard Ooooooh, so that's the catch. If Neil points us to something that persists accross builds, then all we have to do is change newCache to that to perform the parsing just once, when we first run hadrian. (or once per build-root maybe? where does shake's db live?)

snowleopard commented 6 years ago

(A side note after thinking about such across-build caching: I think the ability to support 'virtual' cached build artefacts is what distinguishes traditional build systems from self-adjusting computations.)

ndmitchell commented 6 years ago

Why not persist it in a real file? Storing it in the Shake DB means Shake has to cache it in memory (not necessarily desirable) and means that it's harder to debug. Shake could easily implement something that cached persistently, but it hasn't yet.

snowleopard commented 6 years ago

@ndmitchell It's not too difficult to store it by ourselves, but it seems that coming up with a filename, parsing and serialisation is unnecessary complexity that can be delegated to a build system.

ndmitchell commented 6 years ago

Coming up with a unique name (whether that is a type name or a filename) is still required. Serialising/parsing is still required. You really don't eliminate much complexity...

snowleopard commented 6 years ago

I thought we could have a combination of newCache and askOracle:

newPersistentCache :: (PersistentCacheResult q ~ a, ShakeValue q, ShakeValue a)
                   => (q -> Action a) -> Rules (q -> Action a)

Now we could instantiate this with q = Package and a = PackageMetadata, and cache the existing function parseCabal :: Package -> Action PackageMetadata in Shake's database.

With this approach we don't need to invent a name for an intermediate file, and we don't need to parse or serialise PackageMetadata, because this can be done in a generic way by Shake.

P.S.: It appears that we do need to keep all PackageMetadata in memory anyway, so there is no overhead in asking Shake to keep it in memory for every build. Of course, the Action should be rebuilt if its dependencies have changed.

ndmitchell commented 6 years ago

Package is the name for the intermediate keys that you just invented... Certainly no reason this couldn't be implemented in Shake, and moreover you can implement it yourself on the outside of Shake, but it isn't baked in yet.

snowleopard commented 6 years ago

moreover you can implement it yourself on the outside of Shake

Yep, we should give this a try. Is this the right starting point?

https://hackage.haskell.org/package/shake-0.16.3/docs/Development-Shake-Rule.html

ndmitchell commented 6 years ago

Yep, addBuiltinRule is the place to start from. Feedback on the docs etc most welcome!

ndmitchell commented 6 years ago

@snowleopard I decided to do this myself in Shake, so just wait for the next release...

snowleopard commented 6 years ago

@ndmitchell Cool, thank you!

ndmitchell commented 6 years ago

Just pushed, see addOracleCache. I guess you need a release to actually use it?

snowleopard commented 6 years ago

That was fast! :) Yes, I think a release would be more convenient. But we could start playing with the new feature locally pointing Stack to Shake's HEAD, so it's not too urgent.

ndmitchell commented 6 years ago

0.16.4 released with addOracleCache.

snowleopard commented 6 years ago

Awesome, that makes this issue easy :)

snowleopard commented 5 years ago

Fixed in #692.

snowleopard commented 5 years ago

@ndmitchell Switching to addOracleCache was very easy -- thanks for adding this feature!

However, I suspect that I can make some further simplifications. Here is an example:

    packageData <- newCache $ \package -> do
        let file = pkgCabalFile package
        need [file]
        putLoud $ "| PackageData oracle: parsing " ++ quote file ++ "..."
        parsePackageData package
    void $ addOracleCache $ \(PackageDataKey package) -> packageData package

We used to have addOracle here, and now we switched to addOracleCache. But do we still need newCache? Presumably addOracleCache already does the caching, so I can rewrite the above simply to:

    void $ addOracleCache $ \(PackageDataKey package) -> do
        let file = pkgCabalFile package
        need [file]
        putLoud $ "| PackageData oracle: parsing " ++ quote file ++ "..."
        parsePackageData package

Should I?

ndmitchell commented 5 years ago

Yes, simplify! I suspect the newCache there never helped anyway - addOracleCache is caching between runs (persistent), but normal addOracle caches within a single run, as does addOracle, so it probably didn't help.

snowleopard commented 5 years ago

@ndmitchell Thanks! Done in #694.

What about cases like this?

    kv <- newCache $ \file -> do
        need [file]
        putLoud $ "| KeyValue oracle: reading " ++ quote file ++ "..."
        liftIO $ readConfigFile file
    void $ addOracleCache $ \(KeyValue (file, key)) -> Map.lookup key <$> kv file

Here newCache is presumably useful, since we don't want to parse the key-value file multiple times when answering queries for individual keys. Is there a way we could simplify this?

ndmitchell commented 5 years ago

That's the right way of writing it in Shake. You can of course abstract it away if it's a common pattern with function abstraction, like normal.

snowleopard / hadrian

Cache package metadata parsing #550