target / theta-idl

Define communication protocols between applications using algebraic data types.
Other
45 stars 9 forks source link

Frequent rebuilds due to use of autogenerated Paths_theta module #37

Closed istathar closed 2 years ago

istathar commented 2 years ago

We spend a surprising amount of time watching theta recompile when it's already built and cached. The cause of this is something I've seen before, which is the use of the "autogen" Paths_whatever modules. In this case, the library is using it for the version constant and I think to embed a path or two.

You can see it here: this is the third or fourth time in a series of builds that the same library gets rebuilt, and it's pretty clear that Paths_theta is getting recompiled, which then triggers the modules containing Template Haskell to re-evaluate.

theta                        > Preprocessing library for theta-1.0.0.0..
theta                        > Building library for theta-1.0.0.0..
theta                        > [14 of 23] Compiling Paths_theta
theta                        > [16 of 23] Compiling Theta.Error [TH]
theta                        > [17 of 23] Compiling Theta.Target.Avro.Error [TH]
theta                        > [18 of 23] Compiling Theta.Target.Avro.Types [TH]
theta                        > [19 of 23] Compiling Theta.Target.Python [TH]
theta                        > [21 of 23] Compiling Theta.Target.Haskell.Conversion [TH]
theta                        > Preprocessing executable 'theta' for theta-1.0.0.0..
theta                        > Building executable 'theta' for theta-1.0.0.0..
theta                        > [8 of 8] Compiling Paths_theta
theta                        > Linking .stack-work/dist/x86_64-linux-tinfo6/Cabal-3.2.1.0/build/theta/theta ...
theta                        > copy/register
theta                        > Installing library in
theta                        > Installing executable theta in
theta                        > Registering library for theta-1.0.0.0..

There are a few different techniques for getting the version number in, none of them ideal; indeed, the Paths_whatever thing would be great if it didn't trigger package cache invalidation and rebuilds all the time.

I'd like to ask someone on my team to go and flush this out, but before I do I wanted to check in with you and see if you would consider this an improvement, or if there's something about the Paths_theta here that is really important and so me blundering in trying to "fix" it wouldn't be helpful at all.

Have you experienced anything along these lines? I'm aware you use different build tooling than we do; this is possibly a corner case because we're including it from source rather than getting it from Hackage but there are several other packages vendored in (notably amazonka!) that aren't demonstrating this rebuild behaviour.

TikhonJelvis commented 2 years ago

Interesting. I haven't seen that problem, but Nix manages builds and caching in a different way than Stack which might explain it.

Do you know what is causing Paths_theta to be rebuilt? Is it picking up on changes to files in the repo that shouldn't matter?

I initially used Paths_theta because it seemed to be the "correct" way to refer to the test data files included with the package. Picking up the package version was just a convenience. If it fixes the rebuilding problem, I'd be fine managing the version number in some other way. Looks like you're not rebuilding the test suite anyhow, so I assume we can leave that unchanged.

istathar commented 2 years ago

Do you know what is causing Paths_theta to be rebuilt?

I think every time the Cabal library does its initialization pass it updates that file, so I'm guessing that it's getting refreshed enough (even if just timestamp) for GHC to think it needs to be recompiled. Cabal in turn is getting re-run probably by each [executable] package that depends on it.

some other way

They really are worse; I've used CPP, TemlateHaskell, Cabal configurations, and several other tricks to shlurp the version number in. Anyway, we'll take this as an open issue and see if we can come up with something minimally invasive that stops the rebuilds!

TikhonJelvis commented 2 years ago

I thought that timestamps might be part of the problem. I don't know exactly how it works with Nix, but I believe it does something to control timestamps to maintain reproducibility.

Honestly, I could live with something moderately hacky like a VERSION file and a CI check to make sure it's up to date with the cabal file.

TikhonJelvis commented 2 years ago

Oh, realized an even easier version of my hacky idea: as long as depending on Paths_theta in the test suite isn't causing problems, I could hardcode packageVersion in src/Theta/Versions.hs, then have a plain unit test that would fail if it isn't updated alongside the .cabal file.

That's a bit manual but in a way that isn't a real problem, and it's an easy change to make. Would only depending on Paths_theta from the test-suite actually solve the problem you're seeing?

istathar commented 2 years ago

That would definitely solve it, yes.

TikhonJelvis commented 2 years ago

Just merged the change Paths_theta change into stage. Hope this fixes the problem for you.

I'll do a real "release" at some point with a version bump to the .cabal file and start uploading the package to Hackage. I cleaned up all the warnings from cabal check recently, but I'll also want to come up with reasonable version bounds for all of my dependencies, and that seems like it'll take some effort.

istathar commented 2 years ago

Is stage now the default branch? We had forked main and missed the fix, I think?

TikhonJelvis commented 2 years ago

Oh yeah, I made stage the default branch a little while ago—I wanted to have a place to merge PRs and run additional tests. My current thought is to treat main as the place for "official" releases, which will require a version bump, publishing packages... etc. I haven't done a release like that yet though.

main will now be stable and should consistently be in a working state. stage will change more often and will occasionally break for things I don't regularly test, which includes the Stack build. When that happens CI on stage will fail and I'll get an email, so it shouldn't stay broken for long. I want to keep the "normal" CI around 3 minutes so I'm moving slower checks to stage-only—going to move the randomized cross-language tests there next.

I am not sure that this is the best approach. I want to have a separation between merging PRs and "releasing", and I want a place to run CI checks that take longer. If you have some suggestions about this, I'd love to hear them.

But also, I only merged the Paths_theta fix a few hours ago, so chances are you wouldn't have had it either way.

istathar commented 2 years ago

Ok, I've pulled from stage and run a test. All good here. Thanks!

[as an aside, and worthy of another issue, in a parallel universe I might suggest putting the library in theta/ and the actual Theta binary in theta-tool/ or something (because otherwise people depending on the library have the binary installed in their local work dir every time), but that's a minor detail and not nearly as impactful as the main change above was. This is mostly just disk space]

My current thought is to treat main as the place for "official" releases

There's no right answer, hey.

The alternative is that main is always the (possibly unstable) tip and release or release-0.7 or whatever is where forks are made prior to actually releasing. The reasons are thin, but I've biased towards this in my own libraries only because if someone comes along and wants to contribute they probably need to be contributing against the latest code. Otherwise [and only because forking & checking out defaults to main unless they're careful] they end up not realizing that you've moved on or even fixed the issue they're talking about.

You've made good arguments. I actually look forward to hearing what you think after you've tried it for a while.

TikhonJelvis commented 2 years ago

Otherwise [and only because forking & checking out defaults to main unless they're careful] they end up not realizing that you've moved on or even fixed the issue they're talking about.

That's a good point. I made stage the default branch on GitHub, so it should be checked out automatically, but sounds like the naming is surprising. Now that I think about it, mainrelease makes more sense than stagemain. There's no real "staging" environment involved here and release makes for a clearer naming scheme if I want version-specific branches (in case I need to apply a bug fix to several previous major/minor releases).

[as an aside, and worthy of another issue, in a parallel universe I might suggest putting the library in theta/ and the actual Theta binary in theta-tool/ or something (because otherwise people depending on the library have the binary installed in their local work dir every time), but that's a minor detail and not nearly as impactful as the main change above was. This is mostly just disk space]

Good idea, I'll create an issue for it.