jwiegley / category-theory

An axiom-free formalization of category theory in Coq for personal study and practical work
BSD 3-Clause "New" or "Revised" License

Speed up some proofs #22

Closed. Columbus240 closed this 2 years ago

Columbus240 commented 2 years ago

The proof automation was very slow in the files I changed here. By providing some terms manually and being more explicit about where the "respects" statements that simplify the goal are applied, I could speed it up. The normal tactic (in Relevant.v) was especially slow, as was the Local Obligation Tactic in Product.v, though I don't recall the precise circumstances of the latter.

The only time measurements I currently have (though I didn't write them down) are from stepping through the proofs and from running make. I can do some proper measurements if you want.

Stylistically, it isn't very pretty. I'm open to suggestions.

Columbus240 commented 2 years ago

Now for some measurements: I called time make -f Makefile.coq Structure/Monoidal/Internal/Product.vo three times, with all the other files already compiled, and wrote down the minimal and maximal "real" times I got.

After the change:
Structure/Monoidal/Relevant.v: 14.2s to 14.5s
Structure/Monoidal/Internal/Product.v: 6.9s to 7.3s

Before the change:
Structure/Monoidal/Relevant.v: 44.1s to 44.9s
Structure/Monoidal/Internal/Product.v: 55.5s to 57.5s

What's different: When Coq has to rewrite deep inside a large term, it has to generate a large proof term and do lots of work (somehow). By helping it along, it can generate smaller terms and do less unification/search (or whatever it does).
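
A tiny self-contained sketch of the effect I mean (the names A, eqA, g and h are made up for this example, they are not from the library): when the rewritten variable sits under several applications, the proof term for a single rewrite has to go through a Proper instance for every enclosing call.

```coq
Require Import Coq.Setoids.Setoid.
Require Import Coq.Classes.Morphisms.

Parameter A : Type.
Parameter eqA : A -> A -> Prop.
Axiom eqA_equiv : Equivalence eqA.
Existing Instance eqA_equiv.

Parameter g h : A -> A.
Axiom g_proper : Proper (eqA ==> eqA) g.
Axiom h_proper : Proper (eqA ==> eqA) h.
Existing Instances g_proper h_proper.

Goal forall x y, eqA x y -> eqA (g (h (g x))) (g (h (g y))).
Proof.
  intros x y Hxy.
  (* One rewrite, but the proof term threads through g_proper and h_proper
     once for each of the three enclosing applications. *)
  rewrite Hxy.
  reflexivity.
Qed.
```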

By writing "smarter" simplification tactics than the current normal tactic etc., such an approach could be automated.

Columbus240 commented 2 years ago

I added to this branch some changes to other files that I had lying around.

jwiegley commented 2 years ago

I love the idea of speeding up the proofs; I'm just wondering whether all of the proofs that were greatly lengthened really need to be. With much longer proofs, the maintenance burden increases. Can you share some insight into the rationale behind the changes that required lengthening?

Columbus240 commented 2 years ago

Thanks for your kind consideration and sorry for the lack of explanation in my previous posts.

The proofs I changed (the ones that use unfork) are slow mostly because of rewrite <- !fork_comp. The sub-terms rewritten by this tactic are often "deep inside" a term, and the proof term that witnesses the correctness of the rewrite has to use a lot of fork_respects, compose_respects, split_respects etc. as intermediaries (instances of Proper). (I didn't precisely determine where Coq spends its time or what the proof terms look like.) These intermediaries have to be found by typeclass resolution; maybe that's why it's slow?

By explicitly writing down when which Proper should be applied, two steps become quicker or fall away entirely. First, Coq doesn't have to find the right typeclass instances itself. Second, if the same Proper is used multiple times on the same arguments, these uses can be "merged". Concrete example: let f a b = f c d be the theorem to prove, given a = c and b = d. Rewriting with the first equation yields the goal f c b = f c d and one application of f_proper (or whatever we call it) in the proof term. Rewriting with the second equation then yields f c d = f c d via another application of f_proper, and finally the goal is solved by reflexivity. My approach, on the other hand, applies f_proper directly, yielding the two goals a = c and b = d, which can both be solved by assumption. This constructs a simpler proof term, and the problem typeclass resolution has to solve is simpler.
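
A minimal Coq sketch of the two styles (A, eqA, f and f_proper are placeholder names for this example, not the library's fork/compose lemmas):

```coq
Require Import Coq.Setoids.Setoid.
Require Import Coq.Classes.Morphisms.

Parameter A : Type.
Parameter eqA : A -> A -> Prop.
Axiom eqA_equiv : Equivalence eqA.
Existing Instance eqA_equiv.

Parameter f : A -> A -> A.
Axiom f_proper : Proper (eqA ==> eqA ==> eqA) f.
Existing Instance f_proper.

(* Rewriting style: every rewrite invokes setoid rewriting and goes through
   f_proper separately. *)
Goal forall a b c d, eqA a c -> eqA b d -> eqA (f a b) (f c d).
Proof.
  intros a b c d Hac Hbd.
  rewrite Hac.   (* goal: eqA (f c b) (f c d) *)
  rewrite Hbd.   (* goal: eqA (f c d) (f c d) *)
  reflexivity.
Qed.

(* Explicit style: apply the Proper instance once and get both equations as
   direct subgoals. *)
Goal forall a b c d, eqA a c -> eqA b d -> eqA (f a b) (f c d).
Proof.
  intros a b c d Hac Hbd.
  apply f_proper; assumption.
Qed.
```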

But I don’t know how much these effects contribute to the speedup, or whether there is some other mechanism at work.

Now, restating the above: the unfork tactic consists almost entirely of various rewrite _ tactics, which all have to go through (and create) lots of nested Proper applications, which is very slow. Doing the rewrites in a different manner (as presented) is quicker.

Sometimes I used transitivity for intermediate steps where I couldn't figure out how to do all the rewrites in one go.

I agree (a lot) that the way I wrote down the proofs is overly lengthy and impractical. Ideally, we'd package this other way of doing the rewrites into a new tactic (or adapt unfork). The tactics fork_simpl and solve_fork_split contain some work in this direction: fork_simpl only performs a rewrite if it can be done at the top level, i.e. at the outermost function of the term.
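
A rough, purely illustrative sketch of that idea (this is not the actual fork_simpl/solve_fork_split from this branch; it only assumes the goal relation is registered as Transitive):

```coq
Require Import Coq.Setoids.Setoid.

(* "Rewrite" only at the head of the left-hand side: instead of letting
   `rewrite lem` search the whole goal (and thread the proof through every
   enclosing Proper instance), use the lemma only if it applies to the
   outermost application. *)
Ltac head_rewrite lem :=
  etransitivity; [ apply lem | ].
```

Chaining such head-only steps and finishing with reflexivity could then play the role of the rewrite sequence in unfork, without descending into subterms.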

By the by: is there some way to run a tactic and create a flamegraph (or similar) of the OCaml code that Coq runs? This could help in determining where exactly the speed-up comes from. I don't know Coq's performance analysis tools very well. The tools in the makefiles generated by coq_makefile are a bit convoluted to use, IMO.

Columbus240 commented 2 years ago

A different approach to solving this problem would be to use your Solver: translate the lemmas into a simpler setting with smaller proof terms, and translate the generated proofs back again (if I understood the idea correctly). But I don't know whether it is currently functional, or how to use it.

Do you deem the "other way of doing rewrite" viable for further work, i.e. writing a tactic for it? It would be nice if equations between morphisms in cartesian categories (and/or similar structures) had a strong normal form, so we would only have to bring each side of the equation into normal form and then check for reflexivity. That would greatly simplify the "adapted unfork". Do you know of something like that? Or is unfork the implementation of this?

jwiegley commented 2 years ago

Thanks for the details!

So, I tried merging in your changes locally and running make on a clean working tree, and then reverting back to master and running make on a clean tree again. These are the graphs of the reported real, sys and user times:

For the slow run, with plain master: [graph: Slow]

For the fast run, with this PR: [graph: Fast]

At least on my machine, I'm seeing some difference in a few files, but in the grand scheme of things it hasn't changed too much overall? I can definitely see three spikes that are shaved down in the faster graph, and a few others that are indeed smaller.

Columbus240 commented 2 years ago

I'm not sure how to interpret your graphs precisely, though I get the gist. What is represented on the horizontal axis? I think comparing the compilation times for individual files (and/or only for the files that were changed) might be clearer. The files that weren't changed should still have similar compilation times (as long as the proof terms of the changed files aren't used).

Do you know which file(s) the largest spike corresponds to (about 500-550 units high, at an x-value of ~65)?

Ehm, I recall that the Coq developers have some way of benchmarking and comparing how a patch affects build times... I might look into that.

jwiegley commented 2 years ago

The graph is seconds against individual files. You're right that I should confine it to the files changed.

jwiegley commented 2 years ago

I'm going to start adapting and incorporating these changes in PR #40. Thank you for laying the groundwork, @Columbus240!

jwiegley commented 2 years ago

I've merged in the obvious improvements; I think the rest represent too much additional complexity (and thus, future maintenance burden) to be worth the performance improvements. I'll see if I can debug things further and find more systemic changes that could help things move faster. Thank you for all your work and input!

Columbus240 commented 2 years ago

I'm happy that you could do something with this.