Closed Columbus240 closed 2 years ago
Now some measurements: I ran `time make -f Makefile.coq Structure/Monoidal/Internal/Product.vo` three times, with all the other files already compiled, and wrote down the minimal and maximal "real" times I got.
After the change:
- Structure/Monoidal/Relevant.v: 14.2s to 14.5s
- Structure/Monoidal/Internal/Product.v: 6.9s to 7.3s

Before the change:
- Structure/Monoidal/Relevant.v: 44.1s to 44.9s
- Structure/Monoidal/Internal/Product.v: 55.5s to 57.5s
What's different: when Coq has to rewrite deep inside a large term, it has to generate a large proof term and do a lot of work (somehow). By helping it along, it can generate smaller terms and do less unification/search (or whatever it does). By writing "smarter" simplification tactics than the current `normal` tactic etc., such an approach could be automated.
I also added some changes to other files, which I had lying around, to this branch.
I love the idea of speeding up the proofs, I'm just wondering how many of the proofs that were greatly lengthened are fully necessary. With much longer proofs, the maintenance burden increases. Can you share some insights as to the rationale behind the changes you made that required lengthening?
Thanks for your kind consideration and sorry for the lack of explanation in my previous posts.
The proofs I changed (which use `unfork`) are slow, mostly because of `rewrite <- !fork_comp`. The sub-terms rewritten by this tactic are often "deep inside" a term, and the proof term that witnesses the correctness of the rewrite has to use many applications of `fork_respects`, `compose_respects`, `split_respects` etc. as intermediaries (instances of `Proper`). (I didn't precisely determine where Coq spends its time or what the proof terms look like.) These intermediaries have to be found by typeclass resolution; maybe that's why it's slow.
By explicitly writing down when which `Proper` should be applied, two steps become quicker or fall away entirely. First, Coq doesn't have to find the right typeclass instances itself. Second, if the same `Proper` is used multiple times on the same arguments, these calls can be "merged". Concrete example: let `f a b = f c d` be the theorem to prove, given `a = c` and `b = d`. Substituting the first equation yields the goal `f c b = f c d` and one application of `f_proper` (or whatever we call it) in the proof term. The second substitution then yields `f c d = f c d` by another application of `f_proper`, and finally the goal is solved by reflexivity.
On the other hand, my approach first applies `f_proper`, yielding the two goals `a = c` and `b = d`, both of which can be solved by assumption.
This certainly constructs a simpler proof term, and the problem that typeclass resolution has to solve is simpler.
But I don’t know how much these effects contribute to the speedup, or whether there is some other mechanism at work.
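To make the two styles concrete, here is a minimal, self-contained Coq sketch. The setoid `R`, the operation `f`, and the instance `f_proper` are stand-ins of my own, not the library's actual relations and `Proper` instances (such as `fork_respects`):

```coq
Require Import Coq.Setoids.Setoid.
Require Import Coq.Classes.Morphisms.

Parameter A : Type.
Parameter R : relation A.
Axiom R_equiv : Equivalence R.
Existing Instance R_equiv.

Parameter f : A -> A -> A.
Axiom f_proper : Proper (R ==> R ==> R) f.
Existing Instance f_proper.

(* Style 1: two setoid rewrites; each must re-derive the [Proper]
   chain by typeclass resolution, and [f_proper] ends up in the
   proof term once per rewrite. *)
Goal forall a b c d, R a c -> R b d -> R (f a b) (f c d).
Proof.
  intros a b c d Hac Hbd.
  rewrite Hac, Hbd. reflexivity.
Qed.

(* Style 2: apply [f_proper] once at the head, splitting the goal
   directly into [R a c] and [R b d]; no typeclass search, and a
   smaller proof term. *)
Goal forall a b c d, R a c -> R b d -> R (f a b) (f c d).
Proof.
  intros a b c d Hac Hbd.
  apply f_proper; assumption.
Qed.
```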
Now, restating the above: the `unfork` tactic consists almost entirely of various `rewrite _` tactics, all of which have to go through/create lots of nested functions and `Proper` instances, which is very slow. Doing the rewrites in a different manner (as presented) is quicker.
Sometimes I used `transitivity` as an intermediate step where I couldn't figure out how to do all the rewrites in one go.
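For illustration (using plain Leibniz equality for brevity, rather than the library's morphism equivalence), such a chain can be split with an explicit intermediate term; the term `f c b` below is chosen by hand, just as the intermediates in the changed proofs are:

```coq
Goal forall (f : nat -> nat -> nat) (a b c d : nat),
    a = c -> b = d -> f a b = f c d.
Proof.
  intros f a b c d Hac Hbd.
  transitivity (f c b).  (* hand-picked intermediate term *)
  - now rewrite Hac.
  - now rewrite Hbd.
Qed.
```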
I agree (a lot) that the way I wrote down the proofs is overly lengthy and impractical. Ideally, we'd package this other way of doing the `rewrite`s in a new tactic (or adapt `unfork`). The tactics `fork_simpl` and `solve_fork_split` contain some work in this direction, because `fork_simpl` only performs a `rewrite` if it can be done at the top level, i.e. at the outermost function of the term.
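A rough, hypothetical Ltac sketch of how such a tactic might look; `compose_respects` and `fork_respects` here stand for the library's actual lemmas, and a real implementation would of course need to handle more cases:

```coq
(* Repeatedly apply a "respects" lemma at the head of the goal,
   instead of letting [rewrite] search for the [Proper] chain. *)
Ltac head_step :=
  first
    [ reflexivity
    | assumption
    | apply compose_respects  (* split goals about composites *)
    | apply fork_respects ].  (* split goals about forks *)

Ltac head_simpl := repeat head_step.
```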
By the by: is there some way to run a tactic and create a flamegraph (or similar) of the OCaml code that Coq runs? That could help determine where exactly the speed-up lies. I don't know Coq's performance-analysis tools very well; the tools in the makefiles generated by `coq_makefile` are a bit convoluted to use, IMO.
A different approach to this problem would be to use your Solver: translate the lemmas into a simpler setting with smaller proof terms, and translate the generated proofs back again. (If I understood the idea correctly.) But I don't know whether it is currently functional or how to use it.
Do you deem the "other way of doing `rewrite`" viable for further work, i.e. writing a tactic for it? It would be nice if equations of morphisms in cartesian categories (and/or similar) had a strong normal form, so we would only have to bring each side of an equation into normal form and then check for reflexivity, greatly simplifying the "adapted `unfork`". Do you know of something like that? Or is `unfork` the implementation of this?
Thanks for the details!
So, I tried merging in your changes locally and running `make` on a clean working tree, then reverting back to master and running `make` on a clean tree again. These are the graphs of the reported real, sys and user times:
For the slow run, with plain master:
For the fast run, with this PR:
At least on my machine, I'm seeing some difference in a few files, but in the grand scheme of things it hasn't changed too much overall? I can definitely see three spikes that are shaved down in the faster graph, and a few that are indeed smaller.
I'm not sure how to precisely interpret your graphs, though I get the gist. What is represented on the horizontal axis? I think comparing the compilation times for individual files (and/or only for the files that were changed) might be clearer. The files that weren't changed should still have similar compilation times (as long as the proof terms of the changed files aren't used).
Do you know which file(s) the largest spike corresponds to (about 500-550 units high at ~65 on the x-axis)?
Ehm, I recall that the Coq developers have some way of benchmarking and comparing how a patch affects build times... I might look into that.
The graph is seconds against individual files. You're right that I should confine it to the files changed.
I'm going to start adapting and incorporating these changes in PR #40. Thank you for laying the groundwork, @Columbus240!
I've merged in the obvious improvements; I think the rest represent too much additional complexity (and thus, future maintenance burden) to be worth the performance improvements. I'll see if I can debug things further and find more systemic changes that could help things move faster. Thank you for all your work and input!
I'm happy that you could do something with this.
The proof automation was very slow in the files I changed here. By providing some terms manually and being more explicit about where the "respects" statements that simplify the goal go, I could speed it up. Especially the `normal` tactic (in Relevant.v) was slow, and the `Local Obligation Tactic` in Product.v. But I don't recall the precise circumstances of the latter. The only time measurements I currently have (but haven't written down) are from stepping through the proofs and from running make. I could do some proper measurements if you want.
Stylistically, it isn't very pretty. I'm open to suggestions.