ocaml / dune

A composable build system for OCaml.
https://dune.build/
MIT License
1.6k stars 395 forks source link

[coq] sandbox extraction rules #5835

Open Alizter opened 2 years ago

Alizter commented 2 years ago

We should sandbox the extraction rules. We don't usually have many of them, so it would pay off to make sure they are sound.

ejgallego commented 2 years ago

Recall that Coq needs all transitive .vo present when sandboxed. This will likely make any sandboxing impractical unless you introduce a cut.

Alizter commented 2 years ago

coq.extraction stanzas typically invoke a single rule, so sandboxing this rule, even if it depends on all the vo files in a theory, shouldn't be expensive. We can use symlinks, hardlinks and copy modes for sandboxing. If copy was the default case, I would agree with you, however I don't think this is the case.

ejgallego commented 2 years ago

The expensive part here will be the combinatorial explosion of the dep DAG.

rgrinberg commented 2 years ago

If an action is expected to read n dependencies, surely the cost of adding n edges in the build system is trivial in comparison.

ejgallego commented 2 years ago

If an action is expected to read n dependencies, surely the cost of adding n edges in the build system is trivial in comparison.

I was assuming adding n edges would increase memory use by n, isn't that correct?

If that's the case, your memory consumption gets O(k^n) where n is the number of extraction stanzas and k is a very bad number for practical purposes. I guess that's fine if you got a few extraction stanzas, but even so it degenerates quickly.

rgrinberg commented 2 years ago

I was assuming adding n edges would increase memory use by n, isn't that correct?

Sure it would. It's just a trivial amount in comparison to how expensive running this action is.

If that's the case, your memory consumption gets O(k^n) where n is the number of extraction stanzas and k is a very bad number for practical purposes. I guess that's fine if you got a few extraction stanzas, but even so it degenerates quickly.

The math gets trickier with multiple stanzas. Consider the following:

So in practice, this isn't going to strain dune very much at all.

ejgallego commented 2 years ago

If you use a glob you increase build times by a lot, we can measure that using coq universe now.

rgrinberg commented 2 years ago

And yet we do essentially the exact same thing for OCaml and our overhead is somewhere in the noise. I don't think our workspaces are much smaller than coq-universe.

Sure, the overhead from sandboxing then starts to matter when you add this many deps. But in this case, I'd argue that coq's extraction is a non modular process that even this overhead is meaningless.

ejgallego commented 2 years ago

I was worried about the overhead in terms of compilation, as you depend on too much stuff. For example, if you depend on (glob_files *.vo) in Compcert, it may be very possible that it increases quite a lot the build time as you have now added some files (that in fact are kept separated for that reason) to the ML binary build path.

There is rarely going to be a case in Coq where over approximating dependencies is a win, due to how costly type-checking is.

Alizter commented 2 years ago

@ejgallego I am not really following your last answer, could you elaborate? How does depending on all vo files for a single extraction rule cause any issues here? You need the vo files when running extraction, so what do we gain by no specifying them apart from incomplete deps?

ejgallego commented 2 years ago

@Alizter let me do an example, for example, with CompCert's Deadcode.v and Deadcodeproof.v, where the proofs take a lot longer to compile than the algorithm.

Now, with the current setup, extraction only depends on Deadcode.vo, thus, rules depending on compcert.exe can start to run much faster as they don't need to wait for Deadcodeproof.v.

With the (glob_files ), now you've introduced a large amount of latency in this common workflow.

Alizter commented 2 years ago

@ejgallego What is the relationship between Deadcode.v and Deadcodeproof.v? Doesn't the extraction stanza let you specify which modules you want to extract?

We only need to depend on the vo files in the dependency graph.

ejgallego commented 2 years ago

The relationship is Deadcodeproof depends on the implementation Deadcode. Extraction depends on Deadcode

We only need to depend on the vo files in the dependency graph.

You just said before "How does depending on all vo files for a single extraction rule cause any issues here?" Hence my answer.

If you depend only on the files you need, that's a different story.

Alizter commented 2 years ago

Right then we have some miscommunication. Are you in principle ok with depending on all vo files that are transitive dependencies?

I know we mentioned that coqc does some trick to avoid this, but here we really don't need to do that.

ejgallego commented 2 years ago

Yes, I'm OK. As I mentioned in private I'm just trying to be sure you are aware of all the tradeoffs, as I first implemented the Coq rules with sandboxing but were very slow.

My previous comment was about Rudi's trick, indeed, you can avoid the combinatorial memory explosion of sandboxing in the case of Coq, by indeed adding a single edge to a (glob node, but that may be costly (or not, but with the current experiments we are doing I bet it is, as with the current no-sandboxed setup our dep graph is very limit but seems can remain precise for now)

I wish Coq was more like go tho and we didn't have to depend on all the files a .vo ever saw.

rgrinberg commented 2 years ago

@ejgallego thanks for clarifying. I think I understand the issue now. Allow me to rephrase it to confirm:

We would like to not approximate dependencies at the level of theories with globs. Our dependencies will already be precisely specified by the transitive closure of coqmod.

If that's the argument, then I agree that sandboxing in its current is suboptimal for coqc. All the other arguments about the overhead of globs or the memory taken by their associated dependency edges are of dubious importance.

Unfortunately, I might have to move the goalposts a bit to get sandboxing for the extraction rules anyway. If we want to avoid specifying all the extracted targets, we must use directory targets. And the current implementation of directory targets requires the rules producing them to be sandboxed. So if we want to move to this more convenient system that avoids specifying all the targets, we'll need to compromise performance a little bit.

ejgallego commented 2 years ago

Thanks for the summary @rgrinberg ; I think that indeed the tradeoff between more precise deps and faster build is still not fully understood for Coq projects, we can measure it now tho. In general, I'd say that Coq users will complain a lot for unnecessary rebuilds as unfortunately Coq is very slow and provides a bad dev experience.

The cost of sandboxing itself when compared to the cost of coqc is IMHO negligible. Dep edges can grow exponentially so in the wider context you gotta be careful about Dune's mem comsumption.

Regarding extraction, maybe indeed that would be a good candidate for the Action protocol, so coqc could tell dune what targets it has generated. There are other ways to do that too maybe as you folks have discussed.

So far people using (coq.extraction ...) where not too annoyed to have to list the modules tho, so I think that feature is for now "nice to have".

rgrinberg commented 2 years ago

So far people using (coq.extraction ...) where not too annoyed to have to list the modules tho, so I think that feature is for now "nice to have".

I think it's just a matter of time through. Listing the modules is inherently anti modular if I understand correctly how it works. Consider a user writing a theory A that depends on some theory B. The user wants to setup an extraction rule that uses A. The list of targets depends on the modules of A and B. So if B changes by introducing a new module, the extraction rule will be broken even if the change is backwards compatible.