kjosib / booze-tools

Booze Tools will become the complete programming-language development workbench, all written in Python 3.9 (for now).
MIT License
14 stars 1 forks source link

The "Bipartite Propagation Closure" approach can be generalized to a "monotone dataflow framework"-based approach. #48

Closed modulovalue closed 7 months ago

modulovalue commented 10 months ago

Hello again @kjosib!

This issue is about the following document: https://github.com/kjosib/booze-tools/blob/987cd90d040cc6eba2d46cebd85db0c091cc8692/blather/Bipartite%20Propagation.md?plain=1#L1

and the following implementation: https://github.com/kjosib/booze-tools/blob/987cd90d040cc6eba2d46cebd85db0c091cc8692/boozetools/parsing/context_free.py#L317

I think your approach puts you in a local maximum. In the sense that I think it would be much better not to restrict yourself to this special case, but to be more general.

The more general point of view here is to consider the problem you are trying to solve as an instance of a "dataflow over monotone frameworks" problem (that is, e.g., https://people.cs.vt.edu/ryder/516/sp03/lectures/DataflowAnal-5-218.pdf). What you're after is a set of nodes and a set of edges. Each node has a finite semilattice whose LUB becomes the join operation over its children.

This gives you a simple termination argument (because, e.g., the conjunction and disjunction lattices are finite and their LUB is monotone). And it gives you an efficient general purpose implementation via propagating dataflow facts in (reverse) topological order through the SCC-condensed graph (as you've already noticed somewhere else). The worklist approach for solving dataflow problems is really suboptimal and it is known to be one of the least efficient methods for doing so.

I'm successfully using this approach for finding nullables, firstsets, followsets, lalr-lookaheads (via DeRemer and Pennello which boils down to this approach), and other metrics introduced by https://arxiv.org/pdf/2209.08383.pdf. Looking at it strictly from a dataflow problem perspective has simplified things by a lot for me.

I'm mentioning this because you seem to be interested in pushing the state of the art. I think this is the way to go. Taking the SCC-route has many benefits: it becomes possible to answer the "why"-question by walking the processed graph in reverse, and you can be incremental, and you can merge graphs to be even more efficient...). It would be cool to see you go even further into that direction than you already have.

I'm currently in the process of translating your tainting approach/PGM to a dataflow problem, but while they both seem to be correct, I'm having issues with grokking both of them in their non-dataflow-style presentation.

modulovalue commented 7 months ago

I'm closing this as stale.

kjosib commented 7 months ago

I think by "local maximum" you mean that the "bipartite closure" notion solves a few problems elegantly, but it's not the most general or necessarily the most efficient. For the practical grammars I deal with, performance has been perfectly adequate. If you have a better idea and you can demonstrate the freedom to share it then I'd consider a pull request...

modulovalue commented 7 months ago

I think by "local maximum" you mean that the "bipartite closure" notion solves a few problems elegantly, but it's not the most general or necessarily the most efficient.

Yes. However, performance is just one benefit of an SCC-based approach. an SCC-based approach gives you the freedom to precisely explain why the results are what they are. Since error messages are notoriously bad when it comes to grammar tools, I believe this is a worthwhile endeavor.

If you have a better idea and you can demonstrate the freedom to share it then I'd consider a pull request...

Unfortunately, I'm not familiar enough with Python to construct an example. And I don't plan to learn Python anytime soon. However, a softer introduction to this idea can be found in: https://cs.au.dk/~amoeller/spa/spa.pdf. It starts with the idea of a "worklist" and improves it several times. It concludes with a remark that an SCC-based approach could be used to further improve the worklist approach, which is what I'm referring to.

kjosib commented 7 months ago

I'll have a look at the link. Thank you.

Booze-Tools does already use Strongly Connected Components for this other problem of discovering LALR follower-sets.

Error messages are indeed a challenge. It's one thing to work out a path through a mathematical model that reaches an anomaly. It's quite another task to report that information in a way that clarifies the problem for someone who has very deliberately delegated the majority of the model's clerical detail to the computer's purview, as is the case in parser generation. If this SCC approach helps that, then I may have to adopt it next time I'm in a mood to hack on Booze Tools.