Minimise the DFA before writing the state machine

SquidDev commented 1 year ago

This runs Hopcroft's DFA minimisation algorithm on Lrgrep_support's DFA before generating generating the state machine. In the case of the OCaml example, this reduces the number of states from 2854 to 440 (with a corresponding reduction in file size).

While I've been using this on my local fork without issue, it's probably still a long way from production ready. It's quite slow (~400ms to run on the OCaml example), though not quite sure how to improve that.

I realise this project is on the back-burner right now, so happy if you want to close this!

let-def commented 1 year ago

Nice! I experimented with various ways to compact the DFA a long time ago, I tried DFA minimization but in the end I only kept the compact representation of transitions (explained here: https://def.lakaban.net/posts/2022-04-14-even-more-compact-lexer-table/ , not sure if this is common knowledge? I didn't find any article describing this).

Then I moved my focus to the more pressing issue of the DFA construction :) (reduction simulation was full of bug at that time... I hope is a bit more robust know :P).

Performance should not be a problem: I implemented Valmari minimization algorithm at that time (https://www.cs.cmu.edu/~cdm/papers/Valmari12.pdf), a variant of Hopcroft that is faster for large alphabets (afaik, when fixing other factors, Hopcroft complexity grows linearly wrt the size of the alphabet, Valmari grows logarithmically). Early tests indicate a reasonable speed-up, though I am not confident in my implementation :-)

let-def commented 1 year ago

I realise this project is on the back-burner right now, so happy if you want to close this!

Well, since mid-January I am focusing on this project :). I spent three weeks trying to understand what I call the "error space", the languages of stack that fails. They have a regular structure, so we can enumerate them. But the automaton is way too complex (for OCaml language) for a human to grok, so I am looking at other ways to make sense of it. The goal is to improve the tooling that checks for coverage and redundancy.

I will share my roadmap soon to keep you posted on what will come next!

let-def commented 1 year ago

(And actually 400ms is fine, the benefits of a smaller automaton outweigh this cost).

SquidDev commented 1 year ago

I experimented with various ways to compact the DFA a long time ago, I tried DFA minimization but in the end I only kept the compact representation of transitions (explained here: https://def.lakaban.net/posts/2022-04-14-even-more-compact-lexer-table/ , not sure if this is common knowledge? I didn't find any article describing this).

I must confess, most of my knowledge here comes from what I remember from university, so is definitely a little spartan! That (and your other article on lexer tables) was very helpful, thank you!

I implemented Valmari minimization algorithm at that time (https://www.cs.cmu.edu/~cdm/papers/Valmari12.pdf), a variant of Hopcroft that is faster for large alphabets

Oh, this sounds ideal. I did try to keep transitions using sets of input symbols, but that's still going to behave badly in the worst case.

Definitely echo your concerns about the correctness of implementation though. Just encountered a issue with this patch, where states with non-identical halting sets are considered equivalent. I think this is easily fixable (take the union of halting sets for this equivalence class as the new halting set), but now much less confident with the rest of my code!

I will share my roadmap soon to keep you posted on what will come next!

Thank you! I realise everything is still in the research stage, but excited to see this develop! It's been such a pleasure to use :).

let-def commented 1 year ago

I am closing the PR, the automaton is now minimized in master. Thanks for your help.

let-def / lrgrep

Minimise the DFA before writing the state machine #7