kjosib / booze-tools

Booze Tools will become the complete programming-language development workbench, all written in Python 3.9 (for now).
MIT License
14 stars 1 forks source link

Investigate DFA compaction alternatives #8

Closed kjosib closed 5 years ago

kjosib commented 5 years ago

In any given DFA state, the two most-common transitions tend to account for the vast majority of them, but which these are may differ much from state to state. With this in mind, the DFA state transition matrix may be separated into three parts:

The first of these can be stored simply as a pair of lists, being now two cells per state.

The second may be effectively stored in a simple list of exceptions per state (now approximately two cells per exception, plus one per state), or a displacement table which can conveniently be probed in small constant time. Due to the very strongly Zipf-ian nature of the exception cases, such a table can generally be packed quite efficiently, yielding a near-minimal perfect-hash.

The third can be condensed quite effectively by equivalence classes among both rows and columns; null entries may be coalesced in any convenient direction. This adds a cell per row and another per column to store the equivalence class IDs. Finally the residue matrix must be stored. It, too, is strongly Zipf-ian: most rows will have relatively few bits set and a little care(*) guarantees that none will have more than half, so a convenient structure for quick probes is just the "offset" and "check" vectors from a typical displacement table. (Essentially, set membership is being tested.)

The structure so described is admittedly a tad more complex to understand than the existing approach, but it seems to merit investigation.

(*) The "care" alluded to is as follows: Column-equivalence is computed first. Then, any rows with more than half their bits set get their "most-common-two" swapped and the bits flipped, thus ensuring that indeed they will be mostly-zero. Finally, row-equivalence may be computed in the usual manner before constructing the offset/check structure by the "first-fit-decreasing" discipline.

kjosib commented 5 years ago

This idea was checked in on 7 May 2019, along with a small refinement respecting the fact that the usual most-common transition is to the error state.