kjosib / booze-tools

Booze Tools will become the complete programming-language development workbench, all written in Python 3.9 (for now).
MIT License

Unicode scanning issues #28

Open kjosib opened 4 years ago

kjosib commented 4 years ago

Right now, the scan engine works by first finding the equivalence class of a character and then looking up the <state, class> pair in a state transition table. Classical ASCII scanners worked a byte at a time, so in exchange for a 256-byte table you got the character class with a single indirect read.
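For concreteness, here is a minimal sketch of that classic byte-at-a-time scheme. The names (CLASS_OF_BYTE, delta, step) and the particular classes are illustrative, not booze-tools internals:

```python
CLASS_OF_BYTE = [0] * 256                                       # byte value -> character class
for b in range(ord('A'), ord('Z') + 1): CLASS_OF_BYTE[b] = 1    # letters
for b in range(ord('a'), ord('z') + 1): CLASS_OF_BYTE[b] = 1
for b in range(ord('0'), ord('9') + 1): CLASS_OF_BYTE[b] = 2    # digits

def step(delta, state, byte):
    """One scanner step: classify the byte, then index the transition table."""
    return delta[state][CLASS_OF_BYTE[byte]]
```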

Unicode compliance means working a code point at a time. Right now, the scan engine determines the character class by a bisection search followed by an array lookup. But most states only distinguish among very few character classes. For example, in a typical programming language, once you've scanned a letter you no longer care about the distinction between different kinds of punctuation until you've finished scanning the identifier. So in theory (and in practice, with most hand-written scanners) you have a separate set of particularly relevant class bounds for each state. Maybe most states don't care about most class bounds; maybe a subset of bounds handles most states, with a short list of exceptional states that need the full set.
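A rough model of the bisection-then-lookup step, with made-up boundary and class tables (the real tables are generated by the tool; the names here are hypothetical):

```python
import bisect

BOUNDS  = [48, 58, 65, 91, 97, 123]     # '0', ':', 'A', '[', 'a', '{'
CLASSES = [0,   2,  0,  1,  0,  1,  0]  # one more entry than BOUNDS

def classify(codepoint: int) -> int:
    """Bisection over the boundary list, then an array lookup."""
    return CLASSES[bisect.bisect_right(BOUNDS, codepoint)]

assert classify(ord('x')) == 1 and classify(ord('7')) == 2 and classify(ord('!')) == 0
```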

Upon reflection, this particular juice is not worth the squeeze. However, it may inspire some ways of dealing with the complexity of Unicode character classes. Ideally a Unicode application would rely on the system-wide Unicode database, but this means coming up with an intelligent way to label state transitions in the finite automaton models. Right now, the scanner-generator uses (essentially) a run-length encoded boolean vector of unbounded size to represent subsets of characters for edge labels. To get full Unicode compliance, this strategy would need a complete overhaul.
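For illustration, here is one plausible reading of the run-length-encoded boolean vector idea: alternating out/in run lengths starting at code point zero. The names are invented and this is not the actual booze-tools edge-label format:

```python
def contains(runs, codepoint):
    """True if the code point falls inside an 'in' run of the alternating vector."""
    inside, position = False, 0
    for length in runs:
        position += length
        if codepoint < position:
            return inside
        inside = not inside
    return inside   # past the last explicit run, the alternation continues

# {48..57, 65..90} as out/in run lengths: 48 out, 10 in, 7 out, 26 in.
DIGITS_AND_UPPER = [48, 10, 7, 26]
assert contains(DIGITS_AND_UPPER, ord('5')) and not contains(DIGITS_AND_UPPER, ord('@'))
```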

kjosib commented 3 years ago

Low-ASCII code-points have a lot of variety, so it makes sense that a 128-entry look-up table might yield a small speed improvement in the character classifier most of the time. Adding this, and comparing the results using the timeit module on the Pascal example, would be a fun exercise.

There are some bytes-oriented applications where a 256-entry look-up table would be completely sufficient.
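A sketch of the hybrid idea from the two paragraphs above: a 128-entry table serves the low-ASCII fast path, and the bisection classifier handles everything above it. The tables are the same made-up ones as before, purely illustrative:

```python
import bisect

BOUNDS  = [48, 58, 65, 91, 97, 123]
CLASSES = [0,   2,  0,  1,  0,  1,  0]
ASCII_TABLE = [CLASSES[bisect.bisect_right(BOUNDS, cp)] for cp in range(128)]

def classify(codepoint: int) -> int:
    if codepoint < 128:
        return ASCII_TABLE[codepoint]    # single indexed read, no search
    return CLASSES[bisect.bisect_right(BOUNDS, codepoint)]
```

A bytes-oriented application could use a 256-entry table alone and skip the fallback entirely; a timeit comparison of this against the pure-bisection version on the Pascal example would show whether the fast path pays for itself.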

For non-ASCII Unicode, the current approach to character class representation in the NFA is just not right. The question, though, is how to correctly perform set operations (intersect, union, complement) on the kinds of character classes which Unicode defines for REGEX purposes. Won't be solved tonight, but will be solved eventually.
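One plausible direction (a sketch, not a commitment to any particular booze-tools design) is to represent each class as a sorted list of disjoint code-point ranges; union, intersection, and complement then reduce to simple boundary merges:

```python
# Character classes as sorted lists of disjoint, inclusive (lo, hi) code-point ranges.
MAX_CODEPOINT = 0x10FFFF

def complement(ranges):
    """Everything in [0, MAX_CODEPOINT] not covered by `ranges`."""
    result, next_free = [], 0
    for lo, hi in ranges:
        if lo > next_free:
            result.append((next_free, lo - 1))
        next_free = hi + 1
    if next_free <= MAX_CODEPOINT:
        result.append((next_free, MAX_CODEPOINT))
    return result

def union(a, b):
    """Merge two range lists, coalescing overlapping or adjacent ranges."""
    result = []
    for lo, hi in sorted(a + b):
        if result and lo <= result[-1][1] + 1:
            result[-1] = (result[-1][0], max(result[-1][1], hi))
        else:
            result.append((lo, hi))
    return result

def intersect(a, b):
    """De Morgan: a & b == ~(~a | ~b)."""
    return complement(union(complement(a), complement(b)))

# Example: Latin letters (plus one non-ASCII block) intersected with the ASCII range.
letters = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6)]
assert intersect(letters, [(0x00, 0x7F)]) == [(0x41, 0x5A), (0x61, 0x7A)]
```

The Unicode regex classes (\p{L}, \p{Nd}, and so on) could then be supplied as range lists built from the system-wide Unicode database and fed through the same three operations.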