rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

Create an FSA/RegEx compiler framework and use it in the document parsers #21

Open rhdunn opened 12 years ago

rhdunn commented 12 years ago

Currently, the document parsers (specifically the xml and rtf parsers/tokenizers) are hand-written. These work, but we can do better.

I want to use regular expressions for handling the dictionary word and rule matching engine.

There are also more advanced uses, such as identifying chapters, parts, sections, paragraphs and authors in plain text documents, and handling the different pipelines (e.g. iterating over words or over UTF-8 characters).

The basic pipeline is:

RE => IR => optimize => FSA => IR => optimize => ( machine code | interpreter )

RE : Regular Expression
IR : Intermediate Representation / Intermediate Language
FSA : Finite State Automata
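As a rough illustration of the pipeline above, here is a minimal Python sketch of the RE => IR => FSA => interpreter path. It is not the engine's design: the function names (`parse`, `to_nfa`, `closure`, `matches`), the nested-tuple IR and the choice of Thompson's construction are all assumptions made for this example. Both optimize stages and the machine code backend are left out, and the FSA is interpreted directly as an NFA rather than being lowered to a second IR.

```python
from itertools import count

def parse(pattern):
    """RE => IR: build a nested-tuple syntax tree (literals, |, *, parentheses)."""
    pos = 0

    def peek():
        return pattern[pos] if pos < len(pattern) else None

    def alternation():
        nonlocal pos
        node = concatenation()
        while peek() == "|":
            pos += 1
            node = ("alt", node, concatenation())
        return node

    def concatenation():
        node = ("empty",)
        while peek() is not None and peek() not in "|)":
            item = repetition()
            node = item if node == ("empty",) else ("cat", node, item)
        return node

    def repetition():
        nonlocal pos
        ch = peek()
        pos += 1
        if ch == "*":
            raise ValueError("nothing to repeat")
        if ch == "(":
            node = alternation()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
        else:
            node = ("lit", ch)
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    tree = alternation()
    if pos != len(pattern):
        raise ValueError(f"unexpected {pattern[pos]!r} at offset {pos}")
    return tree

def to_nfa(tree):
    """IR => FSA: Thompson construction; a None label is an epsilon move."""
    edges, fresh = {}, count()

    def build(node):
        start, end = next(fresh), next(fresh)
        edges[start], edges[end] = [], []
        kind, children = node[0], node[1:]
        if kind == "empty":
            edges[start].append((None, end))
        elif kind == "lit":
            edges[start].append((children[0], end))
        elif kind == "cat":
            (s1, e1), (s2, e2) = build(children[0]), build(children[1])
            edges[start].append((None, s1))
            edges[e1].append((None, s2))
            edges[e2].append((None, end))
        elif kind == "alt":
            for s, e in map(build, children):
                edges[start].append((None, s))
                edges[e].append((None, end))
        elif kind == "star":
            s, e = build(children[0])
            edges[start] += [(None, s), (None, end)]
            edges[e] += [(None, s), (None, end)]
        return start, end

    start, accept = build(tree)
    return edges, start, accept

def closure(edges, states):
    """Every NFA state reachable from `states` through epsilon moves only."""
    seen, stack = set(states), list(states)
    while stack:
        for symbol, target in edges[stack.pop()]:
            if symbol is None and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)

def matches(nfa, text):
    """Interpreter: track the set of live states across the whole input."""
    edges, start, accept = nfa
    current = closure(edges, {start})
    for symbol in text:
        step = {t for s in current for sym, t in edges[s] if sym == symbol}
        current = closure(edges, step)
    return accept in current

nfa = to_nfa(parse("(ab|c)*d"))
print(matches(nfa, "abcd"), matches(nfa, "ab"))  # True False
```

Because `matches` only iterates over its input, the same automaton could in principle be fed characters, bytes or whole words, which is what makes a single shared pipeline attractive.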

The advantages of this are:
1/ there is no hand-written code (specifically lexer machinery) dealing directly with strings/string matching [*];
2/ all string processing code goes through the same pipeline (code reuse);
3/ the RE/FSA pipeline gets extra verification/coverage;
4/ the code is easier to maintain;
5/ the parsers take advantage of the machine code machinery, so they are fast;
6/ optimizations made to the RE/FSA pipeline improve the performance of the document parsers.

[*] There may be hand-constructed FSA or RE/IR objects.
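To make the footnote concrete, the following Python sketch shows what a hand-constructed FSA might look like when it is written as a plain transition table instead of being compiled from an RE. Everything in it is hypothetical: the `run` helper, the `HEADING` and `INTEGER` tables, the classifier functions and the choice of heading keywords are invented for illustration and do not come from the engine. The same runner is driven once over words and once over decoded UTF-8 characters, mirroring the two iteration modes mentioned earlier.

```python
def run(table, start, accepting, classify, symbols):
    """Drive a hand-built transition table over any iterable of symbols."""
    state = start
    for symbol in symbols:
        # A missing (state, class) entry means the automaton rejects.
        state = table.get((state, classify(symbol)))
        if state is None:
            return False
    return state in accepting

# Hypothetical word-level FSA: a heading keyword followed by one number.
HEADING = {
    ("start", "keyword"): "keyword",
    ("keyword", "number"): "number",
}

def classify_word(word):
    if word.lower() in {"chapter", "part", "section"}:
        return "keyword"
    return "number" if word.isdigit() else "other"

def is_heading(line):
    return run(HEADING, "start", {"number"}, classify_word, line.split())

# Hypothetical character-level FSA: one or more digits.
INTEGER = {
    ("start", "digit"): "digits",
    ("digits", "digit"): "digits",
}

def classify_char(ch):
    return "digit" if ch.isdigit() else "other"

def is_integer(data):
    # `data` is UTF-8 encoded bytes; decode first so we iterate characters.
    return run(INTEGER, "start", {"digits"}, classify_char, data.decode("utf-8"))

print(is_heading("Chapter 12"), is_heading("Chapter twelve"))  # True False
print(is_integer(b"2024"), is_integer(b"20x4"))                # True False
```

Keeping the table as data rather than as branching code is what would let such hand-built automata share the optimizer and backends with compiled regular expressions.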
