Closed notgull closed 1 year ago
I'm not sure I would call the `nom` variant more maintainable. It's complete gibberish to me. And performance is a big question as well.

If you have time, I would suggest compiling master `ragel` (which is an autotools hell, and you need colm as well) and trying its Rust output generator. I tried it a couple of weeks ago and it was failing with a cryptic error.

But I think that modifying the hb ragel files to output Rust code directly via ragel is far better from a maintainability standpoint.
Superseded by #77
The `_machine.rs` files are unmaintainable, as they are a 1:1 handwritten mirror of the machine-generated state machine files for HarfBuzz. My goal is to make this crate more maintainable by writing these files in such a way that they correspond with the `.rl` files. This way, changes to the `.rl` files can be easily reflected in `rustybuzz`.

My original attempt to replace the `indic_machine.rs` file, however, has not passed the tests. I am not familiar with `ragel` itself and I am mystified by its semantics. As there isn't a `ragel` chat room where I can ask for help, as far as I know, I figure that this is the best place to ask.

I've focused on one specific test, `indic_old_spec_003`. The input to the parser looks like this:

The parser for the `consonant_syllable` rule looks like this:

Source
For this rule, the first `(Repha|CS)?` block evaluates to no input, as the first item is `C`, which is neither `Repha` nor `CS`. The next item is `cn`, which matches the `C` tag and consumes it. The next tag is `H`, which doesn't match `ZWJ` or `n`, so the `cn` rule completes and we move on to `complex_syllable_tail`.

The `complex_syllable_tail` rule starts with `(halant_group.cn)*`, which would initially match the `H`, but there is no `C` following it, so this rule evaluates to no input. The next rule, `medial_group`, is `CM?`. As the current input is `H`, `CM` doesn't match, so this rule evaluates to no input as well. `halant_or_matra_group` goes to `final_halant_group`, which goes to `halant_group`, which matches `H` and nothing else, consuming it. Finally, `syllable_tail` matches the lack of input at the end. Therefore the range `0..2`
is classified as a consonant syllable.

Then, the next item on the chopping block is `CM`. Out of all the rules, this matches the `complex_syllable_tail` part of `broken_cluster`, along with the `H`. Therefore `2..4` is a broken cluster. Finally, the `X` at the end becomes a non-Indic character.

However, this fails the test. After wiring some telemetry into the current
`rustybuzz` master, I've found that it classifies the range `0..4` as a consonant syllable and the `4..5` range as non-Indic. I'm not sure how it does this; it looks as if the `H` is somehow consumed by either the `cn` or the `halant_group.cn` before the `CM` is consumed by the `medial_group`. Unless `ragel`'s semantics are wildly different from what I understand, it is unclear to me how this would happen.

Is there anyone who knows `ragel` well enough to help my understanding of it here?

cc #74, @bluebear94
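
For reference, the hand evaluation above can be checked mechanically. The following is a minimal Rust sketch, not `rustybuzz` or HarfBuzz code, and every name in it (`Cat`, `Pat`, `ends`, `longest`) is hypothetical. It makes several assumptions: the grammar is reduced to the parts the walkthrough mentions (the `n` part of `cn`, `matra_group` and the real `syllable_tail` are left out, and `halant_or_matra_group` is treated as an optional `H`), and the category sequence `C H CM H X` is inferred from the walkthrough rather than copied from the test. Instead of committing to the first alternative that succeeds, it collects every index at which a rule could stop and then takes the largest one.

```rust
use std::collections::BTreeSet;

/// Categories named in the walkthrough (a subset, for illustration only).
#[derive(Clone, Copy, PartialEq, Debug)]
enum Cat { C, H, CM, ZWJ, Repha, CS, X }

/// A tiny regular-expression AST over categories.
enum Pat {
    Tok(Cat),
    Seq(Vec<Pat>),
    Alt(Vec<Pat>),
    Opt(Box<Pat>),
    Star(Box<Pat>),
}

/// Every index at which `p` can stop when it starts at `start`.
fn ends(p: &Pat, input: &[Cat], start: usize) -> BTreeSet<usize> {
    match p {
        Pat::Tok(c) => {
            if input.get(start) == Some(c) {
                BTreeSet::from([start + 1])
            } else {
                BTreeSet::new()
            }
        }
        Pat::Seq(ps) => {
            // Thread the set of reachable positions through each part in turn.
            let mut cur = BTreeSet::from([start]);
            for q in ps {
                cur = cur.iter().flat_map(|&s| ends(q, input, s)).collect();
            }
            cur
        }
        // Union of all alternatives, not "first one that succeeds".
        Pat::Alt(ps) => ps.iter().flat_map(|q| ends(q, input, start)).collect(),
        Pat::Opt(q) => {
            let mut out = ends(q, input, start);
            out.insert(start); // matching nothing is always allowed
            out
        }
        Pat::Star(q) => {
            // Zero or more repetitions; stop expanding positions already seen.
            let mut out = BTreeSet::from([start]);
            let mut frontier = vec![start];
            while let Some(s) = frontier.pop() {
                for e in ends(q, input, s) {
                    if out.insert(e) {
                        frontier.push(e);
                    }
                }
            }
            out
        }
    }
}

/// Longest possible match of `p` starting at `start`, if any.
fn longest(p: &Pat, input: &[Cat], start: usize) -> Option<usize> {
    ends(p, input, start).into_iter().max()
}

fn opt(p: Pat) -> Pat { Pat::Opt(Box::new(p)) }

// Assumption: `cn` reduced to `C ZWJ?`; the `n` part is omitted.
fn cn() -> Pat { Pat::Seq(vec![Pat::Tok(Cat::C), opt(Pat::Tok(Cat::ZWJ))]) }

// Assumption: `halant_group` reduced to a bare `H`.
fn halant_group() -> Pat { Pat::Tok(Cat::H) }

fn complex_syllable_tail() -> Pat {
    Pat::Seq(vec![
        Pat::Star(Box::new(Pat::Seq(vec![halant_group(), cn()]))), // (halant_group.cn)*
        opt(Pat::Tok(Cat::CM)),                                    // medial_group = CM?
        opt(halant_group()), // halant_or_matra_group, reduced to an optional H
        // syllable_tail is treated as matching empty input
    ])
}

fn consonant_syllable() -> Pat {
    Pat::Seq(vec![
        opt(Pat::Alt(vec![Pat::Tok(Cat::Repha), Pat::Tok(Cat::CS)])), // (Repha|CS)?
        cn(),
        complex_syllable_tail(),
    ])
}

fn main() {
    // Category sequence inferred from the walkthrough, not taken from the test file.
    let input = [Cat::C, Cat::H, Cat::CM, Cat::H, Cat::X];
    println!("consonant_syllable from 0: {:?}", longest(&consonant_syllable(), &input, 0));
    println!("complex_syllable_tail from 2: {:?}", longest(&complex_syllable_tail(), &input, 2));
}
```

Under these assumptions the sketch reports `Some(2)` for `consonant_syllable` starting at index 0 and `Some(4)` for `complex_syllable_tail` starting at index 2, which agrees with the hand evaluation. It says nothing about why master reports `0..4`, since the real `.rl` rules and category assignments contain more than this paraphrase.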