Open reynoldsnlp opened 3 years ago
This range causes problems: "-" ! U+E000 - U+E033
Its output:
make[3]: Entering directory '/home/echols14/dight-390r/quenya-giellalt/tools/tokenisers'
HXFST analyser-disamb-gt-desc.hfst
HSUBST analyser_relabelled-disamb-gt-desc.hfst
HXFST analyser-url-gt-desc.hfst
HPM2FST tokeniser-disamb-gt-desc.pmhfst
HfstException: pmatch parsing failed: Could not parse range expression: "-"
*** parsing dings. Empty readings are also
!! legal in CG, they get a d... [truncated] at line 117 near "-"
rm analyser_relabelled-disamb-gt-desc.hfst analyser-url-gt-desc.hfst analyser-disamb-gt-desc.hfst
make[3]: Leaving directory '/home/echols14/dight-390r/quenya-giellalt/tools/tokenisers'
Doing those two endpoint characters individually works fine ({}|{}).
I have not tried doing every individual char in the range, but I could if that would be helpful.
I just tried it with every char individually, and it does then work.
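For anyone else who needs this workaround, writing out every character by hand gets tedious, so here is a minimal Python sketch that generates the alternation. The function name `pmatch_alternation` is made up for illustration, and it assumes that the `{c}|{c}` form used above is all that is needed. It does not escape characters that may be special inside a pmatch literal, so check the output before pasting it into the .pmscript file.

```python
def pmatch_alternation(first: int, last: int) -> str:
    """Build '{c}|{c}|...' for every code point from first to last, inclusive.

    Hypothetical helper: no escaping is done, so it is only suitable for
    ranges whose characters need none (an assumption, not checked here).
    """
    if first > last:
        raise ValueError("first must not exceed last")
    return "|".join("{" + chr(cp) + "}" for cp in range(first, last + 1))


if __name__ == "__main__":
    # The range from this issue: U+E000 - U+E033.
    print(pmatch_alternation(0xE000, 0xE033))
```

The printed line can then be used in place of the range expression that fails to parse.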
I think this was by design, in the original pmatch at least. Unicode code point ranges are often surprising and not too useful; for example, the [a-ö] range that many people around here might use will not be what most who write it expect.
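To illustrate the [a-ö] point, here is a small Python sketch that treats the range as a plain code point interval. That is only an assumption about how such a range would be read; this is Python's standard library, not pmatch. The helper name `code_point_range` is invented for the example.

```python
import unicodedata


def code_point_range(first: str, last: str) -> list[str]:
    """All characters whose code points lie between first and last, inclusive."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


chars = code_point_range("a", "ö")

# Characters in the interval whose Unicode general category is not a letter.
non_letters = [c for c in chars if not unicodedata.category(c).startswith("L")]
print(len(chars), "code points,", len(non_letters), "of them not letters")

# Probe a few characters a writer of [a-ö] might or might not expect.
for probe in "{×Aøš":
    print(repr(probe), "in range:", probe in chars)
```

The interval picks up symbols such as `{` and `×` and the C1 control characters, while leaving out uppercase ASCII and letters such as ø and š that sit above ö.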
I have discussed support for various Unicode character classes with the original author. Right now pmatch does not have very good Unicode support; it is a known pain point. Linking to ICU or some such would be needed, plus an update of some functions to properly support Unicode character classes.
Being able to define all punctuation by just referring to such a class would be very useful, as would proper case folding for all writing systems with case distinctions (presently e.g. Cyrillic does not fold properly, like most non-ASCII letters).
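As a rough picture of what class-based punctuation and full case folding give you, here is a sketch using Python's standard `unicodedata` module and `str.casefold`. It is not pmatch or ICU code, and the two helper names are made up for the example.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    """True if the character's Unicode general category is one of the P* classes."""
    # Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    return unicodedata.category(ch).startswith("P")


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding rather than ASCII-only rules."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # U+104A is MYANMAR SIGN LITTLE SECTION.
    for ch in [",", "«", "\u104a", "a"]:
        print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
    print(same_ignoring_case("Москва", "МОСКВА"))
    print(same_ignoring_case("Š", "š"))
```

With a class like this, the Myanmar punctuation marks would be covered without anyone having to list them or write a range.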
All the binary builds of HFST are linked against ICU. And other tools in the ecosystem require ICU anyway. I would strongly argue for just dropping HFST's non-ICU mode so that things like this can be implemented.
Some of my students are getting errors when they try to define Unicode ranges in tools/tokenisers/tokeniser-disamb-gt-desc.pmscript. Specifically, Myanmar and Tengwar ranges have been problematic. @rkechols could you paste the output that you get when you try to declare the Tengwar range (for lang-qya)? Can those characters be declared individually (each inside curly braces {}) as a workaround?