giellalt / template-lang-und

A template repo for new languages, as well as to update existing language repos with.
https://giellalt.uit.no/
GNU Lesser General Public License v3.0

pmscript alphabet chokes on some Unicode character ranges #14

Open reynoldsnlp opened 3 years ago

reynoldsnlp commented 3 years ago

Some of my students are getting errors when they try to define Unicode ranges in tools/tokenisers/tokeniser-disamb-gt-desc.pmscript. Specifically, the Myanmar and Tengwar ranges have been problematic.

@rkechols could you paste the output that you get when you try to declare the Tengwar range (for lang-qya)? Can those characters be declared individually (each inside curly braces {}) as a workaround?

rkechols commented 3 years ago

This range causes problems:

"-" ! U+E000 - U+E033

Its output:

make[3]: Entering directory '/home/echols14/dight-390r/quenya-giellalt/tools/tokenisers'
  HXFST    analyser-disamb-gt-desc.hfst
  HSUBST   analyser_relabelled-disamb-gt-desc.hfst
  HXFST    analyser-url-gt-desc.hfst
  HPM2FST  tokeniser-disamb-gt-desc.pmhfst
HfstException: pmatch parsing failed: Could not parse range expression: "-"
*** parsing dings. Empty readings are also
!! legal in CG, they get a d... [truncated] at line 117 near "-"

rm analyser_relabelled-disamb-gt-desc.hfst analyser-url-gt-desc.hfst analyser-disamb-gt-desc.hfst
make[3]: Leaving directory '/home/echols14/dight-390r/quenya-giellalt/tools/tokenisers'

Doing those two endpoint characters individually works fine ({}|{}). I have not tried doing every individual char in the range, but I could if that would be helpful.

rkechols commented 3 years ago

I just tried it with every char individually, and it does then work.
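
For anyone else who needs this workaround, typing every character by hand gets tedious for larger ranges. Below is a minimal sketch of a helper that generates the alternation, assuming only the `{x}|{y}` form already used in this thread. The function name `pmatch_alternation` is made up for illustration, and the sketch does not escape characters that may be special inside a pmscript literal.

```python
def pmatch_alternation(first: int, last: int) -> str:
    """Return '{c1}|{c2}|...' for every code point from first to last, inclusive.

    Hypothetical helper: it only mirrors the {x}|{y} workaround from this
    thread and does not escape characters that are special in pmscript.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    # Wrap each character in curly braces and join the pieces with '|'.
    return "|".join("{" + chr(cp) + "}" for cp in range(first, last + 1))


# Small ASCII demonstration so the result is easy to read.
print(pmatch_alternation(0x41, 0x43))  # prints {A}|{B}|{C}
```

For the range reported above, the call would be `pmatch_alternation(0xE000, 0xE033)`, with the result pasted into the pmscript file in place of the range expression.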

flammie commented 3 years ago

I think this was by design, in the original pmatch at least. Unicode code point ranges are often surprising and not too useful; for example, the [a-ö] range that many people around here might use will not be what most who write it expect.
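
To make the [a-ö] point concrete, here is a small sketch that lists what a pure code point range from "a" to "ö" would cover. It uses Python's standard `unicodedata` module, not pmatch itself, so it only illustrates code point order and says nothing about how pmatch parses ranges.

```python
import unicodedata


def code_point_range(first: str, last: str) -> list:
    """All characters whose code points lie between first and last, inclusive."""
    return [chr(cp) for cp in range(ord(first), ord(last) + 1)]


chars = code_point_range("a", "ö")

# Everything in the range that is not a letter (punctuation, symbols, controls).
non_letters = [c for c in chars if not unicodedata.category(c).startswith("L")]
# Uppercase letters that a "lowercase a to ö" range silently picks up.
uppercase = [c for c in chars if unicodedata.category(c) == "Lu"]

print(len(chars))                  # prints 150
print("ø" in chars, "š" in chars)  # prints False False
print("{" in chars, "×" in chars)  # prints True True
print("".join(uppercase))
```

So the range includes brackets, control characters, symbols such as × and the uppercase letters from À onwards, while leaving out ø and š, which sit at higher code points.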

snomos commented 3 years ago

I have discussed support for various Unicode character classes with the original author. Right now pmatch does not have very good Unicode support; it is a known pain point. Linking to ICU or some such would be needed, plus an update of some functions to properly support Unicode character classes.

Being able to define all punctuation by just referring to such a class would be very useful, as would proper case folding for all writing systems with case distinctions (at present, Cyrillic, for example, does not fold properly, and neither do most other non-ASCII letters).
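
As a rough illustration of the two features described here, the sketch below contrasts ASCII-only lowercasing with Unicode-aware case folding, and picks out punctuation by Unicode general category. It relies on Python's built-in Unicode data rather than ICU, and the helper names `ascii_only_lower` and `is_punctuation` are made up for the example; none of this reflects how pmatch is or would be implemented.

```python
import unicodedata


def ascii_only_lower(text: str) -> str:
    """Lowercase A-Z only and leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def is_punctuation(char: str) -> bool:
    """True if the character's Unicode general category is one of the P* classes."""
    return unicodedata.category(char).startswith("P")


word = "Москва"
# ASCII-only folding leaves the Cyrillic capital letter unchanged.
print(ascii_only_lower(word) == word)  # prints True
# Unicode-aware folding handles it.
print(word.casefold())                 # prints москва

# A category-based class catches punctuation beyond the ASCII set.
print([c for c in "«a», b… c!" if is_punctuation(c)])
```

The last line picks up the guillemets and the ellipsis along with the comma and exclamation mark, which is the kind of coverage a single punctuation class would give.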

TinoDidriksen commented 3 years ago

All the binary builds of HFST are linked against ICU. And other tools in the ecosystem require ICU anyway. I would strongly argue for just dropping HFST's non-ICU mode so that things like this can be implemented.