Open snomos opened 1 year ago
More about the second case:
The reason we want such base letter + combining diacritic sequences as multichar symbols in all other cases is that it makes life easier to treat things that look like single letters as single symbols, even when the underlying Unicode is not a single code point. Only hfst-tokenise has issues with this, because of the task at hand.
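To illustrate the point above, here is a minimal Python sketch that groups each base letter with the combining marks that follow it, so that one visible letter becomes one symbol. The function name `group_symbols` is hypothetical and this is not how the HFST tools do it; it only shows why a "single letter" can be more than one code point.

```python
import unicodedata


def group_symbols(text):
    """Group each base character with the combining marks that follow it.

    Illustration only: relies on the canonical combining class, so marks
    with class 0 (e.g. enclosing marks) are not attached to their base.
    """
    symbols = []
    for ch in text:
        if symbols and unicodedata.combining(ch):
            # Non-zero combining class: attach to the preceding symbol.
            symbols[-1] += ch
        else:
            symbols.append(ch)
    return symbols


# Cyrillic a + combining macron has no precomposed form in Unicode,
# so even NFC normalisation leaves it as two code points.
a_macron = unicodedata.normalize("NFC", "\u0430\u0304")
print(len(a_macron))                    # number of code points
print(len(group_symbols(a_macron)))     # number of symbols after grouping
```

With the grouping applied, the two code points count as one symbol, which is the behaviour wanted everywhere except in hfst-tokenise.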
Excellent work in commit https://github.com/giellalt/giella-core/commit/573af75a2a750566843c8d40e1438f2f6f8a316e. Only problem is: it fails on macOS, probably due to a different version of awk or sed.
Another comment: would it be possible to filter out all non-diacritic characters, to avoid both noise and extra CPU time when compiling and composing the generated regex?
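One possible way to do such filtering is sketched below in Python. It assumes that "diacritic" can be approximated by the Unicode general categories for marks (Mn, Mc, Me). The function name `keep_combining_marks` is hypothetical, and the actual filter scripts in giella-core may use a different criterion.

```python
import unicodedata


def keep_combining_marks(chars):
    """Return the sorted, de-duplicated combining marks found in `chars`.

    Assumption: a character counts as a diacritic if its Unicode general
    category starts with "M" (Mn, Mc or Me). Everything else is dropped.
    """
    marks = {ch for ch in chars if unicodedata.category(ch).startswith("M")}
    return sorted(marks)


# Base letters are dropped, only the combining acute and diaeresis remain.
for mark in keep_combining_marks("a\u0301b\u0308c"):
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}")
```

Feeding only these marks into the regex generation would keep the generated expression small.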
> ... it fails on macOS, probably due to a different version of awk or sed.

If this turns out to be a problem, it reminds me of a similar situation, which I solved by installing gsed: /usr/bin/sed /usr/local/bin/gsed
I changed awk to gawk, but not the sed command yet; I think GNU sed is also used. We have a configure script in langs that checks for this, which could be useful, but I am not sure everyone even runs configure in core.
One has to run ./configure in core to get the correct version info. But the tools found in core don't automatically carry over to each language, so we need to run the same check there.
Mmh, I now have some checks in core for GNU sed and gawk, and the Unicode filter scripts use the configured programs.
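For readers who want to see what such a check amounts to, here is a rough Python sketch of the idea. It is not the configure code in giella-core: the function names `looks_gnu` and `find_gnu_tool` are hypothetical, and it assumes that a GNU tool mentions "GNU" in its `--version` output, which holds for GNU sed and gawk as far as I know.

```python
import shutil
import subprocess


def looks_gnu(path):
    """Return True if `path --version` succeeds and mentions GNU.

    Assumption: BSD sed rejects --version and BSD awk does not print "GNU".
    """
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return "GNU" in result.stdout


def find_gnu_tool(candidates, which=shutil.which, is_gnu=looks_gnu):
    """Return the path of the first candidate that exists and looks GNU."""
    for name in candidates:
        path = which(name)
        if path and is_gnu(path):
            return path
    return None


# Try the GNU-specific names first, then fall back to the plain ones.
print("sed:", find_gnu_tool(["gsed", "sed"]))
print("awk:", find_gnu_tool(["gawk", "awk"]))
```

On a Mac with only the system tools installed, both lookups would be expected to return `None`, which is the signal that gsed or gawk needs to be installed.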
This covers two distinct cases:
In the first case, the pseudocode could go something like this:
In the second case, the pseudocode could be something like the following:
With routines like the above integrated into the build system, no-one should ever have to worry about these issues anymore 🙂