Open danieldk opened 5 years ago
In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation signs to ASCII:
https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs
This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.
This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
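To make the punctuation-rewriting step concrete, here is a minimal sketch using only the standard library. The function names and the replacement table are made up for illustration and are not taken from conllx-cleanup; the Unicode normalization step is left out because it needs an external crate (such as unicode-normalization).

```rust
/// Map one character to an ASCII replacement, if this sketch knows one.
/// The table is illustrative only; it is NOT the table used by conllx-cleanup.
fn ascii_punctuation(c: char) -> Option<&'static str> {
    match c {
        // Single quotation marks
        '\u{2018}' | '\u{2019}' | '\u{201A}' | '\u{201B}' => Some("'"),
        // Double quotation marks and guillemets
        '\u{201C}' | '\u{201D}' | '\u{201E}' | '\u{201F}' | '\u{00AB}' | '\u{00BB}' => Some("\""),
        // Hyphens and dashes
        '\u{2010}' | '\u{2011}' | '\u{2012}' | '\u{2013}' | '\u{2014}' | '\u{2015}' => Some("-"),
        // Horizontal ellipsis expands to three ASCII dots
        '\u{2026}' => Some("..."),
        _ => None,
    }
}

/// Rewrite known non-ASCII punctuation to ASCII and leave every other
/// character (including non-ASCII letters) untouched.
fn asciify_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match ascii_punctuation(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

Only punctuation is touched, so letters such as ö or ß pass through unchanged. A real tool would normalize first, so that equivalent encodings of the same character hit the same table entry.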
If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/ which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling a lot of that work is yet to be done in the Rust NLP space.
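As a rough illustration of the kind of replacement meant here, the sketch below spells out whitespace-separated number tokens from 0 to 999 and a few standalone symbols. It does not use the english-numbers crate; `cardinal`, `symbol_name`, `verbalize`, and the symbol table are hypothetical names and choices made for this example, and ordinals are not handled.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
];

/// Spell out a cardinal number in English; returns None above 999.
fn cardinal(n: u16) -> Option<String> {
    let words = match n {
        0..=19 => ONES[n as usize].to_string(),
        20..=99 => {
            let tens = TENS[(n / 10) as usize];
            match n % 10 {
                0 => tens.to_string(),
                r => format!("{}-{}", tens, ONES[r as usize]),
            }
        }
        100..=999 => {
            let head = format!("{} hundred", ONES[(n / 100) as usize]);
            match n % 100 {
                0 => head,
                r => format!("{} {}", head, cardinal(r)?),
            }
        }
        _ => return None,
    };
    Some(words)
}

/// Textual equivalent for a standalone symbol token (illustrative table).
fn symbol_name(token: &str) -> Option<&'static str> {
    match token {
        "$" => Some("dollars"),
        "%" => Some("percent"),
        "&" => Some("and"),
        _ => None,
    }
}

/// Replace number and symbol tokens; all other tokens are kept as-is.
/// Tokens are split on whitespace and re-joined with single spaces.
fn verbalize(text: &str) -> String {
    text.split_whitespace()
        .map(|token| {
            if let Some(name) = symbol_name(token) {
                name.to_string()
            } else if let Some(words) = token.parse::<u16>().ok().and_then(cardinal) {
                words
            } else {
                token.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

Tokens where a symbol is attached to a number (for example "$5") are left untouched by this sketch; handling them would need a real tokenizer or a rule-based normalizer.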
For FST-based text normalization, I re-implemented the C++ library OpenFST entirely in Rust: https://github.com/Garvys/rustfst (and it performs better than OpenFST 😅). Might prove useful.
Assuming this also includes text pre-processing,
Unicode normalization
Case folding
str::to_ascii_lowercase
ASCII-only conversion to lowercase: non-ASCII characters are left unchanged. Fast, and can be done in place (with str::make_ascii_lowercase).
str::to_lowercase
Unicode-aware conversion to lowercase. It can change the length of the string (some characters expand into multiple characters when changing case), cannot be done in place, and is relatively slow.
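The difference between the two approaches can be sketched with the standard library alone. The wrapper function names below are made up for illustration; the underlying calls are str::make_ascii_lowercase and str::to_lowercase.

```rust
/// ASCII-only lowercasing, performed in place on the existing buffer.
/// Non-ASCII characters are left exactly as they are.
fn lower_ascii_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}

/// Unicode-aware lowercasing. This always allocates a new String,
/// because the result may differ in byte length from the input.
fn lower_unicode(s: &str) -> String {
    s.to_lowercase()
}
```

For example, on "ÄPFEL" the ASCII variant produces "Äpfel" (the Ä is untouched), while the Unicode variant produces "äpfel". The length change shows up with U+0130 (capital I with dot above): it takes 2 bytes in UTF-8, but lowercases to "i" followed by the combining dot U+0307, which takes 3 bytes.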