Existing work: Text normalization

rust-ml / nlp-discussion


Existing work: Text normalization #2

Open danieldk opened 5 years ago

rth commented 5 years ago

Assuming this also includes text pre-processing:

Unicode normalization

https://github.com/unicode-rs/unicode-normalization : Unicode Normalization forms according to UAX#15 rules
Possibly https://github.com/kornelski/deunicode/ : Convert Unicode to ASCII
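As a small illustration of why normalization matters here, the sketch below uses only the standard library to show that two canonically equivalent spellings of the same word compare unequal until they are normalized. The helper name and the two constants are made up for this example. Actual normalization would still need a crate such as unicode-normalization, whose UnicodeNormalization trait offers nfc() and nfd() iterators.

```rust
/// Precomposed form: "é" is the single code point U+00E9 (what NFC yields).
const COMPOSED: &str = "caf\u{00E9}";

/// Decomposed form: "e" followed by the combining acute accent U+0301
/// (what NFD yields). It renders the same as COMPOSED.
const DECOMPOSED: &str = "cafe\u{0301}";

/// Hypothetical helper: true when both strings contain exactly the same
/// sequence of code points. This is the same check `==` performs on `&str`;
/// it only gives the comparison a name.
fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

/// Hypothetical helper: number of Unicode scalar values in a string.
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}
```

Without a normalization pass, a tokenizer or vocabulary lookup would treat these two strings as different words.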

Case folding

str::to_ascii_lowercase: ASCII-only conversion to lowercase. Fast, and can be done in place with str::make_ascii_lowercase.
str::to_lowercase: Unicode-aware conversion to lowercase. It can change the length of the string (some characters expand into multiple characters when the case changes), so it cannot be done in place, and it is relatively slow.
Some intermediate solution between the above two, as discussed in https://github.com/rust-lang/rust/issues/26244#issuecomment-344525748 . Related projects:
- https://github.com/JuliaStrings/utf8proc
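To make the trade-off between the two standard-library options concrete, here is a minimal sketch. The wrapper function names are invented for this example; the methods they call (make_ascii_lowercase and to_lowercase) are the standard ones.

```rust
/// ASCII-only lowercasing, done in place: no allocation, and any
/// non-ASCII character is left untouched.
fn lower_ascii_in_place(s: &mut String) {
    s.make_ascii_lowercase();
}

/// Unicode-aware lowercasing: always allocates a new String, and the
/// result may have a different byte length than the input.
fn lower_unicode(s: &str) -> String {
    s.to_lowercase()
}

/// Hypothetical helper: does Unicode lowercasing change the byte length
/// of this input? When it does, an in-place rewrite is not possible.
fn lowercasing_changes_length(s: &str) -> bool {
    s.to_lowercase().len() != s.len()
}
```

For example, "İ" (U+0130) lowercases to "i" followed by the combining dot above (U+0307), growing from two bytes to three, while the ASCII-only variant leaves it unchanged.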

danieldk commented 5 years ago

In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII Unicode punctuation characters to ASCII:

https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs
https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs

This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.

This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
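The punctuation rewriting step described above could be sketched roughly as follows. This is not the conllx-cleanup implementation: the function names and the mapping table are assumptions chosen for illustration, and the preceding Unicode normalization step is left out.

```rust
/// Hypothetical mapping from a few non-ASCII punctuation characters to
/// ASCII replacements. The table is illustrative and far from complete.
fn ascii_punct(c: char) -> Option<&'static str> {
    match c {
        // Single quotation marks: left, right, low-9.
        '\u{2018}' | '\u{2019}' | '\u{201A}' => Some("'"),
        // Double quotation marks: left, right, low-9.
        '\u{201C}' | '\u{201D}' | '\u{201E}' => Some("\""),
        // En dash and em dash.
        '\u{2013}' | '\u{2014}' => Some("-"),
        // Horizontal ellipsis.
        '\u{2026}' => Some("..."),
        _ => None,
    }
}

/// Rewrite every character that has an ASCII replacement in the table
/// and copy all other characters through unchanged.
fn rewrite_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match ascii_punct(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

Letters such as "ß" or "ä" pass through untouched, so only punctuation that a model trained on an ASCII-punctuated corpus would not have seen is rewritten.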

xd009642 commented 5 years ago

If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/ which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling that a lot of that work is yet to be done in the Rust NLP space.
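As a rough idea of what such a replacement step involves, here is a standard-library-only sketch that spells out cardinals from 0 to 999 and a handful of symbols. It does not use or mirror the english-numbers API; the function names, the supported range, and the symbol table are assumptions made for this example, and ordinals are not covered either.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
];

const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
];

/// Hypothetical helper: spell out a cardinal number in English.
/// Returns None for values above 999, which this sketch does not handle.
fn cardinal(n: u16) -> Option<String> {
    match n {
        0..=19 => Some(ONES[n as usize].to_string()),
        20..=99 => {
            let tens = TENS[(n / 10) as usize];
            match n % 10 {
                0 => Some(tens.to_string()),
                r => Some(format!("{}-{}", tens, ONES[r as usize])),
            }
        }
        100..=999 => {
            let head = format!("{} hundred", ONES[(n / 100) as usize]);
            match n % 100 {
                0 => Some(head),
                r => Some(format!("{} {}", head, cardinal(r)?)),
            }
        }
        _ => None,
    }
}

/// Hypothetical helper: textual equivalent for a few common symbols.
fn spell_symbol(c: char) -> Option<&'static str> {
    match c {
        '$' => Some("dollar"),
        '%' => Some("percent"),
        '&' => Some("and"),
        _ => None,
    }
}
```

A real normalizer would also need context (for instance "$5" reads as "five dollars", not "dollar five"), which is part of why this area still needs dedicated tooling.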

Garvys commented 4 years ago

For FST-based text normalization, I re-implemented the C++ library OpenFST in pure Rust: https://github.com/Garvys/rustfst (and it has better performance than OpenFST 😅). Might prove useful.