rust-ml / nlp-discussion


Existing work: Text normalization #2

Open danieldk opened 5 years ago

rth commented 5 years ago

Assuming this also includes text pre-processing:

- Unicode normalization
- Case folding

danieldk commented 5 years ago

In conllx-utils we have a utility (conllx-cleanup) that first normalizes Unicode and then rewrites some non-ASCII punctuation characters to their ASCII equivalents:

https://github.com/danieldk/conllx-utils/blob/master/src/bin/conllx-cleanup.rs
https://github.com/danieldk/conllx-utils/blob/master/src/unicode.rs

This helps particularly if the training corpora for a model do not contain such non-ASCII punctuation characters (e.g. the German treebank that we use was originally ISO-8859-15), though the impact is smaller when word embeddings are used.

This is a niche utility, but it shows another type of normalization that would be useful to have in a general normalization crate.
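
To make the punctuation rewriting step concrete, here is a minimal sketch using only the standard library. It is not the code from conllx-utils: the function names `ascii_punct` and `simplify_punctuation` and the mapping table are assumptions made up for illustration, and the table is far from complete. The preceding Unicode normalization step is left out, since it needs an external crate such as unicode-normalization.

```rust
/// Map a single non-ASCII punctuation character to an ASCII replacement.
/// Hypothetical, incomplete table chosen for illustration only.
fn ascii_punct(c: char) -> Option<&'static str> {
    match c {
        // Single quotation marks: ‘ ’ ‚
        '\u{2018}' | '\u{2019}' | '\u{201A}' => Some("'"),
        // Double quotation marks: “ ” „
        '\u{201C}' | '\u{201D}' | '\u{201E}' => Some("\""),
        // Guillemets: « »
        '\u{00AB}' | '\u{00BB}' => Some("\""),
        // En dash and em dash
        '\u{2013}' | '\u{2014}' => Some("-"),
        // Horizontal ellipsis
        '\u{2026}' => Some("..."),
        _ => None,
    }
}

/// Rewrite the punctuation characters known to `ascii_punct` and copy
/// every other character through unchanged.
fn simplify_punctuation(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match ascii_punct(c) {
            Some(replacement) => out.push_str(replacement),
            None => out.push(c),
        }
    }
    out
}
```

Letters such as `ß` or `ä` pass through untouched, so only punctuation is affected. A general normalization crate would probably want the table to be configurable rather than hard-coded as it is here.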

xd009642 commented 5 years ago

If this includes text preprocessing, there's also https://github.com/Matthew-Maclean/english-numbers/, which I use, although it seems to be no longer actively developed and doesn't support ordinals. I generally need to replace things like numbers and symbols (e.g. $) with their textual equivalents, and I have a feeling a lot of that work is yet to be done in the Rust NLP space.
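
As a rough idea of what replacing numbers and symbols with their textual equivalents could look like, here is a standard-library-only sketch. It does not use the english-numbers crate or mirror its API; the names `cardinal`, `verbalize_token` and `verbalize` are hypothetical. It only handles whitespace-separated tokens that are plain digit strings or a `$` followed by digits, and like the crate mentioned above it does not cover ordinals.

```rust
const ONES: [&str; 20] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
];
const TENS: [&str; 10] = [
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety",
];
const SCALES: [(u32, &str); 3] = [
    (1_000_000_000, "billion"),
    (1_000_000, "million"),
    (1_000, "thousand"),
];

/// Spell out a number in the range 1..=999.
fn below_thousand(n: u32) -> String {
    let mut parts: Vec<String> = Vec::new();
    if n >= 100 {
        parts.push(format!("{} hundred", ONES[(n / 100) as usize]));
    }
    let rest = n % 100;
    if rest > 0 {
        if rest < 20 {
            parts.push(ONES[rest as usize].to_string());
        } else if rest % 10 == 0 {
            parts.push(TENS[(rest / 10) as usize].to_string());
        } else {
            parts.push(format!(
                "{}-{}",
                TENS[(rest / 10) as usize],
                ONES[(rest % 10) as usize]
            ));
        }
    }
    parts.join(" ")
}

/// Spell out any `u32` as an English cardinal number.
fn cardinal(n: u32) -> String {
    if n == 0 {
        return "zero".to_string();
    }
    let mut parts: Vec<String> = Vec::new();
    let mut rest = n;
    for &(value, name) in SCALES.iter() {
        if rest >= value {
            parts.push(format!("{} {}", below_thousand(rest / value), name));
            rest %= value;
        }
    }
    if rest > 0 {
        parts.push(below_thousand(rest));
    }
    parts.join(" ")
}

/// Parse a token made up of ASCII digits only (no sign, no separators).
fn parse_digits(s: &str) -> Option<u32> {
    if !s.is_empty() && s.chars().all(|c| c.is_ascii_digit()) {
        s.parse().ok()
    } else {
        None
    }
}

/// Verbalize one token: "$5" becomes "five dollars", "12" becomes "twelve",
/// anything else is returned unchanged.
fn verbalize_token(token: &str) -> String {
    if let Some(n) = token.strip_prefix('$').and_then(parse_digits) {
        let unit = if n == 1 { "dollar" } else { "dollars" };
        return format!("{} {}", cardinal(n), unit);
    }
    match parse_digits(token) {
        Some(n) => cardinal(n),
        None => token.to_string(),
    }
}

/// Verbalize every whitespace-separated token in `text`.
fn verbalize(text: &str) -> String {
    text.split_whitespace()
        .map(verbalize_token)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Tokens with attached punctuation (such as `12,`) or decimals (such as `$4.99`) are passed through unchanged, which hints at how much tokenization and locale handling a real implementation would need.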

Garvys commented 4 years ago

For FST-based text normalization, I re-implemented the C++ library OpenFST in pure Rust: https://github.com/Garvys/rustfst (and it has better performance than OpenFST 😅). Might prove useful.