rust-ml / nlp-discussion


Existing work: Tokenization #1

Open danieldk opened 5 years ago

jbowles commented 5 years ago

Do we want to include string distance metrics as part of tokenization, or as a separate project? In that light, would it make sense to include string distance metrics (e.g., Levenshtein, Jaro) as part of a larger package for distance metrics in general (e.g., Euclidean, Haversine)?

danieldk commented 5 years ago

I would separate string distance/sequence alignment from tokenization. They are useful for a lot of other things; for example, we use them for restoring case in named entities after lemmatization. That has all kinds of tricky corner cases, and we use sequence alignment to identify characters that may have changed in case conversion. Another example is 'Chrupała-style' lemmatizers, which rely on edit scripts.

String distances are, of course, also used a lot in information retrieval for fuzzy matching. But in such cases Levenshtein automata are often more efficient (they avoid comparing the query against every entry in a word list). This is supported by burntsushi's fst-levenshtein crate:

https://crates.io/crates/fst-levenshtein
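
For reference, a minimal sketch of that kind of fuzzy lookup, adapted from the fst README (in fst ≥ 0.4 the automaton lives behind the `levenshtein` feature as `fst::automaton::Levenshtein` rather than in the separate crate):

```rust
// Fuzzy matching against a sorted word list with a Levenshtein automaton.
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order.
    let set = Set::from_iter(vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"])?;

    // All keys within edit distance 1 of "foo".
    let lev = Levenshtein::new("foo", 1)?;
    let keys = set.search(lev).into_stream().into_strs()?;
    assert_eq!(keys, vec!["fo", "fob", "foo", "food"]);

    Ok(())
}
```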

We have developed a very generic crate for sequence alignment:

https://docs.rs/seqalign/0.2.1/seqalign/

It is generic both over the type of data to be aligned (characters, bytes, molecules) and over the set of edit operations. So you can create your own alignment types or distances by combining existing operations or inventing new ones. In contrast to most other alignment/distance crates, it is not limited to a fixed set of measures.
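
As a rough illustration (based on my reading of the seqalign 0.2 docs, so treat the exact constructor arguments as an assumption), computing a plain Levenshtein distance over characters looks something like this:

```rust
// Sketch: Levenshtein distance over characters with seqalign.
// The measure is parameterized by per-operation costs
// (insert, delete, substitute).
use seqalign::measures::Levenshtein;
use seqalign::Align;

fn main() {
    let measure = Levenshtein::new(1, 1, 1);

    let source: Vec<char> = "kitten".chars().collect();
    let target: Vec<char> = "sitting".chars().collect();

    // The alignment gives access to both the distance and the edit script.
    let alignment = measure.align(&source, &target);
    assert_eq!(3, alignment.distance());
}
```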

danieldk commented 4 years ago

Rust binding for the Alpino tokenizer (for Dutch):

https://crates.io/crates/alpino-tokenizer

Garvys commented 4 years ago

For FST-based tokenizers: I re-implemented the C++ library OpenFST in pure Rust: https://github.com/Garvys/rustfst (and it has better performance than OpenFST 😅). Might prove useful. The OpenFST documentation explains how it can be used to build tokenizers: http://www.openfst.org/twiki/bin/view/FST/FstExamples
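
A minimal sketch of what building a weighted transducer with rustfst looks like, adapted from its README (older versions used `Arc`/`add_arc` instead of `Tr`/`add_tr`); a real tokenizer would compose the input string with transducers like this, as in the OpenFST examples:

```rust
// Sketch: a tiny weighted transducer with rustfst.
// The README uses anyhow::Result, so we do the same here.
use anyhow::Result;
use rustfst::prelude::*;

fn main() -> Result<()> {
    let mut fst = VectorFst::<TropicalWeight>::new();

    // Two states: a start state and a final state.
    let s0 = fst.add_state();
    let s1 = fst.add_state();
    fst.set_start(s0)?;
    fst.set_final(s1, TropicalWeight::one())?;

    // One transition: input label 1 maps to output label 2 with weight 0.5.
    fst.add_tr(s0, Tr::new(1, 2, TropicalWeight::new(0.5), s1))?;

    Ok(())
}
```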

danieldk commented 4 years ago

I have also created bindings for the sentencepiece unsupervised tokenizer:

https://crates.io/crates/sentencepiece

I still have to bind the training parts, but it currently allows one to load a SentencePiece model and tokenize text.
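
A short sketch of the intended usage, assuming the crate's `SentencePieceProcessor` type (the field names on the returned pieces are my recollection of the API, and the model path is a placeholder for a trained SentencePiece model):

```rust
// Sketch: load a SentencePiece model and tokenize a sentence.
use sentencepiece::SentencePieceProcessor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path to a trained SentencePiece model.
    let spp = SentencePieceProcessor::open("model.model")?;

    // Encode a sentence into pieces with their vocabulary ids.
    let pieces = spp.encode("Such a nice day today!")?;
    for piece in pieces {
        println!("{} -> {}", piece.piece, piece.id);
    }

    Ok(())
}
```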

Garvys commented 4 years ago

Nice! 👍

rth commented 4 years ago

There is also https://github.com/huggingface/tokenizers, released recently.