Open danieldk opened 5 years ago
I would separate string distance/sequence alignment from tokenization. They are useful for a lot of other things; e.g., we use them to restore case in named entities after lemmatization. Case restoration has all kinds of tricky corner cases, and we use sequence alignment to identify characters that may have changed in case conversions. Another example is 'Chrupała-style' lemmatizers, which rely on edit scripts.
String distances are, of course, also used a lot in information retrieval for fuzzy matching. But in such cases Levenshtein automata are often more efficient (to avoid matching against each entry of a word list). This is supported by burntsushi's fst-levenshtein crate:
https://crates.io/crates/fst-levenshtein
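For readers who have not met the measure itself: below is a minimal sketch of plain Levenshtein distance using the textbook dynamic-programming recurrence. It is not the automaton approach and does not use the fst-levenshtein API. The function name `levenshtein` is only illustrative, and distance is counted in Unicode scalar values (`char`), which is an assumption and may not be what you want for text with combining characters.

```rust
/// Levenshtein distance between two strings, counted in `char`s.
/// Keeps only two rows of the dynamic-programming table in memory.
pub fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] holds the distance between the prefix of `a` handled so far
    // and the first j chars of `b`. Initially that prefix is empty.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for i in 1..=a.len() {
        // Turning the first i chars of `a` into "" takes i deletions.
        curr[0] = i;
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        std::mem::swap(&mut prev, &mut curr);
    }

    // After the final swap, `prev` is the last completed row.
    prev[b.len()]
}
```

Computing this against every word in a lexicon costs one table per word, which is the cost a Levenshtein automaton over an FST is meant to avoid.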
We have developed a very generic crate for sequence alignment:
https://docs.rs/seqalign/0.2.1/seqalign/
It is generic both over the type of data to be aligned (characters, bytes, molecules) and over the set of edit operations. So you can create your own alignment types or distances by combining different operations or inventing new operations. In contrast to most other alignment/distance crates, it is not limited to a fixed set of measures.
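To make the idea concrete, here is a rough sketch of a distance that is generic over both the element type and the set of allowed operations. This is not seqalign's API. The names `Ops`, `LEVENSHTEIN`, `INDEL`, `cheaper`, and `distance` are hypothetical and only illustrate the design, and the operation set is reduced to three optional costs rather than arbitrary user-defined operations.

```rust
/// Costs of the edit operations allowed in an alignment.
/// `None` disables an operation entirely. (Hypothetical type for this sketch.)
#[derive(Clone, Copy)]
pub struct Ops {
    pub insert: Option<usize>,
    pub delete: Option<usize>,
    pub substitute: Option<usize>,
}

/// Classic Levenshtein: insert, delete, and substitute all cost 1.
pub const LEVENSHTEIN: Ops = Ops { insert: Some(1), delete: Some(1), substitute: Some(1) };
/// Insert/delete only, which corresponds to the LCS-based distance.
pub const INDEL: Ops = Ops { insert: Some(1), delete: Some(1), substitute: None };

/// Returns the cheaper of `best` and `prev + cost`, if that step is possible.
fn cheaper(best: Option<usize>, prev: Option<usize>, cost: Option<usize>) -> Option<usize> {
    match (prev, cost) {
        (Some(p), Some(c)) => Some(best.map_or(p + c, |b| b.min(p + c))),
        _ => best,
    }
}

/// Cheapest cost of turning `a` into `b` using only the operations in `ops`.
/// Returns `None` if no sequence of allowed operations can do it.
pub fn distance<T: PartialEq>(a: &[T], b: &[T], ops: Ops) -> Option<usize> {
    // table[i][j] = cheapest cost of turning a[..i] into b[..j].
    let mut table = vec![vec![None; b.len() + 1]; a.len() + 1];
    table[0][0] = Some(0);

    for i in 0..=a.len() {
        for j in 0..=b.len() {
            let mut best = table[i][j];
            if i > 0 {
                best = cheaper(best, table[i - 1][j], ops.delete);
            }
            if j > 0 {
                best = cheaper(best, table[i][j - 1], ops.insert);
            }
            if i > 0 && j > 0 {
                // Matching elements are always free; mismatches need `substitute`.
                let cost = if a[i - 1] == b[j - 1] { Some(0) } else { ops.substitute };
                best = cheaper(best, table[i - 1][j - 1], cost);
            }
            table[i][j] = best;
        }
    }
    table[a.len()][b.len()]
}
```

Because `distance` only requires `T: PartialEq`, the same function works on `&[char]`, `&[u8]`, or any other comparable element type. Disabling insert and delete while keeping substitution gives a Hamming-style distance that returns `None` for sequences of unequal length.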
Rust binding for the Alpino tokenizer (for Dutch):
For FST-based tokenizers, I re-implemented the C++ library OpenFST in pure Rust: https://github.com/Garvys/rustfst (and it has better performance than OpenFST 😅). It might prove useful. The OpenFST docs explain how it can be used to create tokenizers: http://www.openfst.org/twiki/bin/view/FST/FstExamples
I have also created bindings for the sentencepiece unsupervised tokenizer:
https://crates.io/crates/sentencepiece
Still have to bind the training parts. But currently, it allows one to load up a sentencepiece model and tokenize text.
Nice ! 👍
Also https://github.com/huggingface/tokenizers recently.
Do we want to include string distance metrics as part of tokenization or as a separate project? In this light, would it make sense to include string distance metrics (e.g., Levenshtein, Jaro) as part of a larger package for distance metrics in general (e.g., Euclidean, haversine)?