skeskinen opened 1 year ago
Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp Also, it would probably make sense to move the tokenization tests to Python; that way it would be easy to compare against hf-transformers output.
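For comparison purposes, the core of what hf-transformers does per word is greedy longest-match-first WordPiece. This is just a sketch of a Python reference (not the project's actual test code; the tiny `vocab` here is made up) that a pytest could check the C++ tokenizer's output against:

```python
def wordpiece(word: str, vocab: set[str], unk: str = "[UNK]", max_chars: int = 100) -> list[str]:
    """Greedy longest-match-first WordPiece, mirroring HF's WordpieceTokenizer."""
    if len(word) > max_chars:
        return [unk]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # whole word becomes UNK if any piece is missing
        tokens.append(match)
        start = end
    return tokens

# Toy vocab, for illustration only:
vocab = {"un", "##aff", "##able"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
```

A test could then iterate over a corpus, run both this reference (or `BertTokenizer` from hf-transformers directly) and the C++ tokenizer, and assert the token sequences match.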
So far I haven't tried what happens with Chinese/Japanese characters in tokenization. Some special handling is required, since these languages don't put spaces between words.
It should be relatively simple to copy the existing implementation:
Alternatively: replace the whole tokenizer with the huggingface Rust implementation? It would probably need to be simplified at least a little, but I'd be fine adding some Rust code here if it doesn't complicate the build too much.