skeskinen / bert.cpp

ggml implementation of BERT
MIT License

implement do_handle_chinese_characters in tokenizing #1

skeskinen opened this issue 1 year ago

skeskinen commented 1 year ago

So far I haven't tried what happens with Chinese/Japanese characters in tokenization. Some special handling is required, since these languages don't separate words with spaces.

It should be relatively simple to copy an existing implementation:

  1. Take inspiration from an existing implementation, e.g.: https://github.com/huggingface/tokenizers/blob/ef5f50605ddf9f8caef1598c0e4853862b9707a7/tokenizers/src/normalizers/bert.rs#L98
  2. Implement that in bert.cpp -> bert_normalize_prompt (see the sketch after this list).
  3. Add some test cases with Asian languages to test_tokenizer.cpp, getting the expected results from the Python Transformers library tokenizer.
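For step 2, here is a minimal sketch of what the extra pass in bert_normalize_prompt could look like, assuming the prompt is valid UTF-8. The codepoint ranges are the ones used by the is_chinese_char check in the huggingface normalizer linked above; the helper names (handle_chinese_chars, utf8_codepoint) are placeholders, not existing bert.cpp functions:

```cpp
// Sketch only, not the actual bert.cpp code: pad every CJK codepoint
// with spaces so the later whitespace split treats it as its own word.
// Assumes valid UTF-8 input; a real implementation should validate.
#include <cstdint>
#include <string>

// Same codepoint ranges as is_chinese_char in the huggingface normalizer.
static bool is_chinese_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||
           (cp >= 0x3400  && cp <= 0x4DBF)  ||
           (cp >= 0x20000 && cp <= 0x2A6DF) ||
           (cp >= 0x2A700 && cp <= 0x2B73F) ||
           (cp >= 0x2B740 && cp <= 0x2B81F) ||
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||
           (cp >= 0xF900  && cp <= 0xFAFF)  ||
           (cp >= 0x2F800 && cp <= 0x2FA1F);
}

// Decode the UTF-8 codepoint starting at s[i]; n receives its byte length.
static uint32_t utf8_codepoint(const std::string & s, size_t i, size_t & n) {
    const uint8_t c = s[i];
    if (c < 0x80)        { n = 1; return c; }
    if ((c >> 5) == 0x6) { n = 2; return ((c & 0x1F) << 6) | (s[i+1] & 0x3F); }
    if ((c >> 4) == 0xE) { n = 3; return ((c & 0x0F) << 12) | ((s[i+1] & 0x3F) << 6) | (s[i+2] & 0x3F); }
    n = 4;
    return ((uint32_t)(c & 0x07) << 18) | ((uint32_t)(s[i+1] & 0x3F) << 12) |
           ((uint32_t)(s[i+2] & 0x3F) <<  6) |  (uint32_t)(s[i+3] & 0x3F);
}

// The pass itself: "你好world" -> " 你  好 world".
static std::string handle_chinese_chars(const std::string & text) {
    std::string out;
    out.reserve(text.size());
    for (size_t i = 0; i < text.size(); ) {
        size_t n = 0;
        const uint32_t cp = utf8_codepoint(text, i, n);
        if (is_chinese_char(cp)) {
            out += ' ';
            out.append(text, i, n);
            out += ' ';
        } else {
            out.append(text, i, n);
        }
        i += n;
    }
    return out;
}
```

After this pass, the existing whitespace split sees each CJK character as a separate word, which should then line up with what the Python tokenizer produces for these ranges.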

Alternatively: replace the whole tokenizer with the huggingface Rust implementation? It would probably need to be simplified at least a little, but I would be fine with adding some Rust code here if it doesn't complicate the build too much.

skeskinen commented 1 year ago

Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp

Also, it would probably make sense to move the tokenization tests to Python. That way it would be easy to compare against the hf-transformers output.