skeskinen / bert.cpp

ggml implementation of BERT
MIT License

implement do_handle_chinese_characters in tokenizing #1

skeskinen opened this issue 1 year ago

skeskinen commented 1 year ago

So far I haven't tried what happens with Chinese/Japanese characters in tokenization. Some special handling is required, since these languages don't separate words with spaces.

It should be relatively simple to copy an existing implementation:

  1. Take inspiration from an existing implementation, e.g.: https://github.com/huggingface/tokenizers/blob/ef5f50605ddf9f8caef1598c0e4853862b9707a7/tokenizers/src/normalizers/bert.rs#L98
  2. Implement that in bert.cpp -> bert_normalize_prompt (see the sketch after this list).
  3. Add some test cases with Asian languages to test_tokenizer.cpp, getting the expected results from the Python Transformers library tokenizer.
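For step 2, here is a minimal sketch of what the extra pass in bert_normalize_prompt could look like, assuming the prompt is valid UTF-8. The codepoint ranges are the ones used by the is_chinese_char check in the huggingface normalizer linked above; the helper names (handle_chinese_chars, utf8_codepoint) are placeholders, not existing bert.cpp functions:

```cpp
// Sketch only, not the actual bert.cpp code: pad every CJK codepoint
// with spaces so the later whitespace split treats it as its own word.
// Assumes valid UTF-8 input; a real implementation should validate.
#include <cstdint>
#include <string>

// Same codepoint ranges as is_chinese_char in the huggingface normalizer.
static bool is_chinese_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||
           (cp >= 0x3400  && cp <= 0x4DBF)  ||
           (cp >= 0x20000 && cp <= 0x2A6DF) ||
           (cp >= 0x2A700 && cp <= 0x2B73F) ||
           (cp >= 0x2B740 && cp <= 0x2B81F) ||
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||
           (cp >= 0xF900  && cp <= 0xFAFF)  ||
           (cp >= 0x2F800 && cp <= 0x2FA1F);
}

// Decode the UTF-8 codepoint starting at s[i]; n receives its byte length.
static uint32_t utf8_codepoint(const std::string & s, size_t i, size_t & n) {
    const uint8_t c = s[i];
    if (c < 0x80)        { n = 1; return c; }
    if ((c >> 5) == 0x6) { n = 2; return ((c & 0x1F) << 6) | (s[i+1] & 0x3F); }
    if ((c >> 4) == 0xE) { n = 3; return ((c & 0x0F) << 12) | ((s[i+1] & 0x3F) << 6) | (s[i+2] & 0x3F); }
    n = 4;
    return ((uint32_t)(c & 0x07) << 18) | ((uint32_t)(s[i+1] & 0x3F) << 12) |
           ((uint32_t)(s[i+2] & 0x3F) <<  6) |  (uint32_t)(s[i+3] & 0x3F);
}

// The pass itself: "你好world" -> " 你  好 world".
static std::string handle_chinese_chars(const std::string & text) {
    std::string out;
    out.reserve(text.size());
    for (size_t i = 0; i < text.size(); ) {
        size_t n = 0;
        const uint32_t cp = utf8_codepoint(text, i, n);
        if (is_chinese_char(cp)) {
            out += ' ';
            out.append(text, i, n);
            out += ' ';
        } else {
            out.append(text, i, n);
        }
        i += n;
    }
    return out;
}
```

After this pass, the existing whitespace split sees each CJK character as a separate word, which should then line up with what the Python tokenizer produces for these ranges.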

Alternatively: replace the whole tokenizer with the huggingface Rust implementation? It would probably need to be simplified at least a little, but I would be fine with adding some Rust code here if it doesn't complicate the build too much.

skeskinen commented 1 year ago

Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp

Also, it would probably make sense to move the tokenization tests to Python. That way it would be easy to compare against the hf-transformers output.