iamlemec / bert.cpp

GGML implementation of the BERT model with Python bindings and quantization.
MIT License

bug: bert_tokenize cannot find the longest token #6

Closed: snowyu closed this issue 7 months ago

snowyu commented 7 months ago

Roll back to the original code to correct it.

Test with "gpt": it should be split into ["gp", "t"].
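
The expected behavior is greedy longest-match-first: at each position, try the longest remaining substring that is still in the vocabulary before any shorter one. Here is a minimal sketch of that rule (a hypothetical helper, not the actual bert_tokenize code, and omitting WordPiece's "##" continuation prefix for brevity):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first sub-word split. At each position, scan
// candidate lengths from longest to shortest and take the first hit.
std::vector<std::string> tokenize_word(
        const std::string &word,
        const std::unordered_set<std::string> &vocab) {
    std::vector<std::string> tokens;
    size_t start = 0;
    while (start < word.size()) {
        size_t len = word.size() - start;
        for (; len > 0; --len) {
            if (vocab.count(word.substr(start, len))) break;
        }
        if (len == 0) return {"[UNK]"};  // no sub-token matches here
        tokens.push_back(word.substr(start, len));
        start += len;  // advance exactly past the match
    }
    return tokens;
}
```

With a vocabulary containing "g", "gp", "pt", and "t", this returns ["gp", "t"] for "gpt", because the scan at position 0 prefers the length-2 match "gp" over "g".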

iamlemec commented 7 months ago

I think the issue here is that "gpt" is not actually in the vocabulary. It works for other examples such as "hell"/"hello". I was having some issues with the original code going over n_max_tokens, so I ended up changing a couple of things.

snowyu commented 7 months ago

Yes, "hello" is handled correctly, and "gpt" is not in the vocabulary. But HF's tokenizer splits "gpt" into ["gp", "t"], not ["g", "pt"].

I think we should use HF's tokenizer as the benchmark for testing, right?

iamlemec commented 7 months ago

Ah, I see, thanks! Found the bug in the tokenizer: it was double-incrementing in one loop. Fixed in c7026df3abe1fb7cc8ecd1fab7d564a28d975d48. Should work now.
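
For reference, here is a hypothetical reconstruction of that failure mode, not the exact code from the commit: double-stepping the loop counter in the candidate scan skips every other candidate length, so for "gpt" the scan tests lengths 3 and then 1, never the length-2 match "gp":

```cpp
#include <string>
#include <unordered_set>

// Hypothetical reconstruction of the double-step bug; illustration
// only, not the code from commit c7026df.
std::string longest_match(const std::string &word, size_t start,
                          const std::unordered_set<std::string> &vocab) {
    for (size_t i = word.size(); i > start; --i) {
        std::string cand = word.substr(start, i - start);
        if (vocab.count(cand)) return cand;
        --i;  // BUG: the loop header already steps i, so every other
              // candidate length is skipped; deleting this line
              // restores longest-match behavior
    }
    return "";  // no vocabulary entry matches at this position
}
```

With a vocabulary containing "g", "gp", "pt", and "t", the buggy scan returns "g" at position 0 and then "pt" at position 1, i.e. ["g", "pt"]; removing the stray decrement restores the longest match "gp" and the expected ["gp", "t"].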