snowyu closed 7 months ago
I think the issue here is that "gpt" is not actually in the vocabulary. It works for other examples such as "hell"/"hello". I was having some issues with the original code going over n_max_tokens, so I ended up changing a couple of things.
Yes, "hello" is right, and "gpt" is not in the vocabulary. But HF's tokenizer splits "gpt" into ["gp", "t"] instead of ["g", "pt"].
I think we should use HF's tokenizer as a benchmark for testing, right?
Ah, I see, thanks! Found the bug in the tokenizer, was double incrementing in one loop. Fixed in c7026df3abe1fb7cc8ecd1fab7d564a28d975d48. Should work now.
Rolled back to the original code to correct it.
Test with "gpt": it should be split into ["gp", "t"].
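For reference, here is a minimal Python sketch of a score-based greedy merge loop and the kind of regression check discussed above. It is not the code from this repository, nor HF's implementation: the function name `bpe_tokenize` and the `TOY_SCORES` table are hypothetical, and the scores are invented so that "gp" outranks "pt". It only illustrates why a string that is missing from the vocabulary can still split in two different ways depending on merge priority.

```python
def bpe_tokenize(text, scores):
    """Greedy merge sketch: start from single characters and repeatedly
    join the adjacent pair whose merged string has the highest score."""
    tokens = list(text)
    while True:
        best_score, best_idx = None, None
        # Scan every adjacent pair exactly once per pass (advance i by one
        # step only; skipping positions here would miss candidate merges).
        for i in range(len(tokens) - 1):
            merged = tokens[i] + tokens[i + 1]
            score = scores.get(merged)
            if score is not None and (best_score is None or score > best_score):
                best_score, best_idx = score, i
        if best_idx is None:
            break  # no adjacent pair can be merged any further
        tokens[best_idx:best_idx + 2] = [tokens[best_idx] + tokens[best_idx + 1]]
    return tokens


# Hypothetical toy scores (assumption, not taken from any real vocabulary).
# "gpt" itself is deliberately absent; "gp" is preferred over "pt".
TOY_SCORES = {
    "gp": 2.0, "pt": 1.0,
    "he": 5.0, "ll": 4.0, "hell": 3.0, "hello": 2.5,
}

print(bpe_tokenize("gpt", TOY_SCORES))
print(bpe_tokenize("hello", TOY_SCORES))
```

With these toy scores, "gpt" stays as two pieces because no merged "gpt" entry exists, and the higher-scoring "gp" wins over "pt". A real test would compare against the reference tokenizer's output for the actual vocabulary instead.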