rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0

Warning: Bad token in vocab at index xxx #11

Closed · CheatCod closed this issue 1 year ago

CheatCod commented 1 year ago

Running cargo run --release -- -m ~/dev/llama.cpp/models/7B/ggml-model-f16.bin -f prompt prints a bunch of "Warning: Bad token in vocab at index..." messages (screenshot attached).

The path points to a ggml-converted LLaMA model, which I have verified works with llama.cpp.

setzer22 commented 1 year ago

Thanks for reporting, @CheatCod! We are aware of the issue. Apparently, some models have tokens in the embedded vocabulary that are not valid UTF-8. This is an error, but not an irrecoverable one: when we detect it, we print the warning and replace the broken tokens with the Unicode replacement character. The C++ code, on the other hand, simply ignores the issue (C++ strings are just byte arrays, while Rust strings are required to hold valid UTF-8), but I thought it would be a good idea to at least warn users.

There is some discussion about this in #3. Some of us don't see the errors, while others do. If I had to guess, this is something in the convert-pth-to-ggml.py script, which in some cases happily writes the invalid byte sequences, and in other cases (like mine, and others who reported the same) substitutes the replacement character for the invalid bytes, so my model never had the broken UTF-8 in the first place.

This could be due to a combination of the environment, OS, Python version, version of the conversion script, and who knows what else :thinking:
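
For illustration, here is a minimal sketch of the lossy decoding described above. It is not the library's actual loader code, just a demonstration of the behaviour: if a token's bytes aren't valid UTF-8, warn and fall back to String::from_utf8_lossy, which substitutes U+FFFD for the broken bytes.

// Sketch only: illustrates the behaviour described above, not the real loader.
fn decode_token(index: usize, raw: &[u8]) -> String {
    match std::str::from_utf8(raw) {
        Ok(s) => s.to_string(),
        Err(_) => {
            // Invalid UTF-8 in the vocabulary: warn, then replace the broken
            // bytes with U+FFFD via the lossy conversion.
            eprintln!("Warning: Bad token in vocab at index {index}");
            String::from_utf8_lossy(raw).into_owned()
        }
    }
}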

mwbryant commented 1 year ago

Hopefully this will be less intrusive after #10 merges and we have a better logging crate. Maybe it shouldn't even be a warn! but an info! (or not print at all), because it really doesn't hurt anything, according to @setzer22's digging into the actual tokens.

setzer22 commented 1 year ago

Yup, I'd say turning it into info would be nice to avoid log spam, since the issue seems to be quite common.

philpax commented 1 year ago

I've downgraded it to info in #10 in the CLI (which now controls the logs).

setzer22 commented 1 year ago

Not sure what to do about this. We often get reports about this issue, so leaving it open sounds like a good idea to avoid duplicates.

But also, there's very little we can do about those non-UTF-8 tokens in the vocabulary: they are most certainly an error, and they affect lots of users.

So, what should we do? One option is to downgrade the message even further, to trace level. Another is to simply leave things as they are, since people can use the RUST_LOG env variable to ignore all info messages; but they will probably still want to see the other loading messages.
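
As an interim workaround, per-module log filtering can hide just these messages while keeping the rest of the loading output visible. A hedged sketch using the env_logger and log crates, where the module path llama_rs::vocabulary is a made-up example rather than the crate's real module name:

// Sketch: keep general info-level loading output, but silence the bad-token
// messages by filtering their (hypothetical) source module down to error level.
use env_logger::Builder;
use log::LevelFilter;

fn init_logging() {
    Builder::new()
        .filter_level(LevelFilter::Info)
        .filter_module("llama_rs::vocabulary", LevelFilter::Error)
        .init();
}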

philpax commented 1 year ago

I think leaving it as-is is fine: we can hopefully figure out what's generating those invalid tokens when we get to #21.

mwbryant commented 1 year ago

I think it's probably safe to completely remove the print, or make it a one-time thing. From your research, we are already replacing the broken tokens with the Unicode replacement character, and the C++ version of this code doesn't handle them in any special way (I've also noticed that the C++ code sometimes segfaults for me, while this library never does). The issue seems to originate upstream of llama-rs, and we have already handled it; the only remaining problem people report is the print-outs.

philpax commented 1 year ago

I'd like to investigate what's causing this. My feeling is that while the tokens themselves are not valid UTF-8, the generated output is (e.g. two adjacent tokens combine into a valid string). I'm also curious whether the newest llama.cpp tokeniser addresses this.

philpax commented 1 year ago

I checked my theory and it appears to be correct: the tokens are not guaranteed to be valid UTF-8 by themselves.

With invalid tokens enabled:

llama-rs # cargo run --release --bin llama-cli -- infer -m ../llama-models/v0/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt -p "How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?" --seed 943589183
[...]

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?

### Response:
この������は、一つでも文字化けな������を持たせる必要があります。

With raw bytes:

llama-rs # cargo run --release --bin llama-cli -- infer -m ../llama-models/v0/ggml-alpaca-7b-q4.bin -f examples/alpaca_prompt.txt -p "How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?" --seed 943589183
[...]

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:

How do I say 'This is a complex sentence that will require some Unicode' in Japanese kanji?

### Response:
この試験は、一つでも文字化けな困險を持たせる必要があります。

I'm thinking we should return the raw byte slices, but offer a helper to coalesce tokens until they form valid UTF-8.
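
To make the effect concrete, here is a tiny self-contained sketch with made-up token bytes: the three UTF-8 bytes of 試 (U+8A66) are split across two tokens, each of which fails to decode on its own, while the concatenation decodes cleanly.

// Illustrative only: the token byte values are invented for this example.
fn main() {
    let token_a: &[u8] = &[0xE8, 0xA9]; // first two bytes of "試" (U+8A66)
    let token_b: &[u8] = &[0xA6];       // final continuation byte of "試"

    // Each token on its own is rejected as invalid/incomplete UTF-8...
    assert!(std::str::from_utf8(token_a).is_err());
    assert!(std::str::from_utf8(token_b).is_err());

    // ...but the concatenated byte stream decodes to the intended character.
    let combined: Vec<u8> = [token_a, token_b].concat();
    assert_eq!(std::str::from_utf8(&combined), Ok("試"));
}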

philpax commented 1 year ago

After some discussion on Discord, we came to the conclusion that the current behaviour is in fact a bug: the LLM infers over byte tokens, not UTF-8 tokens.

That being said, we'd still like to make using the library easy, so I'm going to switch everything over to use &[u8] except for inference_with_token, which will buffer tokens until they form valid UTF-8. I'll also expose the adapter it uses for this, so that users can do something similar when they fetch the byte-tokens themselves.
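
As a rough sketch of what such a coalescing helper could look like (the name TokenUtf8Buffer and the details here are illustrative, not a description of the actual API): buffer raw token bytes and only emit text once the accumulated bytes form valid UTF-8.

// Sketch of a coalescing helper; the name and behaviour are illustrative.
pub struct TokenUtf8Buffer {
    buf: Vec<u8>,
}

impl TokenUtf8Buffer {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    // Push one token's raw bytes. Returns Some(text) once the buffered bytes
    // form valid UTF-8 (and clears the buffer); returns None while the bytes
    // are still an incomplete sequence.
    pub fn push(&mut self, token: &[u8]) -> Option<String> {
        self.buf.extend_from_slice(token);
        match std::str::from_utf8(&self.buf) {
            Ok(s) => {
                let out = s.to_string();
                self.buf.clear();
                Some(out)
            }
            Err(_) => None, // wait for more bytes
        }
    }
}

A real implementation would also need to decide what to do with bytes that can never complete a valid sequence (for example, give up after a few tokens and fall back to lossy replacement).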