Alisa-lisa opened this issue 6 months ago
Thanks for reporting this! For my own reference: the issue is that the EOT token is not read from the tokenizer; instead, it is assumed to be the hardcoded token `</s>`. This made sense in the early days of LLaMA, but is no longer true:
https://github.com/rustformers/llm/blob/e61e5f9461d6c7a14455846bdeba13479e16f396/crates/models/llama/src/lib.rs#L373
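A minimal sketch of the fix being described: prefer an EOS id declared in the model's metadata over the legacy hardcoded `</s>` lookup. The metadata key `tokenizer.ggml.eos_token_id` and the function shape are assumptions for illustration, not the actual `llm` crate API:

```rust
use std::collections::HashMap;

// Hypothetical helper: resolve the EOS token id for a model.
// GGUF files can carry the id in their metadata; older LLaMA-style
// vocabularies only expose it as the literal token "</s>".
fn eos_token_id(metadata: &HashMap<String, u32>, vocab: &[&str]) -> u32 {
    // Prefer the id stored in the model metadata (assumed key name).
    if let Some(&id) = metadata.get("tokenizer.ggml.eos_token_id") {
        return id;
    }
    // Fall back to the legacy hardcoded "</s>" lookup.
    vocab
        .iter()
        .position(|t| *t == "</s>")
        .map(|p| p as u32)
        .unwrap_or(0)
}

fn main() {
    let vocab = ["<unk>", "<s>", "</s>", "hello"];

    // Old-style model: no metadata entry, "</s>" is found by lookup.
    let empty = HashMap::new();
    println!("{}", eos_token_id(&empty, &vocab)); // prints 2

    // GGUF model that declares its EOS id explicitly in metadata.
    let mut meta = HashMap::new();
    meta.insert("tokenizer.ggml.eos_token_id".to_string(), 32000);
    println!("{}", eos_token_id(&meta, &vocab)); // prints 32000
}
```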
I have discovered that running the same model with the same parameters from llm (gguf branch) and from llama.cpp results in different behavior. llm does not seem to read the EOS token, so the model keeps generating output until the max-token limit is reached. Here is llama.cpp:
And here is the same model from llm:
![llm](https://github.com/rustformers/llm/assets/4137964/b122d6cb-f8cc-4886-9483-64421c2ed0ed)
According to a discussion on Discord, it might indeed be a bug.
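For clarity, the behavior difference boils down to the generation loop's stop condition. A hedged sketch (the sampler closure and function names are illustrative, not the `llm` crate's actual interface): if the sampled token is compared against the tokenizer's real EOS id, generation halts early; if the EOS id is wrong or never matched, the loop always runs to `max_tokens`, which matches the screenshot above.

```rust
// Hypothetical generation loop: stop when the sampled token equals the
// tokenizer's EOS id, instead of always running to max_tokens.
fn generate(sample: impl Fn(usize) -> u32, eos_id: u32, max_tokens: usize) -> Vec<u32> {
    let mut out = Vec::new();
    for step in 0..max_tokens {
        let tok = sample(step);
        if tok == eos_id {
            break; // stop at EOS rather than padding out to max_tokens
        }
        out.push(tok);
    }
    out
}

fn main() {
    // A fake sampler that emits tokens 10, 11, then EOS (id 2).
    let sampled = [10u32, 11, 2, 12, 13];
    let out = generate(|i| sampled[i], 2, 5);
    println!("{:?}", out); // prints [10, 11]

    // With a wrong EOS id, the loop never breaks and runs to max_tokens.
    let out = generate(|i| sampled[i], 999, 5);
    println!("{:?}", out); // prints [10, 11, 2, 12, 13]
}
```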