martin-gorner closed this PR 1 month ago
There seems to be a problem left:
model.preprocessor("A")
appends the end_token_id to the tokenized output. This will not be the correct id for "Instruct" Llama variants. Any guidance on how to address this properly?
More issues found. The output is different when using the HF vs. the Keras tokenizer. Special tokens are not being parsed by the Keras preprocessor.
txt = "<|start_header_id|>system<|end_header_id|>hello<|end_of_text|>"
tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print("HF tokenizer\n", tok(txt).input_ids)
prepro = keras_hub.models.Llama3CausalLMPreprocessor.from_preset("hf://meta-llama/Llama-3.2-1B-Instruct")
print("Keras proprocessor\n",prepro(txt)[0]["token_ids"][:40])
output:
# 128000, 128006, 128007, 128001 are the expected special tokens
HF tokenizer output:
[128000, 128006, 9125, 128007, 15339, 128001]
Keras preprocessor output:
[128000 27 91 2527 62 2775 62 307 91 29
9125 27 91 408 62 2775 62 307 91 29
15339 27 91 408 62 1073 62 1342 91 29
128001 0 0 0 0 0 0 0 0 0]
This happens with or without the present PR. Investigating...
Added a fix for the preprocessing of special tokens.
Added the loading of the correct eos_token from the Hugging Face config. The info unfortunately does not exist in the Keras config, so a hack is used to make generation work in all cases.
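For reference, a minimal sketch (not the converter code in this PR) of where the correct eos_token lives on the Hugging Face side: it is set in tokenizer_config.json and exposed on the transformers tokenizer object.

import transformers

# The instruct variant configures eos_token as <|eot_id|>,
# the base variant as <|end_of_text|>.
tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
print(tok.eos_token, tok.eos_token_id)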
The last remaining problem is the packer in the preprocessor: model.preprocessor("A")
appends the end_token_id to the tokenized output. It will be the wrong end_token for Llama3 "instruct" variants loaded from Keras checkpoints.
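A minimal sketch of the issue, assuming the llama3_instruct_8b_en preset (attribute names follow the existing Keras tokenizer convention):

import keras_hub

preprocessor = keras_hub.models.Llama3CausalLMPreprocessor.from_preset(
    "llama3_instruct_8b_en"
)
x, y, sample_weight = preprocessor("A")
# The packer appends tokenizer.end_token_id, i.e. <|end_of_text|>,
# whereas the instruct variant should end with <|eot_id|>.
print(preprocessor.tokenizer.end_token_id)
print(x["token_ids"][:4])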
Thanks Martin! The presets would need to be regenerated after this, correct?
My PR does not change anything in the Keras Llama presets. They still have their built-in end_token, which is wrong for -instruct variants. I just added a hack that declares both possible end_tokens as stop tokens for generation, so that generation at least stops.
If you have a way of storing the correct end_token in the preset, that would be a good fix for the Keras version indeed. Is that possible? How would versioning work? All Llama instruct models need this: Llama2, Llama3.0, Llama3.1.
The Deeplabv3 test failure does not seem to be related to this PR... There isn't much I can do to fix it.
> If you have a way of storing the correct end_token in the preset, that would be a good fix for the Keras version indeed. Is that possible? How would versioning work? All Llama instruct models need this: Llama2, Llama3.0, Llama3.1.

I think this would be a good thing to add to tokenizer.json in presets. Regarding versioning, we can upload new preset files on Kaggle, and the new version number needs to be updated in the presets file for the model.
Hey @martin-gorner! Why aren't we hard coding all the special tokens in the Llama tokenizer? We already know what all the special tokens are and we don't expect them to change, right? So why not hard-code them all? Wouldn't that fix the issue for both loading from a Keras and a HF checkpoint?
Hardcoding the tokens or loading them from a fixed preset for all Llamas does not make much difference. The difficulty is selecting the correct "end of text" token for instruct vs. non-instruct models. This is easy in the Hugging Face approach, where each model has its own config file.
For example, the current behavior in Keras is:
# "instruct" variant
tokenizer = keras_hub.models.Llama3Tokenizer.from_preset("llama3_instruct_8b_en")
> tokenizer.end_token = '<|end_of_text|>'
> tokenizer.end_token2 = '<|eot_id|>'
# "non-instruct" variant
tokenizer = keras_hub.models.Llama3Tokenizer.from_preset("llama3_8b_en")
> tokenizer.end_token = '<|end_of_text|>'
> tokenizer.end_token2 = '<|eot_id|>'
This is wrong: the instruct variant's end token should be '<|eot_id|>'. Thanks to my "end_token2" hack, this works for inference, since both end tokens stop generation. But for further fine-tuning of the "instruct" variants, the end token is wrong.
Llama 3 uses the <|end_of_text|> special token in non-instruct model variants and the <|eot_id|> special token in "instruct" model versions. Without this fix, text generation does not stop in "instruct" model variants. This is configured in tokenizer_config.json. For example, Meta-Llama-3-8B-Instruct/tokenizer_config.json:
"bos_token": "<|begin_of_text|>"
"eos_token": "<|eot_id|>"
Meta-Llama-3-8B/tokenizer_config.json:
"bos_token": "<|begin_of_text|>"
"eos_token": "<|end_of_text|>"
However, this file does not exist in Keras Llama presets and the special tokens are hard-coded in the Llama3Tokenizer constructor using _add_special_token.
Another difference between Keras and Transformers tokenizer presets is that in Transformers, tokenizer_config.json has a list of special tokens:
<|end_header_id|>
<|start_header_id|>
<|end_of_text|>
<|python_tag|>
<|eom_id|>
<|begin_of_text|>
<|eot_id|>
<|finetune_right_pad_id|>
Again, this config file does not exist in Keras presets. Instead, special tokens can be found directly in the vocabulary, and it's not exactly the same list (see the check sketched after this list):
<|end_header_id|>
<|start_header_id|>
<|end_of_text|>
<|begin_of_text|>
<|eot_id|>
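A sketch of how to verify this, assuming the llama3_8b_en preset is available locally:

import keras_hub

tokenizer = keras_hub.models.Llama3Tokenizer.from_preset("llama3_8b_en")
vocab = set(tokenizer.get_vocabulary())
# Check which of the Transformers-side special tokens exist in the Keras vocabulary.
for token in ["<|begin_of_text|>", "<|end_of_text|>", "<|start_header_id|>",
              "<|end_header_id|>", "<|eot_id|>", "<|python_tag|>",
              "<|eom_id|>", "<|finetune_right_pad_id|>"]:
    print(token, token in vocab)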
In light of these discrepancies, this fix does the following:
1) It keeps the "special tokens hardcoded in constructor through
_add_special_token
" approach but expands the list to match all the special tokens in the Keras' Llama3 vocabulary:<|end_header_id|>
<|start_header_id|>
<|end_of_text|>
<|begin_of_text|>
<|eot_id|>
2) When converting from a Transformers checkpoint, it adds the special tokens to the vocabulary so that tokenization works for them.
3) It declares both <|end_of_text|> and <|eot_id|> as stop tokens for generation in all cases (see the sketch below). It would have been possible to fix this properly on the Transformers side by reading the configured eos_token from the config, but generation from "instruct" variants loaded from Keras presets would have remained broken.
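For illustration, a sketch of what point 3 amounts to at generation time, assuming the llama3_instruct_8b_en preset; with the fix, the default stop_token_ids="auto" behaves like this explicit call:

import keras_hub

causal_lm = keras_hub.models.Llama3CausalLM.from_preset("llama3_instruct_8b_en")
tokenizer = causal_lm.preprocessor.tokenizer
# Stop on either end token so both base-style and instruct-style outputs terminate.
stop_token_ids = [
    tokenizer.token_to_id("<|end_of_text|>"),
    tokenizer.token_to_id("<|eot_id|>"),
]
print(causal_lm.generate("Hello", stop_token_ids=stop_token_ids))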