keras-team / keras-hub

Pretrained model hub for Keras 3

fix for generation that never stops in Llama3-Instruct variants #1904

Closed by martin-gorner 1 month ago

martin-gorner commented 1 month ago

Llama 3 uses the <|end_of_text|> special token in non-instruct model variants and the <|eot_id|> special token in "instruct" model variants. Without this fix, text generation in "instruct" model variants never stops. This is configured in tokenizer_config.json.

For example:

Meta-Llama-3-8B-Instruct/tokenizer_config.json:
    "bos_token": "<|begin_of_text|>"
    "eos_token": "<|eot_id|>"

Meta-Llama-3-8B/tokenizer_config.json:
    "bos_token": "<|begin_of_text|>"
    "eos_token": "<|end_of_text|>"
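
For reference, a minimal sketch (not part of the PR) of reading those fields from the Hugging Face config files, assuming the checkpoints have been downloaded locally:

import json

# Local checkpoint directories, matching the files quoted above.
for config_path in [
    "Meta-Llama-3-8B-Instruct/tokenizer_config.json",
    "Meta-Llama-3-8B/tokenizer_config.json",
]:
    with open(config_path) as f:
        config = json.load(f)
    # bos_token is the same for both variants; eos_token differs.
    print(config_path, "->", config["bos_token"], config["eos_token"])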

However, this file does not exist in Keras Llama presets; the special tokens are hard-coded in the Llama3Tokenizer constructor using _add_special_token.

Another difference between the Keras and Transformers tokenizer presets is that in Transformers, tokenizer_config.json also lists the special tokens: <|end_header_id|>, <|start_header_id|>, <|end_of_text|>, <|python_tag|>, <|eom_id|>, <|begin_of_text|>, <|eot_id|>, <|finetune_right_pad_id|>. Again, this config file does not exist in Keras presets. Instead, the special tokens can be found directly in the vocabulary, and it is not exactly the same list: <|end_header_id|>, <|start_header_id|>, <|end_of_text|>, <|begin_of_text|>, <|eot_id|>.
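
As an illustration, the special tokens actually present in a Keras preset can be listed straight from the vocabulary. A minimal sketch, assuming the llama3_8b_en preset and the tokenizer's get_vocabulary() method:

import keras_hub

# Load a Keras Llama3 tokenizer and scan its vocabulary for <|...|> entries.
tokenizer = keras_hub.models.Llama3Tokenizer.from_preset("llama3_8b_en")
special_tokens = [
    token
    for token in tokenizer.get_vocabulary()
    if token.startswith("<|") and token.endswith("|>")
]
print(special_tokens)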

In light of these discrepancies, this fix does the following:

1) It keeps the "special tokens hard-coded in the constructor through _add_special_token" approach but expands the list to match all the special tokens in the Keras Llama3 vocabulary: <|end_header_id|>, <|start_header_id|>, <|end_of_text|>, <|begin_of_text|>, <|eot_id|>.

2) When converting from a Transformers checkpoint, it adds the special tokens to the vocabulary so that tokenization works for them.

3) It declares both <|end_of_text|> and <|eot_id|> as stopping tokens for generation in all cases (see the sketch after this list). It would have been possible to fix this properly on the Transformers side by reading the configured eos_token from the config, but generation from Instruct variants loaded from Keras presets would have remained broken.
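
The effect of (3) on generation can be approximated by passing both ids explicitly. A minimal sketch, not the PR code itself, assuming the Llama3CausalLM task, the tokenizer's token_to_id() helper and the stop_token_ids argument of generate():

import keras_hub

causal_lm = keras_hub.models.Llama3CausalLM.from_preset("llama3_instruct_8b_en")
tokenizer = causal_lm.preprocessor.tokenizer
# Stop on either end token, whichever the model emits.
stop_ids = [
    tokenizer.token_to_id("<|end_of_text|>"),
    tokenizer.token_to_id("<|eot_id|>"),
]
print(causal_lm.generate("Hello", max_length=64, stop_token_ids=stop_ids))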

martin-gorner commented 1 month ago

There is one problem left: model.preprocessor("A") appends the end_token_id to the tokenized output. This will not be the correct id for "Instruct" Llama variants. Any guidance on how to address this properly?
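
To make that concrete, a quick sketch of the check, assuming an instruct preset and the tokenizer attribute on the preprocessor (end_token_id being the attribute set by _add_special_token):

import keras_hub

prepro = keras_hub.models.Llama3CausalLMPreprocessor.from_preset("llama3_instruct_8b_en")
x, y, sample_weight = prepro("A")
# The packer appends tokenizer.end_token_id, i.e. <|end_of_text|>,
# whereas an instruct variant should end on <|eot_id|>.
print(prepro.tokenizer.end_token_id)
print(x["token_ids"][:8])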

martin-gorner commented 1 month ago

More issues found. The output is different when using the HF vs. the Keras tokenizer. Special tokens are not being parsed by the Keras preprocessor.

import transformers
import keras_hub

txt = "<|start_header_id|>system<|end_header_id|>hello<|end_of_text|>"

# Tokenize the same string with the Hugging Face tokenizer...
tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
print("HF tokenizer\n", tok(txt).input_ids)

# ...and with the Keras preprocessor loaded from the Hugging Face checkpoint.
prepro = keras_hub.models.Llama3CausalLMPreprocessor.from_preset("hf://meta-llama/Llama-3.2-1B-Instruct")
print("Keras preprocessor\n", prepro(txt)[0]["token_ids"][:40])

output:

# 128000, 128006, 128007, 128001 are the expected special tokens

HF tokenizer output:
 [128000, 128006, 9125, 128007, 15339, 128001]

Keras preprocessor output:
 [128000     27     91   2527     62   2775     62    307     91     29
   9125     27     91    408     62   2775     62    307     91     29
  15339     27     91    408     62   1073     62   1342     91     29
 128001      0      0      0      0      0      0      0      0      0]

This happens with or without the present PR. Investigating...

martin-gorner commented 1 month ago

Added a fix for the preprocessing of special tokens, and added loading of the correct eos_token from the Hugging Face config. That info unfortunately does not exist in the Keras config, so a hack is used to make generation work in all cases. Last remaining problem: the packer in the preprocessor: model.preprocessor("A") appends the end_token_id to the tokenized output, and it will be the wrong end token for Llama3 "instruct" variants loaded from Keras checkpoints.

divyashreepathihalli commented 1 month ago

Thanks Martin! The presets would need to be regenerated after this, correct?

martin-gorner commented 1 month ago

My PR does not change anything in the Keras Llama presets. They still have their built-in end_token, which is wrong for -instruct variants. I just added a hack that declares both possible end tokens as stop tokens for generation, so that generation at least stops.

If you have a way of storing the correct end_token in the preset, that would indeed be a good fix for the Keras version. Is that possible? How would versioning work? All Llama3-instruct models need this: Llama2, Llama3.0, Llama3.1.

martin-gorner commented 1 month ago

The Deeplabv3 test failure does not seem to be related to this PR... There isn't much I can do to fix it.

divyashreepathihalli commented 1 month ago

> If you have a way of storing the correct end_token in the preset, that would indeed be a good fix for the Keras version. Is that possible? How would versioning work? All Llama3-instruct models need this: Llama2, Llama3.0, Llama3.1.

I think this would be a good thing to add to tokenizer.json in the presets. Regarding versioning, we can upload new preset files on Kaggle, and the new version number then needs to be updated in the preset files for the model.

SamanehSaadat commented 1 month ago

Hey @martin-gorner! Why aren't we hard-coding all the special tokens in the Llama tokenizer? We already know what all the special tokens are and we don't expect them to change, right? So why not hard-code them all? Wouldn't that fix the issue for loading from both Keras and HF checkpoints?

martin-gorner commented 1 month ago

Hard-coded or loaded from a fixed preset for all Llamas does not make much difference. The difficulty is selecting the correct "end of text" token for instruct vs. non-instruct models. This is easy in the Hugging Face approach, where each model has its own config file.

For example, the current behavior in Keras is:

# "instruct" variant
tokenizer - keras_hub.models.Llama3Tokenizer.from_preset("llama3_instruct_8b_en")
> tokenizer.end_token = '<|end_of_text|>'
> tokenizer.end_token2 = '<|eot_id|>'

# "non-instruct" variant
tokenizer = keras_hub.models.Llama3Tokenizer.from_preset("llama3_8b_en")
> tokenizer.end_token = '<|end_of_text|>'
> tokenizer.end_token2 = '<|eot_id|>'

This is wrong: the instruct variant end token should be '<|eot_id|>'. Thanks to my "end_token2" hack, this will work for inference, since both end tokens stop generation. But for further fine-tuning of the "instruct" variants, the end token is wrong.
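
For comparison, the Hugging Face tokenizers pick up the right token per checkpoint because it is read from tokenizer_config.json. A minimal sketch using the public AutoTokenizer API (model ids as published on the Hub):

import transformers

for model_id in ["meta-llama/Meta-Llama-3-8B-Instruct", "meta-llama/Meta-Llama-3-8B"]:
    tok = transformers.AutoTokenizer.from_pretrained(model_id)
    # eos_token comes from each checkpoint's tokenizer_config.json:
    # <|eot_id|> for the instruct variant, <|end_of_text|> for the base model.
    print(model_id, "->", tok.eos_token)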