keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Add phi3 #1597

Closed abuelnasr0 closed 1 month ago

abuelnasr0 commented 2 months ago

This PR adds the Phi-3 model. The PR is not ready for merge yet; I still need to:

EDIT: It's ready for review now.

abuelnasr0 commented 2 months ago

Phi3Backbone is now ready to be reviewed and merged!

I have checked numerics for both models, phi3_mini_4k_instruct_en and phi3_mini_128k_instruct_en. The mean difference between the two outputs is good for float32 and float16 (i.e. 5.2247e-08), but for bfloat16 the result is lower quality (i.e. -0.0004). I think this is normal, isn't it? Here is a notebook with the script run on the two models: https://www.kaggle.com/code/mohamedabuelnasr/phi3-keras-conversion

tirthasheshpatel commented 2 months ago

Thanks for the work on this @abuelnasr0, this is awesome!

> I have checked numerics for both models, phi3_mini_4k_instruct_en and phi3_mini_128k_instruct_en. The mean difference between the two outputs is good for float32 and float16 (i.e. 5.2247e-08), but for bfloat16 the result is lower quality (i.e. -0.0004). I think this is normal, isn't it?

Yes, that's pretty good. Are these absolute tolerance values or relative? Either way, it's below the machine precision of 32-bit and 16-bit floating-point values, so they definitely match!
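For context, here is a quick sketch of the machine epsilon for each dtype, which puts these mean differences on a scale (assuming numpy and the ml_dtypes package are installed):

```python
# Sketch: machine epsilon per dtype, as a yardstick for the observed
# mean differences. Assumes numpy and ml_dtypes are installed.
import numpy as np
import ml_dtypes

print(np.finfo(np.float32).eps)                 # ~1.19e-07
print(np.finfo(np.float16).eps)                 # ~9.77e-04
print(ml_dtypes.finfo(ml_dtypes.bfloat16).eps)  # ~7.81e-03
```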

abuelnasr0 commented 2 months ago

@tirthasheshpatel

> Are these absolute tolerance values or relative?

It was supposed to calculate the absolute difference, but I took another look at the function and found that I was calculating the mean of the difference, not the mean of the absolute difference. I have corrected it and re-run the conversion script. The results got worse but are still acceptable for float32 (3.0725e-06); for bfloat16 they are bad (0.0254 for the 128k model and 0.0469 for the 4k model), and for float16 it is 0.0046.
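To illustrate the bug, a minimal sketch (the arrays here are illustrative placeholders, not values from the actual conversion script):

```python
# Sketch: a signed mean difference lets positive and negative errors
# cancel, while the mean absolute difference does not. The arrays are
# placeholders, not outputs from the real conversion script.
import numpy as np

keras_out = np.array([1.002, 1.998], dtype=np.float32)
hf_out    = np.array([1.000, 2.000], dtype=np.float32)

print(np.mean(keras_out - hf_out))          # ~0.0: +0.002 and -0.002 cancel
print(np.mean(np.abs(keras_out - hf_out)))  # ~0.002: the real average gap
```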

abuelnasr0 commented 2 months ago

The model is ready now, and the output matches the Hugging Face output for both models.

(screenshot: 4k-phi3-model)

(screenshot: 128k-phi3-model)

But there is a problem with the tokenizer, described here:

text = "<|user|>\nHow to win?<|end|>\n<|assistant|>"
# the output after adding special_tokens as user_defined_symbols to the sentence_piece model.
keras_nlp  : [1, 29871, 32010, 13, 5328, 304, 5401, 29973, 32007, 13, 32001]
# same as keras but without adding '▁' at the beginning. can be configured in the spm model.
llama_cpp  : [1, 32010, 13, 5328, 304, 5401, 29973, 32007, 13, 32001] 
# Removes '\n' (LF token) completly. 
# Adds '▁' at the beginning (If text starts with non-special token) and after each special token.
hf_fast_tok: [1, 32010, 1128, 304, 5401, 29973, 32007, 32001]
# Removes '\n' (LF token) completly. Adds '▁' at the beginning.
# Same as keras but if the text doesn't contain '\n'.
hf_tok     : [1, 29871, 32010, 5328, 304, 5401, 29973, 32007, 32001] 

The huggingface output should match the sentencepiece output after adding the special tokens, but huggingface handles special tokens outside the sentencepiece library, which is why the output doesn't match.

LlamaTokenizer and LlamaFastTokenizer in huggingface aren't consistent with each other. If we have to match the huggingface output, we should match LlamaFastTokenizer, since it is the one used in the example on the official model page, but that would require a lot of workarounds, for example like in https://github.com/keras-team/keras-nlp/pull/1445
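A rough sketch of how the comparison above can be reproduced (assuming the `transformers` and `sentencepiece` packages are installed, and that a local `tokenizer.model` with the special tokens already added as `user_defined_symbols` is available; the `tokenizer.model` path is a placeholder):

```python
# Sketch: reproduce the tokenizer comparison above. Assumes
# `transformers` and `sentencepiece` are installed, and that
# `tokenizer.model` (with the special tokens added as
# user_defined_symbols) exists locally.
import sentencepiece as spm
from transformers import AutoTokenizer

text = "<|user|>\nHow to win?<|end|>\n<|assistant|>"

# Hugging Face fast tokenizer: handles special tokens outside sentencepiece.
hf_fast = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", use_fast=True
)
print(hf_fast.encode(text))

# Hugging Face slow (LlamaTokenizer-based) tokenizer.
hf_slow = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", use_fast=False
)
print(hf_slow.encode(text))

# Raw sentencepiece; note it does not prepend the BOS id (1) by default.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode(text))
```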

NOTE: The generations in the screenshots match because I used a text that is tokenized identically by keras_nlp and by huggingface's LlamaTokenizer.

mattdangerw commented 1 month ago

Copying the presets over now on Kaggle. I will pull this in today.

mattdangerw commented 1 month ago

Updated links, though I think things are still processing on the Kaggle side.