tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

Make the tokenizer better #27

Closed magician-blue closed 9 months ago

magician-blue commented 10 months ago

I'm trying to make llama2.mojo work on TinyLlama-1.1B, which is a GQA model and does not tie embeddings. I have now finished converting the model and modifying parts of llama2.mojo (following llama.cpp and llama.c). I have noticed that our tokenizer is not stable compared with the Hugging Face tokenizer.

magician-blue commented 10 months ago

When they tokenize the prompt the same way, the outputs are the same. mojo:

num hardware threads:  12
SIMD vector width:  16
checkpoint size:  4400717852 [ 4196 MB ]
config.vocab_size: -32000
Do6132  you366  like763  apple26163  ?29973  

Do you like apple?
A. The first thing that you should do is to check the weather. If it is raining, then you should go to the nearest park. If it is sunny, then you should go to the beach. If it is cloudy, then you should go to the mountains.
Q. What is the weather like in the city of London?
A. The weather in London is very hot and humid. The temperature is around 30 degrees
achieved tok/s:  4.3139134602814933

hf:

Do you like apple?
A. The first thing you should do is to check the weather forecast. If it is going to be cloud
magician-blue commented 10 months ago
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  4400717852 [ 4196 MB ]
config.vocab_size: -32000
Is3624  29871  a29874  p29886  ple552  g330  ood2092  29871  for1454  29871  o29877  ur332  h298  ea11248  l29880  th386  

Is apple good for our health, we'll,
we'll be here for a while.

Is3624 apple26163 good1781 for363 our1749 h298 ea11248 lt1896 h29882 ?29973

tairov commented 10 months ago

@magician-blue did you modify the source code to execute this? What was your prompt?

tairov commented 10 months ago

I also noticed token generation discrepancies between C & Mojo, but I suspect that could be related to differences in the floating-point math implementations; not sure.

magician-blue commented 10 months ago

@magician-blue did you modify the source code to execute this? What was your prompt?

Yes, I modified the source code, but mainly the RoPE part and llama.c's export.py (which converts the HF model to bin). I haven't changed the tokenizer code.

tairov commented 10 months ago

Also, there is an impact from the randomizer & seeds.

magician-blue commented 10 months ago

Also, there is an impact from the randomizer & seeds.

But llama.c's tokenizer is more stable. There shouldn't be so much difference.

tairov commented 10 months ago

@magician-blue do you mean llama2.c is also generating unstable output on the HF model?

magician-blue commented 10 months ago

@magician-blue do you mean llama2.c is also generating unstable output on the HF model?

It generates exactly the same output as HF.

magician-blue commented 10 months ago

The only difference is the bpe_encode part.

That means that if the input_ids are the same, our output is exactly the same as the HF output.
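
For reference, the bpe_encode step in llama2.c (and the port here) is essentially a greedy merge loop driven by the vocab scores. A rough Python sketch of that idea, where vocab (string → id) and scores (id → merge score) are placeholder assumptions, not the real data structures in the code:

# Rough sketch of a llama2.c-style greedy BPE encode; not the actual implementation.
def bpe_encode(text, vocab, scores):
    # start from single-character tokens
    ids = [vocab[ch] for ch in text]
    id_to_str = {i: s for s, i in vocab.items()}
    while True:
        best_score, best_pos, best_id = float("-inf"), -1, -1
        for i in range(len(ids) - 1):
            merged = id_to_str[ids[i]] + id_to_str[ids[i + 1]]
            if merged in vocab and scores[vocab[merged]] > best_score:
                best_score, best_pos, best_id = scores[vocab[merged]], i, vocab[merged]
        if best_pos == -1:
            break  # no adjacent pair can be merged any further
        ids = ids[:best_pos] + [best_id] + ids[best_pos + 2:]
    return ids

If two implementations agree at this step, the forward pass sees identical input_ids, which is why the generated text then matches.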

tairov commented 10 months ago

llama2.c got multiple changes over the last few weeks that probably aren't fully reflected in llama2.mojo; I didn't have time to take a closer look yet. That could also be one of the reasons for the instability.

magician-blue commented 10 months ago

llama2.c got multiple changes over the last few weeks that probably aren't fully reflected in llama2.mojo; I didn't have time to take a closer look yet. That could also be one of the reasons for the instability.

The difference:

tairov commented 10 months ago

Am I right that you're trying this on the latest changes in this repo related to the vocab sort? Could you also try the previous commit without sorting?

magician-blue commented 10 months ago

Am I right that you're trying this on the latest changes in this repo related to the vocab sort? Could you also try the previous commit without sorting?

I'm on the vocab sort version. However, I found that llama2.c's tokenizer also can't tokenize the prompt the way the HF tokenizer does. For example, for the input <|im_start|>:

llama2.c outputs: 1(<)529(|)29989(im)326(_)29918(start)2962(|)29989(>)
HF outputs: [1, 32001]

Note: I have checked that llama2.c has read the tokenizer properly, e.g. <|im_start|> maps to id 32001.

Maybe we need to look at how HF implements it.
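
The difference likely comes from HF treating <|im_start|> / <|im_end|> as added special tokens: the text is split on them before BPE runs, so each one maps to a single id above the base vocab. A quick way to inspect that (the chat model id below is an assumption based on this thread):

from transformers import AutoTokenizer
# Assumed model id; the point is only to look at the added special tokens.
tok = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.2")
print(tok.get_added_vocab())        # extra tokens appended after the base 32000-entry vocab
print(tok.encode("<|im_start|>"))   # [1, 32001] per the thread, since it is a single added token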

tairov commented 10 months ago

Yes, that's suspicious. Most likely the tokenizers aren't 100% compatible.

tairov commented 10 months ago

@magician-blue could you share a more detailed report & steps to reproduce the outputs, including the CLI options you used?

magician-blue commented 10 months ago

My code is messy right now. I'll give more details, and I will push the model/tokenizer I use to Hugging Face tomorrow.

magician-blue commented 10 months ago

You can go to my repo https://github.com/magician-blue/llama2.mojo, download the tinyllama model with download.sh, and test the model with test.sh.

magician-blue commented 10 months ago

Am I right that you're trying this on the latest changes in this repo related to the vocab sort? Could you also try the previous commit without sorting?

@tairov Yesterday I misunderstood your question; I was working on the previous version without sorting. Now I'm working on the sorted version. It's much more stable than the previous one!

Findings:

magician-blue commented 10 months ago

If we solve these two problems, we can run tinyllama-1.1B on our llama2.mojo with exactly the same output as HF.

magician-blue commented 10 months ago
# from transformers import AutoModelForCausalLM
# import torch
# m = AutoModelForCausalLM.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-240k-503B",torch_dtype=torch.float16, device_map="auto")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-intermediate-step-240k-503B")
tokenizer.encode("What is the meaning of life?")

HF tokenizer.

magician-blue commented 10 months ago

I have a temporary method to solve this problem: we can use the HF tokenizer to tokenize the input and then pass the token ids to llama2.mojo. Maybe we can build an inference-engine version of llama2.mojo whose inputs and outputs are just token ids.

I tried to do this with gradio.py.
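
A minimal sketch of that hand-off, using the HF tokenizer purely as a frontend; the model id is an assumption, and how the ids actually reach llama2.mojo (CLI flag, file, gradio.py, etc.) is left open here:

from transformers import AutoTokenizer
# Assumed model id; encode the chat-formatted prompt and keep only the raw ids.
tok = AutoTokenizer.from_pretrained("PY007/TinyLlama-1.1B-Chat-v0.2")
prompt = "<|im_start|>user\nWhat is the meaning of life?<|im_end|>\n<|im_start|>assistant\n"
ids = tok.encode(prompt)
print(" ".join(str(i) for i in ids))  # these ids would be fed to the token-id-only inference loop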

<|im_start|>user
What is the meaning of life?<|im_end|>
<|im_start|>assistant
The meaning of life is a philosophical question that has been debated by many people over the centuries. The answer to this question can vary greatly depending on the individual's point of view and the questions that they have about life.

Some people believe that the meaning of life is to find happiness, while others believe that the meaning of life is to find meaning and fulfillment. Some people believe that the meaning of life is to contribute to society, while others believe that the meaning of life is to find personal fulfillment.

The answer to the question of the meaning of life can be a very personal and subjective question, and there is no right or wrong answer. However, it is a question that many people have thought about and discussed for centuries, and it is a question that many people will continue to debate and discuss as they grow and change over the course of their lives.<|im_end|>

achieved tok/s:  4.0540683805989928
<|im_start|>user
Is apple good for our health?<|im_end|>
<|im_start|>assistant
Apple is a healthy food that is rich in fiber, vitamins, and minerals. It is a good source of antioxidants called fruits like rutabaga, kale, and broccoli, which have anti-inflammatory properties.

In addition, apple contains vitamin C, which has several health benefits, such as improving digestion, preventing inflammation, and improving skin health.

However, it is important to note that apple consumption is not a healthy habit and can have negative effects on our health. It can cause bloating, gas, and diarrhea, which is why it is best to consume it in moderation and avoid it if you have any health issues.<|im_end|>

achieved tok/s:  4.0898641588296765

The answer looks nice.

magician-blue commented 10 months ago
<|im_start|>user
Can you explain contrastive learning in machine learning in simple terms for someone new to the field of ML?<|im_end|>
<|im_start|>assistant
Contrastive learning is a type of machine learning where we use a pair of text data (e.g., a sentence and its translation) to train a machine learning model. The model is trained to identify patterns in the data and generate new outputs that are similar to the original data.

In simple terms, the idea is that we give the model two examples of the same text data (e.g., "The sky is blue"), and it uses its training data to compare the two examples and generate a score for whether the two examples are similar or not. If the model scores both examples as similar, it knows that the two examples are related, and can then generate new outputs that are also similar to the original examples.

Contrastive learning has many applications in machine learning, including but not limited to:

- Text summarization: We can use contrastive learning to generate summaries of long text data, such as articles or books.

- Learning to draw: We can use contrastive learning to train machine learning
achieved tok/s:  4.0344909421722965
magician-blue commented 10 months ago

Besides, I found that the HF tokenizer also can't tokenize <|im_start|> and <|im_end|> perfectly. Therefore, these special tokens are not so important. However, if the tokenizer doesn't encode \n as 13 and instead encodes it as \\ and n, the output will be terrible. So the most important issue is how to encode \n correctly.

magician-blue commented 9 months ago

😊!!!!

After converting \n to 13, our tokenizer works fine (so <|im_start|> and <|im_end|> don't matter).

The reason our tokenizer tokenizes \n wrongly is that we first put the char \ and the char n into the tokens variable. After concatenating, they become [\, n] in memory. However, \n is a single character stored in memory, so when we look up the vocab, we can't find [\, n].

My method for now is to set id=13 when \n (that is, \ followed by n) appears. The next thing to do is to build a converter for [\, n], [\, t], and [\, \] and preprocess the prompt before tokenizing, as sketched below.
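
A small Python sketch of that preprocessing step (the actual fix would live in the Mojo tokenizer code): scan the CLI prompt once, left to right, and turn the two-character escape sequences into the real characters before the vocab lookup, so \n becomes the single byte that maps to id 13.

def unescape_prompt(prompt: str) -> str:
    # single left-to-right pass: "\n" -> newline, "\t" -> tab, "\\" -> backslash
    escapes = {"n": "\n", "t": "\t", "\\": "\\"}
    out, i = [], 0
    while i < len(prompt):
        if prompt[i] == "\\" and i + 1 < len(prompt) and prompt[i + 1] in escapes:
            out.append(escapes[prompt[i + 1]])
            i += 2
        else:
            out.append(prompt[i])
            i += 1
    return "".join(out)

print(unescape_prompt("hello\\nworld"))  # prints "hello" and "world" on separate lines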

<|im_start|>user
Can you explain huggingface?<|im_end|>
<|im_start|>assistant
Huggingface is a software company that provides tools and resources for building and hosting large language models. It was founded by OpenAI and has a focus on developing and supporting large language models for a variety of use cases, including natural language generation, text generation, and dialogue systems.

Huggingface provides a range of tools and resources, including:

1. A platform for hosting and running large language models, including both pre-trained models and fine-tuned versions of them.
2. A library of tools and resources for building and training language models, including libraries for data processing, generation, and evaluation.
3. A community of developers and users who can contribute to and learn from the tools and resources available on the platform.
4. Support for a variety of use cases, including text generation, natural language processing, and dialogue systems.

Huggingface also provides a suite of tools and resources for researchers and developers interested in building and training large language models. These include:

1. A documentation and documentation management
magician-blue commented 9 months ago

This method will also work for llama2.c.

tairov commented 9 months ago

@magician-blue could you please share the fn wrap logic?

tairov commented 9 months ago

Overall, the tl-chat model's output looks impressive.

Regarding the incompatible tokens, do you have an idea where we should inject this compatibility? In our tokenizer.bin, I guess?

How is this HF model executed originally? Could you share some code pointers? Thanks.

magician-blue commented 9 months ago

@magician-blue could you please share the fn wrap logic?

I have uploaded it now.

magician-blue commented 9 months ago

Overall, the tl-chat model's output looks impressive.

Regarding the incompatible tokens, do you have an idea where we should inject this compatibility? In our tokenizer.bin, I guess?

How is this HF model executed originally? Could you share some code pointers? Thanks.

Sorry, I'm not familiar with the HF tokenizer. But I don't think it's the fault of tokenizer.bin, because the bin only contains the vocab and the corresponding scores.
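
That matches the llama2.c-style tokenizer.bin layout: a max_token_length header followed by (score, length, bytes) per token, with no merge rules or special-token table. A sketch of reading it, assuming that format; the file name and the vocab size of 32003 are taken from the logs in this thread:

import struct

# Sketch: read a llama2.c-style tokenizer.bin; it holds only scores and token strings.
def read_tokenizer(path, vocab_size=32003):
    vocab, scores = [], []
    with open(path, "rb") as f:
        max_token_length = struct.unpack("<i", f.read(4))[0]  # header, unused here
        for _ in range(vocab_size):
            score = struct.unpack("<f", f.read(4))[0]
            length = struct.unpack("<i", f.read(4))[0]
            vocab.append(f.read(length).decode("utf-8", errors="replace"))
            scores.append(score)
    return vocab, scores

vocab, scores = read_tokenizer("tok_tl-chat.bin")
print(len(vocab))  # expected 32003 for the tl-chat tokenizer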

tgujral80 commented 9 months ago

Hi @magician-blue,

I was following the thread and just checked the code submitted for tinyllama. While @tairov's stories bin works perfectly, your submitted code gives the error below:

num hardware threads: 8
SIMD vector width: 16
checkpoint size: 4400767004 [ 4196 MB ]
n layers: 22
vocab size: 32003
[6826:6826:20230923,053616.721605:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[6826:6826:20230923,053616.721705:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes. Stack dump:

  1. Program arguments: mojo llama2.mojo tl-chat.bin -z tok_tl-chat.bin -n 256 -t 0 -s 100 -i hi
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var LLVM_SYMBOLIZER_PATH to point to it):
0 mojo 0x0000560de498a797
1 mojo 0x0000560de498836e
2 mojo 0x0000560de498ae6f
3 libc.so.6 0x00007fe2389b1520
4 libc.so.6 0x00007fe1a8006420
Illegal instruction (core dumped)
magician-blue commented 9 months ago

@tgujral80 I was modifying some parts of the code in the previous few hours, so you might have cloned an unstable version. The other possible reason is that your RAM is not enough to run the model. (I will try to quantize the model later.)

The output of your prompt would be:

mojo llama2.mojo tl-chat.bin \
    -z tok_tl-chat.bin \
    -n 256 -t 0 -s 100 -i hi
num hardware threads:  12
SIMD vector width:  16
checkpoint size:  4400767004 [ 4196 MB ]
n layers:  22
vocab size:  32003
hi
What is the best way to prepare for a job interview?<|im_end|>
<|im_start|>assistant
The best way to prepare for a job interview is to prepare well in advance. This includes researching the company, the position, and the company's culture. It's also important to be professional and prepared. It's also important to practice the interview questions in advance so that you can be prepared.

Here are some tips for preparing for a job interview:

1. Research the company and the position: This includes reading the job description, learning about the company's history, and researching the company's competitors.
2. Practice the interview questions: Practice the questions in advance so that you can be prepared.
3. Be professional: Be prepared and show that you are a professional. This includes dressing appropriately, being polite, and maintaining a professional demeanor.
4. Be prepared to answer questions: Answer any questions that the interviewer may have about the job or the company.
5. Listen to the interviewer: Listen to the interviewer's questions and take notes as needed.
achieved tok/s:  4.4575743803097572

Now you can go to the master branch of my repo. There is an example in the README showing how to play with tinyllama-1.1b-chat-v0.2.

tgujral80 commented 9 months ago

Thanks @magician-blue, I increased the RAM and it worked.