Louis-y-nlp opened this issue 7 months ago
I think a single 32GB GPU can hold an fp16 version of the llama-2-13b model.
You can set `lade.config_lade(LEVEL=5, WINDOW_SIZE=10, GUESS_SET_SIZE=10, DEBUG=1)`,
or something smaller, for a 13b model. The default setting is too costly for a 13b model; a lighter setup is sketched below.
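For reference, here is a minimal sketch of how that lighter configuration plugs into a generation script, modeled on the repo's minimal.py. The checkpoint path is a placeholder, I'm assuming the usual `lade.augment_all()` entry point, and the values are a starting point rather than tuned numbers:

```python
import torch
import lade
from transformers import AutoModelForCausalLM, AutoTokenizer

lade.augment_all()  # patch transformers' generation before loading the model

# Lighter lookahead setting for a 13b model on a single 32GB GPU.
lade.config_lade(LEVEL=5, WINDOW_SIZE=10, GUESS_SET_SIZE=10, DEBUG=1)

model_path = "path/to/llama-2-13b"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```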
@Viol2000 What key factors would drive a better configuration for models of different sizes, like you've mentioned above?
Hi @yhyu13. You can check Table 1 in our blog. We require a large amount of extra FLOPs to predict tokens. When the GPU is weak or the model is larger, we need to reduce this cost (and, correspondingly, predict fewer tokens), or it will cause a slowdown.
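To make "large extra FLOPs" concrete, here is a rough back-of-envelope sketch. The cost expression is my own approximation of the lookahead and verification branch sizes, not an exact count from the paper or the code:

```python
# Rough approximation (an assumption, not the exact cost model):
# the lookahead branch processes ~WINDOW_SIZE * (LEVEL - 1) extra tokens and
# the verification branch ~GUESS_SET_SIZE * (LEVEL - 1) extra tokens per step.
def extra_tokens_per_step(level: int, window_size: int, guess_set_size: int) -> int:
    return (window_size + guess_set_size) * (level - 1)

# A heavy (default-like) setting vs. the lighter one suggested above:
print(extra_tokens_per_step(7, 20, 20))   # 240 extra tokens per step
print(extra_tokens_per_step(5, 10, 10))   # 80 extra tokens per step
```

Under this estimate, the lighter setting cuts the per-step overhead by roughly 3x, which is why it matters on a GPU without spare compute.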
@Viol2000 Thank you for your assistance. After adjusting the parameters, I observed a slight improvement in inference speed, from 18 tokens/s to approximately 21 tokens/s.
The speedup still seems quite low. Adjusting the hyperparameters may help further, but I think the main reason is that the V100 does not have the spare FLOPs to run a heavy lookahead and verification branch for a 13b model.
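One way to check whether a given setting actually helps on a specific GPU is to measure raw throughput with and without lookahead enabled. A self-contained sketch (the helper name is mine, not part of the library):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=256):
    """Crude throughput: generated tokens divided by wall-clock seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed
```

Run it once with lookahead disabled and once per candidate configuration; whichever configuration gives the highest tokens/s on your V100 wins.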
Thanks for your work. I used your demo code, but I did not observe any speed improvement; instead, I noticed a decrease in speed. I used a V100-32G GPU and ran `minimal.py` on a fine-tuned llama2-13b model.
Here is my requirements information: