pytorch-labs / gpt-fast

Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
BSD 3-Clause "New" or "Revised" License

Fix int4 quantization #152

Status: Closed (by malfet, 5 months ago)

malfet commented 5 months ago

Discovered by @HDCharles

Test plan:

% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cuda
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cuda.pth --prompt "Once upon a time" --device cuda
...
Using int4 weight-only quantization!
Time to load model: 3.20 seconds
Once upon a time I was a kid. And that kid, as I understand, went through a phase as a teen where he binge watched a whole bunch of movies. I don’t remember the exact number, but it seems like at least 50 movies in succession. I read somewhere that people would record movies on VHS tapes and then binge watched them, so maybe that’s what this kid was doing. I also read somewhere that the person had never binge watched 50 movies in succession again.
That’s the truth and it’s a shame. That’s how you know the world is changing in a horrible way. The binge watcher, the VHS watcher, the guy who turns a whole bunch of movies into a marathon and then stops. The person who made that guy stop. That’s why I’m writing this: to prevent you from reading this, and I’m sorry. I’m sorry that you’ll never turn
Time for inference 1: 8.27 sec total, 24.17 tokens/sec
Bandwidth achieved: 106.17 GB/s

and on CPU:

% python3 quantize.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --mode int4 --device cpu
% python3 generate.py --checkpoint_path checkpoints/openlm-research/open_llama_7b/model_int4.g32.cpu.pth --prompt "Once upon a time" --device cpu
...
Using int4 weight-only quantization!
Time to load model: 0.09 seconds
Once upon a time, I was ith the new movie.
Welcome to the third installment of the Once Upon a Time! series.
This time around, I’ve decided to focus on a movie that has had its fair share of publicity and fame, but one that I was not familiar with before.
The movie in question is the 2004 remake of the classic fairy tale The Three Little Pigs, which was released the same year as Pirates of the Caribbean: The Curse of the Black Pearl and the 2007 adaptation of the classic novel The Lion King.
It was the first film in the Once Upon a Time! series that I had not seen, and as such, I was only familiar with the first half of the story.
I was intrigued by the story, and I knew that I would be interested in seeing the movie when I was able.
I had watched a bunch of trailers and clips to get an idea of what the movie was going
Time for inference 2: 27.75 sec total, 7.21 tokens/sec
Bandwidth achieved: 31.65 GB/s
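
As a rough sanity check on the two logs above, the reported figures can be cross-checked against each other. The sketch below assumes that "Bandwidth achieved" is computed as bytes read per generated token multiplied by tokens/sec (an assumption about how generate.py reports it, not something stated in this PR); the helper names are hypothetical and the numbers are copied from the runs above.

```python
# Figures copied verbatim from the CUDA and CPU runs logged in the test plan.
RUNS = {
    "cuda": {"seconds": 8.27, "tokens_per_sec": 24.17, "bandwidth_gb_s": 106.17},
    "cpu": {"seconds": 27.75, "tokens_per_sec": 7.21, "bandwidth_gb_s": 31.65},
}


def implied_gb_per_token(run):
    # Assumption: reported bandwidth = (GB read per token) * (tokens/sec),
    # so dividing recovers the amount of data read per generated token.
    return run["bandwidth_gb_s"] / run["tokens_per_sec"]


def implied_token_count(run):
    # Total time multiplied by throughput gives the number of generated tokens.
    return run["seconds"] * run["tokens_per_sec"]


def speedup(fast, slow):
    # Ratio of decoding throughput between two runs.
    return fast["tokens_per_sec"] / slow["tokens_per_sec"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(
            f"{name}: ~{implied_gb_per_token(run):.2f} GB per token, "
            f"~{implied_token_count(run):.0f} tokens generated"
        )
    print(f"cuda vs cpu: {speedup(RUNS['cuda'], RUNS['cpu']):.2f}x tokens/sec")
```

Under that assumption, both runs imply roughly the same amount of data read per token and the same number of generated tokens, which is what one would expect if the CUDA and CPU checkpoints hold the same int4 weights and the runs differ only in throughput.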