pytorch / torchchat

Run PyTorch LLMs locally on servers, desktop and mobile
BSD 3-Clause "New" or "Revised" License

[LAUNCH BLOCKER][AARCH64] Quantization produces bad results #445

Closed metascroy closed 5 months ago

metascroy commented 6 months ago

Steps to reproduce:

  1. Download Llama-2-7b-chat-hf to .model-artifacts/meta-llama:

    python torchchat.py download llama2

  2. Convert tokenizer.model to tokenizer.bin:

    export TORCHCHAT_ROOT=${PWD}
    pushd .model-artifacts/meta-llama/Llama-2-7b-chat-hf
    python3 ${TORCHCHAT_ROOT}/utils/tokenizer.py --tokenizer-model=tokenizer.model
    popd

  3. Quantize the model and export to AOTI. Produces bad results with both bf16 and fp16.

BF16 repro:

python3 torchchat.py export llama2 --dtype bf16 --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}' --output-dso-path ./model.so

FP16 repro:

python3 torchchat.py export llama2 --dtype fp16 --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}' --output-dso-path ./model.so
  4. Run the model.

generate mode:

./runner-aoti/cmake-out/run ./model.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -i "Once upon a time"

Output: Once upon a time newsomorphic weitereÃiffeمomorphic jednakebenéglise Уи prima^C

chat mode:

./runner-aoti/cmake-out/run ./model.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -m chat

Output: Enter system prompt (optional): User: How are you? Assistant: `<щаelességlise associatedlib apéglise associated vale Possibleეaby months间 Crimeéglise ProfessionalThrowéglise龙iegeléglisetrpy responsevaluemységliseabe Guardiegelteil zuaw whoseajoresource singingkiramaMartin stadéglise cookiesiegeléglisebareiegel manyégliseiegeléglisepisjsloss Decста Editionственных華',égliselaps vari loopsiegel (« astronom Entityicamentequeessemijiegel п toutes Heartspecific as picturescraftiegeléglise souségliseanta reflectioniegeltxldGlobaliegel vsiegelженаmatrixformiegel value manière loségliseNumber paździer l *iegeliegeliegel石églisequantity Reildégliseccioniegel Robertsufflepo>рі need bonuscharts߬ fundségliseégliseCancel finden tryo shoot Facebookiegeliegelhe helpingposition Eliegel Lauf Tout

mikekgfb commented 6 months ago

On Linux x86? Or macOS with ARM?

cc: @malfet @desertfire

metascroy commented 6 months ago

macOS with Arm

malfet commented 6 months ago

@metascroy can you confirm that the same works fine without AOTI?

malfet commented 6 months ago

Ok, you don't need to go all the way to llama2, even stories15M generate gibberish with this combination of quantization parameters:

% python3 torchchat.py generate --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}'  --checkpoint-path checkpoints/stories15M/model.pth 

Hello, my name is and could theon. the dropped. the.., man his the there the, sound a the where it somethingtime such that happen something she a’ all time,'', was what time. together howd' a a luck open. could’den it'’ to to so filledgle'' could open hadets buets-enets and to smiled. wasenets backets kept hadets she in apartets,est used was’ coulders wasrell-est. poets andets, lidersets was times. toieling was times. toiealing itastic back every, need the. toieling if was again have outside', box."ling coulders freeets to. wasets to on, must. was toieling if has in the. to, sc the. before if outside. when remembered the remembered the. then thatets loose. before., need the. before. before. before. ran if.

Same command on x86 system:

Hello, my name is Laura. One day,aring an adorable little girl, Laura was playing outside. She was having so much fun! But then, it started to rain. Laura was so sad because she didn't have an umbrella and she wanted to go home.
Suddenly, Laura heard a voice. "Don't worry, today I can help!" It was a kind grumpy neighbor who had seen Laura's tears in the rain. She smiled and waved at her neighbor. 
"Here, take my umbrella! It can help me get away from the rain," said Laura, offering it to her neighbor. 
The neighbor smiled and thanked Laura. He picked up the umbrella and went off to find some other way home. 
The rain stopped, but Laura was happy to have helped. She waved goodbye to her neighbor and went back inside. 
It turned out to be an even better day after
malfet commented 6 months ago

Looks like things get better once one drops the "embedding" : {"bitwidth": 8, "groupsize": 0} part of the quantization schema

metascroy commented 6 months ago

@malfet I haven't tried without AOTI. I came across this while doing https://github.com/pytorch/torchchat/issues/434. Things also get better when you drop BF16 and keep "embedding" : {"bitwidth": 8, "groupsize": 0}, so it appears to be those two things together.

metascroy commented 6 months ago

It works in eager mode on M1 mac:

(cchat2) scroy@scroy-mbp torchchat % python torchchat.py chat llama2 --dtype bf16 --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}'
usage: torchchat [-h]

Welcome to the torchchat CLI!

options:
  -h, --help  show this help message and exit
Using device=cpu Apple M1 Pro
Loading model ...
name 7B
Time to load model: 14.80 seconds
Quantizing the model with: {"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}
Time to quantize model: 31.53 seconds
Entering Chat Mode. Will continue chatting back and forth with the language model until the models max context length of 2048 tokens is hit or until the user says /bye
Do you want to enter a system prompt? Enter y for yes and anything else for no. 
y
What is your system prompt? 
Pretend we're in wonderland.
What is your prompt? 
Isn't that strange.
 Oh my goodness, oh me! *adjusts top hat* Are you trying to strip the stripes off of my dear friend the Cheshire Cat? *pauses and looks around* Hmm, I suppose you mean
metascroy commented 6 months ago

@malfet I also see bad results with FP16. Added repro command for that in step 3.

malfet commented 6 months ago

@metascroy can you please share PyTorch version you are using?

metascroy commented 6 months ago

@malfet I'm using version '2.4.0.dev20240422'. Let me try reinstalling on a new conda environment and see if things resolve.

malfet commented 6 months ago

Ok, it looks to me like a problem with the tokenizer (in aoti_run) rather than with the quantization algorithms, as I can get sane output using the same DSO with generate, but get garbage with aoti_run:

% python3 torchchat.py export stories110M --dtype fp16 --quantize '{"embedding" : {"bitwidth": 8, "groupsize": 0}, "linear:int4" : {"groupsize": 32}}' --output-dso-path ./model.so 
% ./cmake-out/aoti_run ./model.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -i "Once upon a time"             
Failed to load ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin into a Tiktoken tokenizer. Trying sentencepiece tokenizer..
Once upon a time for Exper Exper Exper Exper introduce Red Exper Exper introduce Exper introduce given introduce given introduce introduce given introduce Red introduce Red introduce Red introduce Red introduce Exper Exper introduce Red introduce Exper introduce Red introduce Red introduce Exper introduce Red Exper Exper introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red Exper introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red Exper org org org org org org org org org org org org org org org org org org org org org org org org org org org org org Exper introduce given given given introduce given givenity introduce Red introduce Red Experound introduce Red introduce Experound introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red Experound Experound introduce givenity introduce givenity introduce givenity introduce givenity introduce Red introduce Red introduce Red introduce Red introduce Red introduce Red introduce
achieved tok/s: 36.669543

vs

% python3 torchchat.py generate  --checkpoint-path .model-artifacts/meta-llama/Llama-2-7b-chat-hf/model.pth --dso-path ./model.so
usage: torchchat [-h]

Welcome to the torchchat CLI!

optional arguments:
  -h, --help  show this help message and exit
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
/Users/nshulga/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
Using device=cpu Apple M1 Pro
Loading model...
known configs: ['13B', '70B', 'CodeLlama-7b-Python-hf', '34B', 'stories42M', '30B', 'stories110M', '7B', 'stories15M', 'Mistral-7B', 'Meta-Llama-3-8B']
Time to load model: 13.47 seconds
Hello, my name is Mommy!" It said.
“Mommy, what are you doing?” asked the little girl.
“I’m working on a cake,” replied Mommy. “It’s going to be so yummy! I’m going to use my scissors to cut the pieces.”
“Mommy, can I help?” asked the little girl.
“Yes, sweetheart. Come here and help me. I need you to hold the scissors for me.”
The little girl held the scissors for her mommy, who was cutting the cake. But when she looked at the cake, she saw something strange. It was covered in something that smelled sour.
“Mommy, what is that?” asked the little girl.
“It’s a sour cake,” said Mommy. “I need to add something sweet to it. Can you help me?”
The
Max Sequence Length Reached. Ending Conversation.
==========
Average tokens/sec: 25.76
Memory used: 0.00 GB
metascroy commented 5 months ago

Thanks! I probably should have realized this (face palm). It looks like an issue with fp16/bf16, not quantization per se (see repros below). Perhaps this line in run.cpp is an issue?

float* logits = forward(transformer, token, pos);

We might have a bad conversion going on, since bf16/fp16 are not native C types.

Repros:

python torchchat.py export stories15M --dtype bf16 --output-dso-path ./model_bf16.so
python torchchat.py export stories15M --dtype fp16 --output-dso-path ./model_fp16.so
python torchchat.py export stories15M --dtype fp32 --output-dso-path ./model_fp32.so

FP32:

./cmake-out/aoti_run ./model_fp32.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -t 0 -i "Once upon a time"
Failed to load ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin into a Tiktoken tokenizer. Trying sentencepiece tokenizer..
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red ball in the sky. It was the sun! She thought it was so pretty.
Lily wanted to play with the ball, but it was too high up in the sky. She tried to jump and reach it, but she couldn't. Then, she had an idea. She would use a stick to knock the ball down.
Lily found a stick and tried to hit the ball. But the stick was too short. She tried again and again, but she couldn't reach it. She felt sad.
Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily told him about the ball. The man smiled and said, "I have a useful idea!" He took out a long stick and used it to knock the ball down. Lily was so happy! She thanked the man and they played together in the sunshine.
achieved tok/s: 74.435837

FP16:

scroy@scroy-mbp torchchat % ./cmake-out/aoti_run ./model_fp16.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -t 0 -i "Once upon a time"
Failed to load ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin into a Tiktoken tokenizer. Trying sentencepiece tokenizer..
Once upon a time for EntityConCon on introduce introduceow introduce introducestdstd Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity Entity
achieved tok/s: 111.989460

BF16:

torchchat % ./cmake-out/aoti_run ./model_bf16.so -z ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin -t 0 -i "Once upon a time"
Failed to load ./.model-artifacts/meta-llama/Llama-2-7b-chat-hf/tokenizer.bin into a Tiktoken tokenizer. Trying sentencepiece tokenizer..
Once upon a time for Trackiegel Track introduce Track Track amor introduceiegel weitere Track introduce amorstdstdstd Entity Track Track introduce amorstdiegel Entity Entity Entity amoriegel amor Entity Entityiegel Entity Entity Entity Entity Entity Entity Entityiegel Trackкипеди Entity Entityiegel Entity Entity Entity Entity Entity amor Entity Trackкипеди Entity Trackкипеди amor amor Entity Entityiegel amor amoriegel Entity Track Trackiegel amor Track introduce Trackiegeliegel Entityiegel Entity Entity Entityiegel Entity Entity Entity Entity Entityiegel Entity amor Entity Entityiegel Entityiegel Track serкипеди Entity Entity Entity Entity Track Track go forelve introduce Track amor Track Track introduce Entity Entity Entity Entity Entityiegeliegeliegeliegel Entity Track Track go for amorкипеди amor Entityiegel Entity Entity Entity Track amor Entity Trackiegel Track go for that amor Entity Entity Entity Entity Entity Entity Track Track go for that Besch Entity Entity Entity Entity Track amoriegel Entity Entity Entity Entity Entity Entityiegel Entity Entity Entityiegel Entityiegel Entity Entity Entity Entity Entity Entity Entity Entityiegeliegel Entity Entity amor Trackiegel Entityiegel Entityiegel Entity Entity Entity Entity Entity amor Trackiegel Entity Entity Entity Entity Entity Entity amor Entity Entity amor amor Entity Entity Entity Entity Entityiegel Entity Entity amor Entity Entity Track Track go amoriegel amor Track go for
achieved tok/s: 66.130705
metascroy commented 5 months ago

Here's a PR: https://github.com/pytorch/torchchat/pull/492/

ianbarber commented 5 months ago

Is this one closed out by the referenced PR?

metascroy commented 5 months ago

@ianbarber it should be, but I need to go through the flow again to double-check. I'll do that sometime today and close if things look good.

mikekgfb commented 5 months ago

@metascroy please close this issue once you have ensured that https://github.com/pytorch/torchchat/pull/492 is the proper fix.

metascroy commented 5 months ago

I'm unable to verify the fix works due to https://github.com/pytorch/torchchat/issues/509.

metascroy commented 5 months ago

Confirmed this is fixed.