triton-inference-server / fastertransformer_backend

Flan-T5 quality decreases with bigger models when using fastertransformer #95

Open lakshaykc opened 1 year ago

lakshaykc commented 1 year ago

Description

Branch: main
Docker Version: 20.10.21
GPU Type: A100 40GB
Triton Docker Image: triton_with_ft:22.12

Reproduced Steps

I'm following the instructions by @byshiue to test Flan-T5 with FasterTransformer from here.

Please try the following scripts on latest main branch. You don't need to do any modification on converter.

sudo apt-get install git-lfs
git lfs install
git lfs clone https://huggingface.co/google/flan-t5-small

python3 ./build/_deps/repo-ft-src/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -saved_dir flan-t5-small/c-models \
        -in_file flan-t5-small/ \
        -inference_tensor_para_size 1 \
        -weight_data_type fp32

The Issue Is Described Below

As I increase the model size (flan-t5-small, flan-t5-base, flan-t5-large, flan-t5-xl, flan-t5-xxl), the quality of the summarization drops as measured by ROUGE scores, especially for flan-t5-xl and flan-t5-xxl. In fact, the flan-t5-xxl output is meaningless. Any ideas why this could be happening?

Below are the outputs for each of the models. I've omitted the article itself in the interest of space.

flan-t5-small

---------------------------------------------------------
FT Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his role of bumbling sheriff Rosco P. Coltrane, died on Monday. He died in hospice in Hickory, North Carolina, of complications from pneumonia. He'd been a busy actor for decades in theater and in Hollywood.
---------------------------------------------------------
---------------------------------------------------------
HF Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his role on "The Dukes of Hazzard," died Monday. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia. Best's "Hazzard" co-stars paid tribute to the late actor.
---------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:17<00:00,  1.21it/s]
Hugging Face (total latency: 15.558652 sec)
rouge1 : 34.14350291383183
rouge2 : 13.62563870995546
rougeL : 22.388296823799017
rougeLsum : 29.581662314674862

Faster Transformers (total latency: 1.7424339999999998 sec)
rouge1 : 32.59575317449809
rouge2 : 11.908051740809197
rougeL : 22.60170632237849
rougeLsum : 28.20323792721303

flan-t5-base

---------------------------------------------------------
FT Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his role as bumbling Sheriff Rosco P. Coltrane, dies. Best was born Jewel Guy in Powderly, Kentucky. Best was orphaned at 3 and raised in rural Indiana.
---------------------------------------------------------
---------------------------------------------------------
HF Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his role as sheriff Rosco P. Coltrane, died Monday. Best died in hospice in Hickory, North Carolina, of complications from pneumonia. Best was born Jewel Guy on July 26, 1926, in Powderly, Kentucky.
---------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:22<00:00,  1.07s/it]
Hugging Face (total latency: 19.767059 sec)
rouge1 : 33.91455957776033
rouge2 : 13.909255716762106
rougeL : 24.246276707609432
rougeLsum : 29.344668270359055

Faster Transformers (total latency: 2.7167290000000004 sec)
rouge1 : 32.13985165509422
rouge2 : 9.894492905898963
rougeL : 20.389183239096653
rougeLsum : 26.219126505337087

flan-t5-large

---------------------------------------------------------
FT Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his portrayal of the bumbling sheriff on TV's "Hazzard," died in Hickory, North Carolina.</s>.
---------------------------------------------------------
---------------------------------------------------------
HF Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness.</s>.
---------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:41<00:00,  1.96s/it]
Hugging Face (total latency: 36.387930999999995 sec)
rouge1 : 32.34187001706099
rouge2 : 12.482352524710635
rougeL : 22.125700262630282
rougeLsum : 26.983349878065727

Faster Transformers (total latency: 4.8431939999999996 sec)
rouge1 : 30.195375053794983
rouge2 : 9.750832497850764
rougeL : 20.317451247121248
rougeLsum : 25.48339311225225

flan-t5-xl

---------------------------------------------------------
FT Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  "I'm going to miss him," said his wife.</s>.
---------------------------------------------------------
---------------------------------------------------------
HF Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness.</s>.
---------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:59<00:00,  2.82s/it]
Hugging Face (total latency: 54.09382699999999 sec)
rouge1 : 37.15151060773974
rouge2 : 16.028054056196453
rougeL : 25.409042296934043
rougeLsum : 31.530806378745208

Faster Transformers (total latency: 5.097423000000001 sec)
rouge1 : 20.467612899202297
rouge2 : 6.373011622078853
rougeL : 14.748323218023184
rougeLsum : 18.368823552336195

flan-t5-xxl

---------------------------------------------------------
FT Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  Theline,tearsoldedandasd/lds/dasandsold"</s>.
---------------------------------------------------------
---------------------------------------------------------
HF Generated : 
 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness. He was 88.</s>.
---------------------------------------------------------
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [01:03<00:00,  3.02s/it]
Hugging Face (total latency: 63.444942 sec)
rouge1 : 39.61532060464983
rouge2 : 17.064800268920198
rougeL : 26.404488769902397
rougeLsum : 32.813008599688764

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:21<00:00,  1.04s/it]
Faster Transformers (total latency: 21.789198000000006 sec)
rouge1 : 5.380935735037773
rouge2 : 0.9247648902821317
rougeL : 4.753279997691197
rougeLsum : 4.84140401761153
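
To put the gap in one number per model, the sketch below divides the FT rougeLsum by the HF rougeLsum, using the values reported above rounded to two decimals. The dictionary layout and the function name are illustrative only.

```python
# rougeLsum values copied from the runs above, rounded to two decimals.
ROUGE_LSUM = {
    "flan-t5-small": {"hf": 29.58, "ft": 28.20},
    "flan-t5-base": {"hf": 29.34, "ft": 26.22},
    "flan-t5-large": {"hf": 26.98, "ft": 25.48},
    "flan-t5-xl": {"hf": 31.53, "ft": 18.37},
    "flan-t5-xxl": {"hf": 32.81, "ft": 4.84},
}


def ft_to_hf_ratio(scores):
    """Return the FT rougeLsum as a fraction of the HF rougeLsum, per model."""
    return {name: s["ft"] / s["hf"] for name, s in scores.items()}


RATIOS = ft_to_hf_ratio(ROUGE_LSUM)

for name, ratio in RATIOS.items():
    print(f"{name:14s} FT/HF rougeLsum = {ratio:.2f}")
```

The ratio stays near 0.9 or above through flan-t5-large, then falls to roughly 0.58 for flan-t5-xl and roughly 0.15 for flan-t5-xxl.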

lakshaykc commented 1 year ago

Running with fp32 works as expected and there is no quality drop.

byshiue commented 1 year ago

How about using bf16?

lakshaykc commented 1 year ago

bf16 works, but it is slower than fp32.

zoltan-fedor commented 1 year ago

I have observed the same.

flan-t5-xl model's accuracy is terrible with fp16, but it is good with bf16 and also with fp32.

But I have seen slightly better rougeLsum scores with bf16 than with fp32, and somewhat better speed too, so to me bf16 seems marginally better than fp32 (and bf16 obviously uses less GPU memory).

lakshaykc commented 1 year ago

That's interesting. bf16 was also slightly better than fp32 for me, but slower.

rhenry-nv commented 1 year ago

Hi, I took a look at this issue and it does not seem to be a bug in FT.

In FP32, the activations after FC2 become large (~120K) for some inputs. That is beyond the largest finite FP16 value (65504), so in FP16 FC2 overflows and produces NaNs, which hurts accuracy.

There seem to be other instances of this for flan-t5, like here and here.

Also, in my experiments (using the 22.09 container, at least), BF16 gives a speedup over FP32. I used an 80GB A100 PCIe.
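
The overflow can be reproduced with the Python standard library alone. The sketch below is only an emulation under stated assumptions: `to_fp16` and `to_bf16` are hypothetical helpers, `to_bf16` truncates instead of rounding to nearest, and 120000.0 stands in for the activation magnitude mentioned above. None of this is FT code.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision; overflow saturates to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the half-precision range; a GPU
        # conversion would yield infinity here, so emulate that.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Truncate x to bfloat16 by keeping the top 16 bits of its fp32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


activation = 120_000.0  # stand-in for the large FC2 activation
fp16_value = to_fp16(activation)
bf16_value = to_bf16(activation)

print(fp16_value)               # inf: beyond the FP16 maximum of 65504
print(fp16_value - fp16_value)  # nan: inf - inf is undefined
print(bf16_value)               # finite, with reduced precision
```

BF16 keeps the 8-bit exponent of FP32, so the same value stays finite there, which is consistent with BF16 and FP32 both giving good accuracy.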

Results & command for FP32:

python3 ../examples/pytorch/t5/summarization.py --ft_model_location /workspace/local_models/flan-t5-xl/c-models/ --test_ft --hf_model_location google/flan-t5-xl --cache_path ../models/datasets/opt_summarization/ccdv/ --max_ite 50 --data_type fp32

Faster Transformers (total latency: 35.916771 sec)
beam_id: 0
rouge1 : 36.282111048575196
rouge2 : 15.41837215888821
rougeL : 25.89882726414888
rougeLsum : 25.90857428917886

Results & command for BF16:

python3 ../examples/pytorch/t5/summarization.py --ft_model_location /workspace/local_models/flan-t5-xl/c-models/ --test_ft --hf_model_location google/flan-t5-xl --cache_path ../models/datasets/opt_summarization/ccdv/ --max_ite 50 --data_type bf16

Faster Transformers (total latency: 24.815519999999992 sec)
beam_id: 0
rouge1 : 35.99305593786568
rouge2 : 15.223795282570936
rougeL : 26.12692020357301
rougeLsum : 26.130648492436936

HF gives good results in FP16 because they clamp the outputs of self attn, cross attn and ffn to be within the FP16 range. We don't currently do this in FT.
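
As a rough illustration of that kind of clamping, here is a minimal sketch in plain Python. The helper name and the safety margin of 1000 below the FP16 maximum are assumptions made for this example; they are not taken from FT and have not been checked against the HF source.

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def clamp_to_fp16_range(values, margin=1000.0):
    """Clamp each value into [-(FP16_MAX - margin), FP16_MAX - margin].

    The margin is an illustrative assumption: it leaves headroom so that a
    later addition is less likely to overflow again.
    """
    limit = FP16_MAX - margin
    return [max(-limit, min(limit, v)) for v in values]


hidden = [120_000.0, -120_000.0, 3.5]  # made-up activations
clamped = clamp_to_fp16_range(hidden)
print(clamped)
```

Clamping keeps the values finite, but it also changes them, so outputs can still differ from an FP32 run.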

byshiue commented 1 year ago

@lakshaykc did you really test HF in FP16? I tried HF flan-t5-xl in FP32 and FP16 and observed accuracies similar to FT's.

HF FP32:

bhsueh@9cc4f0c2782c:/home/scratch.bhsueh_sw/FasterTransformer_new/build$ python3 ../examples/pytorch/t5/summarization.py --ft_model_location /home/scratch.bhsueh_gpu_2/models/flan-t5/flan-t5-xl/c-models --hf_model_location /home/scratch.bhsueh_gpu_2/models/flan-t5/flan-t5-xl/ --test_hf --cache_path /data/gpt_dataset/ccdv/ --data_type fp32 
Reusing dataset cnn_dailymail (/data/gpt_dataset/ccdv/ccdv___cnn_dailymail/3.0.0/3.0.0/0107f7388b5c6fae455a5661bcd134fc22da53ea75852027040d8d1e997f101f)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 237.81it/s]
[INFO] load HF model spend 39.482971 sec
Token indices sequence length is longer than the specified maximum sequence length for this model (744 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------
HF Generated : 
 Article :  (CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he'd been a busy actor for decades in theater and in Hollywood, Best didn't become famous until 1979, when "The Dukes of Hazzard's" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff 'em and stuff 'em!" upon making an arrest. Among the most popular shows on TV in the early '80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best's "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life's many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. 
Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent 'Return of the Killer Shrews,' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we've lost in 2015 . CNN's Stella Chan contributed to this story.

 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  ['James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV\'s "The Dukes of Hazzard," died Monday after a brief illness..']
---------------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:41<00:00,  1.96s/it]
Hugging Face (total latency: 41.06286600000001 sec)
beam_id: 0
rouge1 : 35.156759711225185
rouge2 : 14.88218931836181
rougeL : 24.515717871560494
rougeLsum : 24.731586345417877

HF FP16:

bhsueh@9cc4f0c2782c:/home/scratch.bhsueh_sw/FasterTransformer_new/build$ python3 ../examples/pytorch/t5/summarization.py --ft_model_location /home/scratch.bhsueh_gpu_2/models/flan-t5/flan-t5-xl/c-models --hf_model_location /home/scratch.bhsueh_gpu_2/models/flan-t5/flan-t5-xl/ --test_hf --cache_path /data/gpt_dataset/ccdv/ --data_type fp16 
Reusing dataset cnn_dailymail (/data/gpt_dataset/ccdv/ccdv___cnn_dailymail/3.0.0/3.0.0/0107f7388b5c6fae455a5661bcd134fc22da53ea75852027040d8d1e997f101f)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 248.45it/s]
[INFO] load HF model spend 70.423421 sec
Token indices sequence length is longer than the specified maximum sequence length for this model (744 > 512). Running this sequence through the model will result in indexing errors
---------------------------------------------------------
HF Generated : 
 Article :  (CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's "The Dukes of Hazzard," died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he'd been a busy actor for decades in theater and in Hollywood, Best didn't become famous until 1979, when "The Dukes of Hazzard's" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his "hot pursuit" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive "kew-kew-kew" chuckle and for goofy catchphrases such as "cuff 'em and stuff 'em!" upon making an arrest. Among the most popular shows on TV in the early '80s, "The Dukes of Hazzard" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best's "Hazzard" co-stars paid tribute to the late actor on social media. "I laughed and learned more from Jimmie in one hour than from anyone else in a whole year," co-star John Schneider, who played Bo Duke, said on Twitter. "Give Uncle Jesse my love when you see him dear friend." "Jimmy Best was the most constantly creative person I have ever known," said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. "Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life's many passions." Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. 
Best served in the Army during World War II before launching his acting career. In the 1950s and 1960s, he accumulated scores of credits, playing a range of colorful supporting characters in such TV shows as "The Twilight Zone," "Bonanza," "The Andy Griffith Show" and "Gunsmoke." He later appeared in a handful of Burt Reynolds' movies, including "Hooper" and "The End." But Best will always be best known for his "Hazzard" role, which lives on in reruns. "Jimmie was my teacher, mentor, close friend and collaborator for 26 years," Latshaw said. "I directed two of his feature films, including the recent 'Return of the Killer Shrews,' a sequel he co-wrote and was quite proud of as he had made the first one more than 50 years earlier." People we've lost in 2015 . CNN's Stella Chan contributed to this story.

 Highlights :  James Best, who played the sheriff on "The Dukes of Hazzard," died Monday at 88 .
"Hazzard" ran from 1979 to 1985 and was among the most popular shows on TV .

 Summary :  ['James Best, best known for his portrayal of bumbling Sheriff Rosco P. Coltrane on "The Dukes of Hazzard," died Monday at his home in Hickory, North Carolina, of complications from pneumonia..']
---------------------------------------------------------
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:26<00:00,  1.25s/it]
Hugging Face (total latency: 26.188205000000004 sec)
beam_id: 0
rouge1 : 21.601152079417847
rouge2 : 6.627345555624651
rougeL : 16.39804173113905
rougeLsum : 16.578730031985682

lakshaykc commented 1 year ago

Hey @byshiue, I'm really sorry I missed this message. It got buried in my emails. I just tested again and you are right, the HF fp16 models perform similarly to the FT fp16 models. Something in my original setup was probably misconfigured, so it ran fp16 for FT and fp32 for HF. You can close this issue.

baptistejamin commented 1 year ago

There is a very high chance it is related to this problem: https://github.com/huggingface/transformers/issues/20287#issuecomment-1342219429

The wo module needs to stay in fp32
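
One way to picture "keep wo in fp32" is as a per-weight dtype choice at conversion time. The sketch below is hypothetical: `storage_dtype` is not part of huggingface_t5_ckpt_convert.py, the weight names are assumed to follow the Hugging Face T5 checkpoint naming, and whether FT's kernels can consume mixed-precision weights is a separate question that this does not answer.

```python
def storage_dtype(weight_name, default="fp16", keep_fp32=("wo",)):
    """Pick a dtype for one weight.

    Returns "fp32" when any dotted path component of the name is listed in
    keep_fp32, otherwise the default. Matching whole components avoids
    accidental substring hits.
    """
    parts = weight_name.split(".")
    return "fp32" if any(p in keep_fp32 for p in parts) else default


# Assumed example names in the Hugging Face T5 naming scheme.
names = [
    "encoder.block.0.layer.1.DenseReluDense.wi_0.weight",
    "encoder.block.0.layer.1.DenseReluDense.wo.weight",
    "decoder.block.0.layer.2.DenseReluDense.wo.weight",
]
plan = {name: storage_dtype(name) for name in names}

for name, dtype in plan.items():
    print(f"{dtype}  {name}")
```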

lakshaykc commented 1 year ago

That would explain what I'm seeing.

@byshiue Would we need to update huggingface_t5_ckpt_convert.py to keep the wo module in fp32?