huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Long inputs to Flan-T5/UL2 text generation with load_in_8bit=True outputs <pad> tokens repeatedly #21987

Closed · akkikiki closed this issue 1 year ago

akkikiki commented 1 year ago

System Info

Who can help?

@younesbelkada

Information

Tasks

Reproduction

When input texts are short, the generated texts look good. But when input texts are long, e.g. the following, the model repeatedly produces <pad> tokens.

Input

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2",  # same behaviour with "google/flan-t5-xxl"
    device_map="auto",
    load_in_8bit=True,
)
input_text = """Q: Answer the following yes/no question by reasoning step-by-step. Could a dandelion suffer from hepatitis?
A: Hepatitis only affects organisms with livers. Dandelions don’t have a liver. The answer is no.
Q: Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet?
A: """

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Output:

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
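
Side note: this collapse can be caught programmatically. Decoding with skip_special_tokens=True strips <pad> and </s>, so an empty string means the model generated nothing but special tokens (a small sketch reusing tokenizer and outputs from the snippet above):

# Empty decoded text means the generation was only special tokens.
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
if not decoded.strip():
    print("generation collapsed to special tokens only (e.g. <pad>)")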

Expected behavior

For reference, this is the result when the model is loaded with load_in_8bit=False:

<pad> A Haiku is a Japanese poetry form that uses a 5-7-5 syllable structure. A typical tweet is limited to 140 characters. The answer is no.</s>

younesbelkada commented 1 year ago

Thanks a lot for the issue @akkikiki ! What is the hardware you are using + bnb version?

akkikiki commented 1 year ago

Thanks a lot for the reply! The hardware is 8 V100 (16GB) GPUs and the bnb version is 0.37.0.

younesbelkada commented 1 year ago

Sadly, I think there is indeed an issue with V100s right now, as stated by @TimDettmers here: https://github.com/huggingface/transformers/pull/21955#issuecomment-1455235281. It should be fixed soon, and as that comment also notes, more universal methods (covering most GPU hardware) should be published soon!
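
For anyone checking whether their own hardware is affected: the fast int8 matmul path targets architectures newer than the V100 (compute capability 7.0; int8 tensor cores arrived with 7.5). A quick check with plain PyTorch (a sketch, not from this thread):

import torch

# Print each visible GPU's name and compute capability.
# V100 reports 7.0, which predates the int8 tensor cores used by the fast path.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor})")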

akkikiki commented 1 year ago

Thanks @younesbelkada! Interesting, so a smart workaround is needed for GPUs without hardware-level int8 support.

FYI, I played around with BitsAndBytesConfig, and it seems that quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0) resolved the issue.
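
For completeness, a minimal sketch of how the config can be passed (assuming the same flan-ul2 setup as above; device_map="auto" is just an example, not copied from this thread):

from transformers import T5ForConditionalGeneration, BitsAndBytesConfig

# Lowering llm_int8_threshold from its default of 6.0 treats more activation
# outlier dimensions as fp16 instead of int8, which avoids the degenerate
# <pad> outputs on V100.
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.0)
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2",
    device_map="auto",
    quantization_config=quantization_config,
)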

Output result with quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0):

<pad> A Haiku is a Japanese poetry form that uses a 5-7-5 syllable structure. A typical tweet is limited to 140 characters. The answer is no.</s>

Will just close this thread for now. Thanks again for the heads-up on the V100 issue!

younesbelkada commented 1 year ago

This is great! Thanks for the advice! Would you mind posting it in #21955 so that people can be aware of this hack πŸ™ ?

akkikiki commented 1 year ago

> This is great! Thanks for the advice! Would you mind posting it in #21955 so that people can be aware of this hack πŸ™ ?

Will do!

younesbelkada commented 1 year ago

Thanks a lot @akkikiki ! Much appreciated!