huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Long inputs to Flan-T5/UL2 text generation with load_in_8bit=True outputs <pad> tokens repeatedly #21987

Closed · akkikiki closed this issue 1 year ago

akkikiki commented 1 year ago

System Info

Who can help?

@younesbelkada

Information

Tasks

Reproduction

When input texts are short, the generated texts look good. But when input texts are long, e.g. the following, the model repeatedly produces <pad> tokens.

Input

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2",  # same behaviour with "google/flan-t5-xxl"
    device_map="auto",
    load_in_8bit=True,
)
input_text = """Q: Answer the following yes/no question by reasoning step-by-step. Could a dandelion suffer from hepatitis?
A: Hepatitis only affects organisms with livers. Dandelions don’t have a liver. The answer is no.
Q: Answer the following yes/no question by reasoning step-by-step. Can you write a whole Haiku in a single tweet?
A: """

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids, max_length=100)
print(tokenizer.decode(outputs[0]))

Output:

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
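
Side note: this collapse can be caught programmatically. Decoding with skip_special_tokens=True strips <pad> and </s>, so an empty string means the model generated nothing but special tokens (a small sketch reusing tokenizer and outputs from the snippet above):

# Empty decoded text means the generation was only special tokens.
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
if not decoded.strip():
    print("generation collapsed to special tokens only (e.g. <pad>)")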

Expected behavior

For reference, this is the result when the model is loaded with load_in_8bit=False:

<pad> A Haiku is a Japanese poetry form that uses a 5-7-5 syllable structure. A typical tweet is limited to 140 characters. The answer is no.</s>

younesbelkada commented 1 year ago

Thanks a lot for the issue @akkikiki ! What is the hardware you are using + bnb version?

akkikiki commented 1 year ago

Thanks a lot for the reply! The hardware is 8 V100 (16GB) GPUs and the bnb version is 0.37.0.

younesbelkada commented 1 year ago

Sadly, I think there is indeed an issue with V100s right now, as stated by @TimDettmers here: https://github.com/huggingface/transformers/pull/21955#issuecomment-1455235281. It should be fixed soon, and as that comment also notes, more universal methods (covering most GPU hardware) should be published soon!
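
For anyone checking whether their own hardware is affected: the fast int8 matmul path targets architectures newer than the V100 (compute capability 7.0; int8 tensor cores arrived with 7.5). A quick check with plain PyTorch (a sketch, not from this thread):

import torch

# Print each visible GPU's name and compute capability.
# V100 reports 7.0, which predates the int8 tensor cores used by the fast path.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (sm_{major}{minor})")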

akkikiki commented 1 year ago

Thanks @younesbelkada! Interesting, so a smart workaround is needed for GPUs without hardware-level int8 support.

FYI, I played around with BitsAndBytesConfig, and it seems that quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0) resolved the issue.
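
For completeness, a minimal sketch of how the config can be passed (assuming the same flan-ul2 setup as above; device_map="auto" is just an example, not copied from this thread):

from transformers import T5ForConditionalGeneration, BitsAndBytesConfig

# Lowering llm_int8_threshold from its default of 6.0 treats more activation
# outlier dimensions as fp16 instead of int8, which avoids the degenerate
# <pad> outputs on V100.
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=5.0)
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2",
    device_map="auto",
    quantization_config=quantization_config,
)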

Output result with quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0):

<pad> A Haiku is a Japanese poetry form that uses a 5-7-5 syllable structure. A typical tweet is limited to 140 characters. The answer is no.</s>

Will just close this thread for now. Thanks again for the heads-up on the V100 issue!

younesbelkada commented 1 year ago

This is great! Thanks for the advice! Would you mind posting it in #21955 so that people can be aware of this hack πŸ™ ?

akkikiki commented 1 year ago

> This is great! Thanks for the advice! Would you mind posting it in #21955 so that people can be aware of this hack πŸ™ ?

Will do!

younesbelkada commented 1 year ago

Thanks a lot @akkikiki ! Much appreciated!