paperswithcode / galai

Model API for GALACTICA
Apache License 2.0

Model Failing when using HuggingFace pipeline #53

Closed · saptarshi059 closed 1 year ago

saptarshi059 commented 1 year ago

Hi,

So I've been trying to run basic text-generation inference with this model using HuggingFace's pipeline API. However, it keeps crashing when generating sequences with max_new_tokens=10000:

```python
from transformers import pipeline

generator = pipeline('text-generation', model='facebook/galactica-125m', device=0)
generator('covid-19', renormalize_logits=True, do_sample=True,
          max_new_tokens=10000)[0]['generated_text']
```

I updated my transformers and torch libraries. CUDA = 11.7 | torch = 1.14.0 (nightly; stable was also not working) | transformers = 4.25.1

GPUs = NVIDIA A100

Error:

```
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
```

mkardas commented 1 year ago

Hi @saptarshi059, all the models were trained with a context window of 2048 tokens. Does the above code work if you set max_new_tokens=100?
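For reference, a minimal sketch of that check, reusing the model and sampling flags from the snippet above with the suggested max_new_tokens=100:

```python
from transformers import pipeline

# GALACTICA was trained with a 2048-token context window, so the prompt
# length plus max_new_tokens must stay within that limit.
generator = pipeline('text-generation', model='facebook/galactica-125m', device=0)
output = generator('covid-19', renormalize_logits=True, do_sample=True,
                   max_new_tokens=100)
print(output[0]['generated_text'])
```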

cvinker commented 1 year ago

@mkardas This explains the errors I have been getting. I assume one is meant to program it to write longer outputs, perhaps by running it in a loop with a portion of the previous output as its prompt.

mkardas commented 1 year ago

Yes, this kind of moving-window approach should work, but it is not provided out of the box.
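A minimal sketch of such a moving-window loop, assuming the same pipeline as above; the chunk size, number of rounds, and character-based tail cut are illustrative choices, not part of galai:

```python
from transformers import pipeline

generator = pipeline('text-generation', model='facebook/galactica-125m', device=0)

prompt = 'covid-19'
full_text = prompt
for _ in range(5):  # five rounds of up to 500 new tokens each
    out = generator(prompt, renormalize_logits=True, do_sample=True,
                    max_new_tokens=500)[0]['generated_text']
    # generated_text includes the prompt, so append only the continuation.
    full_text += out[len(prompt):]
    # Carry over the tail of the output as the next prompt so that
    # prompt tokens + max_new_tokens stay within the 2048-token window.
    # (A character-based cut is crude; a token-based cut would be more precise.)
    prompt = out[-1000:]

print(full_text)
```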

saptarshi059 commented 1 year ago

@mkardas Oh I see. Thank you so much. Yes, it does work with max_new_tokens < 2048. I will try the moving window approach then.