Inference API: Error with GPU inference

nbravulapalli commented 2 years ago

Who can help

@LysandreJik @patil-suraj

To reproduce

import requests
import json

headers = {"Authorization": f"Bearer {MY_BEARER_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/gpt2"

data = json.dumps({"inputs": INPUT_TEXT, "parameters":{"num_return_sequences":NUM_SEQUENCES,  "max_length":MAX_LENGTH},"options": {"wait_for_model": True, "use_cache": False, "use_gpu":True}})
response = requests.request("POST", API_URL, headers=headers, data=data)
print(json.loads(response.content.decode("utf-8")))

Expected behavior

For more than a month, I have used the above code snippet to retrieve text completions from gpt2 using huggingface's inference API. However, when the same snippet ran again today, the inference API gave the following response:

Error Message 1:

CUDA error: all CUDA-capable devices are busy or unavailable\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1."}

This persisted for roughly half an hour, and after that time, the API would only allow me to make API requests with very few tokens of text in the INPUT_TEXT variable. All normal-sized requests gave the following error:

Error Message 2:

{'error': 'CUDA out of memory, try a smaller payload', 'warnings': ['Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.']}

Keep in mind that when I get the above error, it is with the same arguments (the value for INPUT_TEXT, NUM_SEQUENCES, and MAX_LENGTH) that I have been using this API with for more than a month. I have checked my account, and the inference API dashboard shows that I am still within the free quota provided by huggingface (my subscription plan is the "Lab: pay as you go" option). Can you please help me resolve this?

Sample argument that causes an error: INPUT_TEXT = 'Hippocrates, another ancient Greek, established a medical school, wrote many medical treatises, and is— because of Hippocrates, another ancient Greek,' NUM_SEQUENCES = 7 MAX_LENGTH = 105

Edit

It appears that the API response message is varying between Error Messages 1 and 2 (originally it was 1, then 2, and now 1 again).

nbravulapalli commented 2 years ago

Hi!

Do have an update on this issue? Thank you for all the support.

@LysandreJik @patil-suraj

LysandreJik commented 2 years ago

cc @Narsil

Narsil commented 2 years ago

Hi @nbravulapalli ,

There indeed seemed to have been an issue with that model. It should be back up again.

We are actively tracking those issues to reduce them to a minimum, but sometimes there is indeed a memory error depending on what other models are being used at the same time.

Sorry about the issue you were seeing.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

AnanSarah commented 2 years ago

Hi, I'm facing this same error on facebook/bart-large-mnli when trying to use GPU-accelerated inference.

I am using this model for text classification and passing 10 candidate labels. When using GPU-Accelerated Inference I am getting error 400 Bad request { "error": "CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1." }

Could anyone point me to why this is the case? Thanks.

@Narsil