huggingface / huggingface_hub

The official Python client for the Hugging Face Hub.
https://huggingface.co/docs/huggingface_hub
Apache License 2.0

HfApi().create_inference_endpoint errors and does not follow documentation #2277

Closed: SimKennedy closed this issue 3 weeks ago

SimKennedy commented 1 month ago

Describe the bug

Using HfApi.create_inference_endpoint with the configuration shown in the documentation raises an error.

https://huggingface.co/docs/huggingface_hub/en/package_reference/hf_api#huggingface_hub.HfApi.create_inference_endpoint.task

The instance types also differ between what the provider API returns and what the pricing docs list: https://api.endpoints.huggingface.cloud/v2/provider vs. https://huggingface.co/docs/inference-endpoints/en/pricing

The terminology in the vendor list does not match the API, e.g. instanceSize "small" != "x1". The names the API actually accepts can be checked directly against the provider endpoint, as in the sketch below.
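
A minimal sketch of that check, assuming only that the provider endpoint serves JSON and that the requests library is available (neither this script nor any field names are part of the original report):

import json

import requests

# Fetch the provider list linked above and pretty-print it so the accepted
# vendor / instanceType / instanceSize names can be inspected by eye.
# The response schema is undocumented in this issue, so we dump the raw JSON
# rather than assuming any field names.
resp = requests.get("https://api.endpoints.huggingface.cloud/v2/provider")
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))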

Reproduction

from huggingface_hub import HfApi
api = HfApi()
api.create_inference_endpoint(
    "my-endpoint-name",
    repository="gpt2",
    framework="pytorch",
    task="text-generation",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="c6i",
)

Bad request:
400: Instance compute 'Cpu' - 'c6i' - 'medium' in 'aws' - 'us-east-1' not found

Logs

Traceback (most recent call last):
  File "/home/sim/.cache/pypoetry/virtualenvs/env-Nnk0OfKl-py3.10/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 358, in hf_raise_for_status
    raise BadRequestError(message, response=response) from e
huggingface_hub.utils._errors.BadRequestError:  (Request ID: cX-WLr)

Bad request:
400: Instance compute 'Cpu' - 'c6i' - 'medium' in 'aws' - 'us-east-1' not found
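
For debugging, the server-side validation message can be isolated from the traceback by catching the error; a minimal sketch, assuming BadRequestError is importable from huggingface_hub.utils as the traceback above indicates:

from huggingface_hub import HfApi
from huggingface_hub.utils import BadRequestError

api = HfApi()
try:
    api.create_inference_endpoint(
        "my-endpoint-name",
        repository="gpt2",
        framework="pytorch",
        task="text-generation",
        accelerator="cpu",
        vendor="aws",
        region="us-east-1",
        type="protected",
        instance_size="medium",
        instance_type="c6i",
    )
except BadRequestError as err:
    # The 400 body names the exact compute tuple the server rejected, e.g.
    # "Instance compute 'Cpu' - 'c6i' - 'medium' in 'aws' - 'us-east-1' not found"
    print(err)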

System info

- huggingface_hub version: 0.23.0
- Platform: Linux-6.8.0-76060800daily20240311-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /home/sim/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: simkennedy
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.1
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: 2.0.0
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.23.5
- pydantic: 2.6.4
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/sim/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/sim/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/sim/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

juliensimon commented 1 month ago

I see this too with 0.23.0. Copying and pasting the example in https://huggingface.co/blog/tgi-messages-api doesn't work.

Bad request:
400: Instance compute 'Gpu' - 'p4de' - '2xlarge' in 'aws' - 'us-east-1' not found

I tried GCP too, same result. I used the parameters from the curl call shown on the Inference Endpoints page; the curl command and the equivalent Python call are below.

curl https://api.endpoints.huggingface.cloud/v2/endpoint/juliensimon \
-X POST \
-d '{"compute":{"accelerator":"gpu","instanceSize":"x4","instanceType":"nvidia-l4","scaling":{"maxReplica":1,"minReplica":1}},"model":{"framework":"pytorch","image":{"custom":{"health_route":"/health","env":{"MAX_BATCH_PREFILL_TOKENS":"2048","MAX_INPUT_LENGTH":"1024","MAX_TOTAL_TOKENS":"1512","MODEL_ID":"/repository"},"url":"ghcr.io/huggingface/text-generation-inference:2.0.2"}},"repository":"meta-llama/Meta-Llama-3-8B-Instruct","task":"text-generation"},"name":"meta-llama-3-8b-instruct-plc","provider":{"region":"us-east4","vendor":"gcp"},"type":"protected"}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer XXXXX"

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    name="llama3-8b-julien-demo",
    repository="meta-llama/Meta-Llama-3-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="google",
    region="us-east4",
    type="protected",
    instance_type="nvidia-l4",
    instance_size="x4",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "1024",
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_TOTAL_TOKENS": "1512",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0.2", # use this build or newer
    },
)

Bad request:
400: Instance compute 'Gpu' - 'l4' - 'x4' in 'google' - 'us-east4' not found

@philschmid any idea?

philschmid commented 1 month ago

The naming was adjusted. Pinging @co42 here.

philschmid commented 1 month ago

The naming here should be correct: https://huggingface.co/docs/inference-endpoints/pricing

Can you try intel-icl and x4?
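
Applied to the original reproduction, that suggestion would look like the following untested sketch (it assumes intel-icl / x4 is a valid CPU compute for aws / us-east-1, per the pricing page above):

from huggingface_hub import HfApi

api = HfApi()
api.create_inference_endpoint(
    "my-endpoint-name",
    repository="gpt2",
    framework="pytorch",
    task="text-generation",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_type="intel-icl",  # pricing-page name, not the AWS family name "c6i"
    instance_size="x4",         # sizes are "x1", "x2", "x4", ..., not "small"/"medium"
)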

juliensimon commented 1 month ago

The Google example works. My mistake was the vendor name: "gcp", not "google" :)
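
For completeness, a sketch of the earlier GCP call with only the vendor name corrected (everything else unchanged from the comment above):

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    name="llama3-8b-julien-demo",
    repository="meta-llama/Meta-Llama-3-8B-Instruct",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="gcp",  # was "google"; the API expects "gcp"
    region="us-east4",
    type="protected",
    instance_type="nvidia-l4",
    instance_size="x4",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_INPUT_LENGTH": "1024",
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_TOTAL_TOKENS": "1512",
            "MODEL_ID": "/repository",
        },
        "url": "ghcr.io/huggingface/text-generation-inference:2.0.2",
    },
)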

The blog post example works when changed to:

    instance_type="nvidia-a100",
    instance_size="x2",

juliensimon commented 1 month ago

https://github.com/huggingface/blog/pull/2073

Wauplin commented 1 month ago

Thanks everyone for reporting/fixing this! Just to be sure, is there still something to fix on huggingface_hub's doc side or all good now?

Wauplin commented 3 weeks ago

Closing it now :) Just let me know if something remains unclear.