huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

nvidia/Llama3-ChatQA-1.5-70B failing to start #596

Open mariokostelac opened 1 month ago

mariokostelac commented 1 month ago

System Info

I've used the code suggested on https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B to run inference on AWS inferentia chips.

Specifically

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "nvidia/Llama3-ChatQA-1.5-70B",
    "HF_NUM_CORES": "24",  # number of Neuron cores to shard the model across
    "HF_BATCH_SIZE": "4",  # static batch size baked in at compilation time
    "HF_SEQUENCE_LENGTH": "4096",  # static sequence length baked in at compilation time
    "HF_AUTO_CAST_TYPE": "bf16",  # cast weights to bfloat16
    "MAX_BATCH_SIZE": "4",  # TGI router limit; should match HF_BATCH_SIZE
    "MAX_INPUT_LENGTH": "3686",  # must be strictly less than MAX_TOTAL_TOKENS
    "MAX_TOTAL_TOKENS": "4096",  # should match HF_SEQUENCE_LENGTH
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.21"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",  # 12 Inferentia2 chips, 24 Neuron cores
    container_startup_health_check_timeout=3600,  # leave time for model load/compilation
    volume_size=512,  # GB of storage for the model weights
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

I saw many warnings like:

2024-05-16T10:25:30.983525Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.15.2/src/tokenizer/serialization.rs:159: Warning: Token '<|end_of_text|>' was expected to have ID '128001' but was given ID 'None'

It failed to start with the following error:

 File "/usr/local/lib/python3.10/dist-packages/text_generation_server/interceptor.py", line 20, in intercept
    return await response
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 39, in Warmup
    max_tokens = self.generator.warmup(request.batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 336, in warmup
    self.prefill(batch)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 404, in prefill
    selector = TokenSelector.create(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/generation/token_selector.py", line 136, in create
    assert eos_token_id is not None and not isinstance(eos_token_id, list)
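For context, that assertion rejects an eos_token_id that is None or a list. A quick check with transformers (a hedged sketch, using the Meta checkpoint that fails the same way, see below; I'm assuming the Llama 3 convention of declaring several EOS tokens applies here) shows where a list can come from:

from transformers import GenerationConfig

# Llama 3 checkpoints typically declare multiple EOS tokens in their
# generation config, e.g. [128001, 128009], so eos_token_id comes back
# as a list and the assertion in token_selector.py fails.
gen_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
print(gen_config.eos_token_id)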

Is there some preparation needed to run this model on Inferentia with this library?

Who can help?

@JingyaHuang @dacorvo


Reproduction (minimal, reproducible, runnable)

Available above.

Expected behavior

The endpoint should start.

mariokostelac commented 1 month ago

I can confirm that meta-llama/Meta-Llama-3-70B-Instruct fails the same way.

dacorvo commented 1 month ago

This issue is fixed with version 0.0.22.

mariokostelac commented 1 month ago

@dacorvo trying it out with 0.0.22 šŸ™‡
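For anyone following along, the only change to the snippet above should be the container version passed to get_huggingface_llm_image_uri:

image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.22"),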

dacorvo commented 1 month ago

The corresponding pull request is #580. The SageMaker Python package might not have been updated yet to support 0.0.22 (it was due later today).

Update: It is actually available (great!). FYI, the image_uri should be something like: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04
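If the SageMaker SDK mapping ever lags behind, the image can also be pinned explicitly instead of going through get_huggingface_llm_image_uri. A minimal sketch, assuming us-east-1 (763104351884 is AWS's Deep Learning Containers registry; swap the region to match your endpoint):

image_uri = (
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.22-neuronx-py310-ubuntu22.04"
)
huggingface_model = HuggingFaceModel(image_uri=image_uri, env=hub, role=role)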

mariokostelac commented 1 month ago

Yes, I figured it was available, but it's still creating the endpoint šŸ˜ .

mariokostelac commented 1 month ago

Thanks a lot @dacorvo, I can confirm that it worked for me by just changing the version to 0.0.22 in the snippet above. Do you know who'd be responsible for fixing that on the HF UI?

dacorvo commented 1 month ago

@mariokostelac thank you for the feedback. I'll take care of it. We were actually waiting for the sagemaker update, and I had not realized it was ready.

dacorvo commented 1 month ago

The update was done this morning, but it has not been refreshed yet. It should be fixed soon.

mariokostelac commented 1 month ago

Thanks a lot for the quick support on this issue. I'm now running the original model (the nvidia one) to verify that it works there too. Given that the tokenizer configs are the same, I'd be very surprised if it didn't.

dacorvo commented 1 month ago

Feel free to report any issues you get: feedback on such new features/models is very valuable.