huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

CodeLlama generates weird tokens with TGI 0.0.24 #704

Open · pinak-p opened 1 month ago

pinak-p commented 1 month ago

System Info

Using TGI v0.0.24 to deploy the model on SageMaker

Who can help?

@dacorvo

Reproduction (minimal, reproducible, runnable)

I'm using the configuration below to deploy the model on SageMaker.

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = get_execution_role()  # SageMaker execution role

# TGI environment configuration for the Neuron container
hub = {
    "HF_MODEL_ID": "meta-llama/CodeLlama-7b-Instruct-hf",
    "HF_NUM_CORES": "2",          # Neuron cores used for tensor parallelism
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HF_TOKEN": "<your Hugging Face token>",  # placeholder
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.24"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

Text Generation:

predictor.predict(
    {
        "inputs": "Write a function to generate random numbers in python",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 256,
            "temperature": 0.1,
            "top_k": 10,
        },
    }
)

Output:

[{'generated_text': 'Write a function to generate random numbers in python stick (or (or (or (E2 (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or'}]

Expected behavior

The model should generate coherent text instead of the repeated gibberish shown above.

dacorvo commented 1 month ago

@pinak-p I can reproduce your issue, both on SageMaker and locally with the 0.0.24 image.

I verified that deploying the model with neuronx-tgi 0.0.23 leads to meaningful results, so the regression seems specific to the 0.0.24 version.

dacorvo commented 1 month ago

@pinak-p this is not only a TGI issue: I also get gibberish with optimum-neuron itself, which makes me think this is actually the same issue as the one you reported in transformers-neuronx: https://github.com/aws-neuron/transformers-neuronx/issues/94. Can you verify that the issue also happens with a vanilla transformers-neuronx model using continuous batching?
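
For reference, such a vanilla continuous-batching check could look roughly like the sketch below. It is only an illustration: the class and argument names (LlamaForSampling, NeuronConfig, ContinuousBatchingConfig, batch_size_for_shared_caches) follow the transformers-neuronx documentation and may differ between releases, so check them against your installed version.

from transformers import AutoTokenizer
from transformers_neuronx.config import ContinuousBatchingConfig, NeuronConfig
from transformers_neuronx.llama.model import LlamaForSampling

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"

# Enable continuous batching with a shared KV cache for 4 sequences,
# mirroring MAX_BATCH_SIZE=4 from the deployment above (assumed settings).
neuron_config = NeuronConfig(
    continuous_batching=ContinuousBatchingConfig(batch_size_for_shared_caches=4),
)

# Compile for 2 Neuron cores in fp16, mirroring HF_NUM_CORES / HF_AUTO_CAST_TYPE.
# Note: some transformers-neuronx releases require the checkpoint to be saved
# with save_pretrained_split() first instead of loading a hub id directly.
model = LlamaForSampling.from_pretrained(
    model_id,
    batch_size=4,
    tp_degree=2,
    amp="f16",
    n_positions=4096,
    neuron_config=neuron_config,
)
model.to_neuron()  # trigger compilation and load onto the Neuron cores

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer(
    "Write a function to generate random numbers in python",
    return_tensors="pt",
).input_ids
generated = model.sample(input_ids, sequence_length=4096, top_k=10)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

If the same repeated tokens show up here, the regression is in transformers-neuronx rather than in the TGI layer.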

dacorvo commented 1 month ago

@pinak-p could you check with version 0.0.25?

pinak-p commented 1 month ago

What's the URL for 0.0.25? I don't see it listed at https://github.com/aws/deep-learning-containers/blob/master/available_images.md, nor does the SageMaker SDK have that version.

dacorvo commented 1 month ago

@pinak-p it is still being deployed, but you can use the neuronx-tgi Docker image on an EC2 instance: https://github.com/huggingface/optimum-neuron/pkgs/container/neuronx-tgi. Alternatively, you can use optimum-neuron directly and create a pipeline (see the documentation).
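
For the optimum-neuron route, a minimal sketch follows. The export arguments mirror the TGI settings from this issue; verify them against the optimum-neuron documentation for your installed version.

from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"

# Export/compile the model for Neuron; mirrors HF_NUM_CORES=2, fp16, batch 4
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="fp16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(
    "Write a function to generate random numbers in python",
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=256,
    temperature=0.1,
    top_k=10,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Running this against 0.0.24 and 0.0.25 (or the equivalent optimum-neuron releases) should show whether the fix resolves the gibberish output.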