huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

CodeLlama generates weird tokens with TGI 0.0.24 #704

Open · pinak-p opened 1 month ago

pinak-p commented 1 month ago

System Info

Using TGI v0.0.24 to deploy the model on SageMaker

Who can help?

@dacorvo

Reproduction (minimal, reproducible, runnable)

I'm using the configuration below to deploy the model on SageMaker.

from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = get_execution_role()  # SageMaker execution role

# TGI environment configuration for the Neuron container
hub = {
    "HF_MODEL_ID": "meta-llama/CodeLlama-7b-Instruct-hf",
    "HF_NUM_CORES": "2",          # Neuron cores used for tensor parallelism
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HF_TOKEN": "<your Hugging Face token>",  # placeholder
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.24"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

Text Generation:

predictor.predict(
    {
        "inputs": "Write a function to generate random numbers in python",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 256,
            "temperature": 0.1,
            "top_k": 10,
        },
    }
)

Output:

[{'generated_text': 'Write a function to generate random numbers in python stick (or (or (or (E2 (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or (or'}]

Expected behavior

The model should generate coherent text instead of the repeated gibberish shown above.

dacorvo commented 1 month ago

@pinak-p I can reproduce your issue, both on SageMaker and locally with the 0.0.24 image.

I verified that deploying the model with neuronx-tgi 0.0.23 leads to meaningful results, so the regression seems specific to the 0.0.24 version.

dacorvo commented 1 month ago

@pinak-p this is not only a TGI issue: I also get gibberish with optimum-neuron itself, which makes me think this is actually the same issue as the one you reported in transformers-neuronx: https://github.com/aws-neuron/transformers-neuronx/issues/94. Can you verify that the issue also happens with a vanilla transformers-neuronx model using continuous batching?
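
For reference, such a vanilla continuous-batching check could look roughly like the sketch below. It is only an illustration: the class and argument names (LlamaForSampling, NeuronConfig, ContinuousBatchingConfig, batch_size_for_shared_caches) follow the transformers-neuronx documentation and may differ between releases, so check them against your installed version.

from transformers import AutoTokenizer
from transformers_neuronx.config import ContinuousBatchingConfig, NeuronConfig
from transformers_neuronx.llama.model import LlamaForSampling

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"

# Enable continuous batching with a shared KV cache for 4 sequences,
# mirroring MAX_BATCH_SIZE=4 from the deployment above (assumed settings).
neuron_config = NeuronConfig(
    continuous_batching=ContinuousBatchingConfig(batch_size_for_shared_caches=4),
)

# Compile for 2 Neuron cores in fp16, mirroring HF_NUM_CORES / HF_AUTO_CAST_TYPE.
# Note: some transformers-neuronx releases require the checkpoint to be saved
# with save_pretrained_split() first instead of loading a hub id directly.
model = LlamaForSampling.from_pretrained(
    model_id,
    batch_size=4,
    tp_degree=2,
    amp="f16",
    n_positions=4096,
    neuron_config=neuron_config,
)
model.to_neuron()  # trigger compilation and load onto the Neuron cores

tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer(
    "Write a function to generate random numbers in python",
    return_tensors="pt",
).input_ids
generated = model.sample(input_ids, sequence_length=4096, top_k=10)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

If the same repeated tokens show up here, the regression is in transformers-neuronx rather than in the TGI layer.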

dacorvo commented 1 month ago

@pinak-p could you check with version 0.0.25?

pinak-p commented 1 month ago

What's the URL for 0.0.25? I don't see it listed at https://github.com/aws/deep-learning-containers/blob/master/available_images.md, nor does the SageMaker SDK have that version.

dacorvo commented 1 month ago

@pinak-p it is still being deployed, but you can use the neuronx-tgi Docker image on an EC2 instance: https://github.com/huggingface/optimum-neuron/pkgs/container/neuronx-tgi. Alternatively, you can use optimum-neuron directly and create a pipeline (see the documentation).
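
For the optimum-neuron route, a minimal sketch follows. The export arguments mirror the TGI settings from this issue; verify them against the optimum-neuron documentation for your installed version.

from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForCausalLM

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"

# Export/compile the model for Neuron; mirrors HF_NUM_CORES=2, fp16, batch 4
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=4,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="fp16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer(
    "Write a function to generate random numbers in python",
    return_tensors="pt",
)
outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=256,
    temperature=0.1,
    top_k=10,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

Running this against 0.0.24 and 0.0.25 (or the equivalent optimum-neuron releases) should show whether the fix resolves the gibberish output.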