bitliner opened this issue 3 months ago
This is probably a better question for @lanking520 and @siddvenk - I assume the vLLM engine inside these docker images was set to be initialized with --max-model-len of 8192?
Hi @bitliner
Can you try using OPTION_MAX_MODEL_LEN for this configuration? See https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/vllm_user_guide.html#advanced-vllm-configurations for the specific mappings.
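For example, with the SageMaker Python SDK the env mapping would look roughly like the sketch below (the image URI is the LMI container from this thread; the IAM role, instance type, and endpoint name are placeholders):

```python
# Rough sketch of wiring OPTION_MAX_MODEL_LEN through the SageMaker env.
# Placeholders: the IAM role, instance type, and endpoint name are examples only.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "OPTION_MAX_MODEL_LEN": "15360",  # maps to vLLM's --max-model-len
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",       # example instance type
    endpoint_name="llama3-8b-instruct",  # example endpoint name
)
```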
Adding { ... "OPTION_MAX_MODEL_LEN": "15360" ... } to the env raises the following error:
[INFO ] PyProcess - W-245-12ab844166a18f3-stdout: ValueError: User-specified max_model_len (15360) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
Adding the option { ... "OPTION_MAX_POSITION_EMBEDDINGS": "15360" ... } to the env raises the following error during model initialization:
ai.djl.serving.http.ServerStartupException: Failed to initialize startup models and workflows
I guess that's because it is not a supported option. Looks like I should update the model's config.json? :(
Sorry, I didn't catch earlier that you are using the meta-llama/Meta-Llama-3-8B-Instruct model. This model only supports a context length of 8192 (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json#L13), so you cannot use a max length greater than 8192 with it.
Are you trying to use the Llama 3.1 variant, which supports context lengths of up to 128k? https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct. That updated version of Llama 3 will support 15360 and longer.
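If you do want 15360, switching the model ID is roughly all that changes in the env (sketch below; a Hugging Face access token may also be needed since the repo is gated):

```python
# Sketch: same container and env keys, but pointing at the Llama 3.1 model,
# whose 128k context window comfortably covers max_model_len=15360.
env = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "OPTION_MAX_MODEL_LEN": "15360",
}
```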
Note that the issues you are seeing are not problems with vllm or djl-serving; they are purely a limitation of the model you are using. Updating the config.json is not recommended, as the model does not support a context length longer than 8192.
I'm running vllm with:
meta-llama/Meta-Llama-3-8B-Instruct
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
The predictor (after deploying the model to the endpoint_name) is instantiated and used as follows:
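(The original snippet isn't reproduced here; a minimal sketch of what that typically looks like with the SageMaker SDK, assuming the standard LMI "inputs"/"parameters" payload format:)

```python
# Sketch of creating and calling a Predictor for an existing LMI endpoint.
# The endpoint name is a placeholder; the payload shape assumes the standard
# "inputs"/"parameters" schema used by the LMI containers.
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name="llama3-8b-instruct",  # placeholder endpoint name
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 256},
})
print(response)
```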
I've tried to add the parameter
MAX_LENGTH=15360
to the env (as above) but it does not seem to make a difference. Should I put some other parameter in the env?