bitliner opened this issue 3 months ago
This is probably a better question for @lanking520 and @siddvenk - I assume the vLLM engine inside these docker images was set to be initialized with --max-model-len of 8192?
Hi @bitliner
Can you try using OPTION_MAX_MODEL_LEN for this configuration? See https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/vllm_user_guide.html#advanced-vllm-configurations for the specific mappings.
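For example, with the SageMaker Python SDK the env mapping would look roughly like the sketch below (the image URI is the LMI container from this thread; the IAM role, instance type, and endpoint name are placeholders):

```python
# Rough sketch of wiring OPTION_MAX_MODEL_LEN through the SageMaker env.
# Placeholders: the IAM role, instance type, and endpoint name are examples only.
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()
image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
        "OPTION_MAX_MODEL_LEN": "15360",  # maps to vLLM's --max-model-len
    },
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",       # example instance type
    endpoint_name="llama3-8b-instruct",  # example endpoint name
)
```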
Adding { ... "OPTION_MAX_MODEL_LEN": "15360" ... } to the env raises the following error:
[INFO ] PyProcess - W-245-12ab844166a18f3-stdout: ValueError: User-specified max_model_len (15360) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value is correct and within the model context size.
Adding the option { ... "OPTION_MAX_POSITION_EMBEDDINGS": "15360" ... } to the env raises the following error during model initialization:
ai.djl.serving.http.ServerStartupException: Failed to initialize startup models and workflows
I guess that's because it is not a supported option. Looks like I should update the model's config.json? :(
Sorry, I didn't catch earlier that you are using the meta-llama/Meta-Llama-3-8B-Instruct model. This model only supports a context length of 8192 (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json#L13), so you cannot use a max length greater than 8192 with it.
Are you trying to use the Llama 3.1 variant, which supports context lengths of up to 128k? https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct. That updated version of Llama 3 will support 15360 and longer.
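If you do want 15360, switching the model ID is roughly all that changes in the env (sketch below; a Hugging Face access token may also be needed since the repo is gated):

```python
# Sketch: same container and env keys, but pointing at the Llama 3.1 model,
# whose 128k context window comfortably covers max_model_len=15360.
env = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "OPTION_MAX_MODEL_LEN": "15360",
}
```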
Note that the issues you are seeing are not problems with vllm or djl-serving; they are purely a limitation of the model you are using. Updating the config.json is not recommended, as the model does not support a context length longer than 8192.
I'm running vllm with:
meta-llama/Meta-Llama-3-8B-Instruct
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
The predictor (after deploying the model to the endpoint_name) is instantiated and used as follows:
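(The original snippet isn't reproduced here; a minimal sketch of what that typically looks like with the SageMaker SDK, assuming the standard LMI "inputs"/"parameters" payload format:)

```python
# Sketch of creating and calling a Predictor for an existing LMI endpoint.
# The endpoint name is a placeholder; the payload shape assumes the standard
# "inputs"/"parameters" schema used by the LMI containers.
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name="llama3-8b-instruct",  # placeholder endpoint name
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict({
    "inputs": "What is Amazon SageMaker?",
    "parameters": {"max_new_tokens": 256},
})
print(response)
```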
I've tried to add the parameter
MAX_LENGTH=15360
to the env (as above) but it does not seem to make a difference. Should I put some other parameter in the env?