zky001 opened this issue 8 months ago
Hi, thanks for running our code. It looks like you are encountering an issue with vLLM. You could refer to https://github.com/vllm-project/vllm/issues/2418 and try the solution mentioned there. Since vLLM's behavior may depend on your CUDA and torch versions, I cannot determine the exact fix for your case. If you still encounter issues with vLLM, you may switch to Hugging Face inference instead; a minimal sketch of that fallback follows.
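For illustration only, a minimal Hugging Face inference sketch; the model path, dtype, and generation settings below are placeholders rather than values from this repo:

```python
# Minimal Hugging Face fallback, assuming a causal LM checkpoint.
# "your/model-path" is a placeholder -- substitute the checkpoint you are using.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "your/model-path"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Hello, world"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```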
> The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (1792). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
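For reference, a sketch of how those two options can be passed when constructing the vLLM engine; the model path and the exact values are placeholders you would adjust to your GPU memory:

```python
# Sketch of initializing vLLM with the options named in the error message.
# "your/model-path" is a placeholder; tune the values to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your/model-path",
    gpu_memory_utilization=0.95,  # give the KV cache more GPU memory (default is about 0.9)
    max_model_len=2048,           # or cap the context below the model's 4096 maximum
)

outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```

Either raising `gpu_memory_utilization` or lowering `max_model_len` should let the KV cache hold the full sequence length the engine is asked to serve.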