Seems like there is an issue with the latest release. It works if I check out this specific hash: 6ef09b08f88b675f84b7140238286e5d4c5304c8 (version v0.4.1); however, checking out the tag v0.4.1 does not work and results in a different error.
I still get the same error with a fresh install of vllm 0.5.0+neuron214:
........
Compiler status PASS
2024-09-09 14:23:11.000504: 58607 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
.
Compiler status PASS
2024-09-09 14:23:32.000296: 58608 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-Sep-09 14:23:34.0116 58275:58494 [5] init.cc:125 CCOM WARN NET/Plugin : No plugin found (libnccl-net.so), multi-instance execution will not work
Traceback (most recent call last):
File "/home/ec2-user/fmwork/./driver", line 55, in <module>
main()
File "/home/ec2-user/fmwork/./driver", line 50, in main
dts = fmwork.loop(par.reps, llm.generate, kwargs)
File "/home/ec2-user/fmwork/fmwork.py", line 71, in loop
function(**kwargs)
File "/home/ec2-user/vllm/0.5.0/py310/vllm/vllm/utils.py", line 677, in inner
return fn(*args, **kwargs)
File "/home/ec2-user/vllm/0.5.0/py310/vllm/vllm/entrypoints/llm.py", line 304, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
File "/home/ec2-user/vllm/0.5.0/py310/vllm/vllm/entrypoints/llm.py", line 554, in _run_engine
step_outputs = self.llm_engine.step()
File "/home/ec2-user/vllm/0.5.0/py310/vllm/vllm/engine/llm_engine.py", line 773, in step
output = self.model_executor.execute_model(
File "/home/ec2-user/vllm/0.5.0/py310/vllm/vllm/executor/neuron_executor.py", line 53, in execute_model
and execute_model_req.blocks_to_copy == {}), (
AssertionError: Cache operations are not supported for Neuron backend.
(20240909-py310) ec2-user@ip-10-0-173-55 fmwork$ pip list | grep vllm
vllm 0.5.0+neuron214 /home/ec2-user/vllm/0.5.0/py310/vllm
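For context, the assertion that fires is the guard in vllm/executor/neuron_executor.py shown in the traceback. Paraphrased below as a sketch (not the verbatim upstream source; the names are taken from the traceback), it rejects any scheduled step that requests KV-cache swap or copy operations, since the Neuron backend does not implement them:

```python
# Sketch of the guard in vllm/executor/neuron_executor.py, paraphrased from
# the traceback above; the exact upstream code may differ between versions.
def execute_model(self, execute_model_req):
    # The Neuron backend cannot swap or copy KV-cache blocks, so any
    # scheduler output that requests these operations trips the assertion.
    assert (execute_model_req.blocks_to_swap_in == {}
            and execute_model_req.blocks_to_swap_out == {}
            and execute_model_req.blocks_to_copy == {}), (
        "Cache operations are not supported for Neuron backend.")
    ...
```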
Your current environment
🐛 Describe the bug
The error is: "Cache operations are not supported for Neuron backend"
My code is:

```python
from typing import Optional
from pydantic import BaseModel
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import time
import os
from huggingface_hub import login

login(token='xxxxx')

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
llm = LLM(
    model=model_id,
    max_num_seqs=1,
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when AWS Neuron SDK is installed.
)

start_time = time.time()

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
outputs = llm.generate(["What is your name?"], sampling_params)

# End timing
end_time = time.time()

total_tokens = 0
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
It seems to happen on the actual llm.generate() line.
I ran this successfully a few days/weeks ago, but now I suddenly get this issue. I tried checking out other release versions, but that has not helped.
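For what it's worth, this is how I confirm which build is actually being imported after each checkout (plain Python introspection; the expected values are just what my environment above reports):

```python
# Sanity check of which vLLM build gets imported after switching checkouts;
# expected values are taken from the `pip list` output in this report.
import vllm

print(vllm.__version__)  # expected: 0.5.0+neuron214
print(vllm.__file__)     # expected: a path under /home/ec2-user/vllm/0.5.0/py310/vllm
```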