vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker #6269

Closed. areanddee closed this issue 2 months ago.

areanddee commented 2 months ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
WARNING 07-09 19:46:51 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1031-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 8 MiB (16 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] torch-neuronx==2.1.2.2.2.0
[pip3] torch-xla==2.1.3
[pip3] torchvision==0.16.2
[pip3] transformers==4.42.3
[pip3] transformers-neuronx==0.11.351
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: (0, 'instance-type: inf2.8xlarge\ninstance-id: i-0d1255be16fb43afc\n+--------+--------+--------+---------+\n| NEURON | NEURON | NEURON | PCI |\n| DEVICE | CORES | MEMORY | BDF |\n+--------+--------+--------+---------+\n| 0 | 2 | 32 GB | 00:1f.0 |\n+--------+--------+--------+---------+', '')
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect

🐛 Describe the bug

The Offline Batched Inference example from your Quickstart page, running on Neuron inf2 (AWS inf2.8xlarge), returns: "TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker". The only change from your example code is the addition of the device="neuron" argument to LLM.

This code should reproduce the error:

from vllm import LLM, SamplingParams

prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m",device="neuron") outputs = llm.generate(prompts, sampling_params)

Print the outputs.

for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

jianyinglangaws commented 2 months ago

I still get this error with the latest merge of https://github.com/vllm-project/vllm/pull/6313.

Traceback (most recent call last):
  File "/home/ubuntu/vllm/simple_test.py", line 11, in <module>
    llm = LLM(model="facebook/opt-125m",device="neuron")
  File "/home/ubuntu/vllm/vllm/entrypoints/llm.py", line 150, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 421, in from_engine_args
    engine = cls(
  File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/home/ubuntu/vllm/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/home/ubuntu/vllm/vllm/executor/neuron_executor.py", line 21, in _init_executor
    self._init_worker()
  File "/home/ubuntu/vllm/vllm/executor/neuron_executor.py", line 26, in _init_worker
    self.driver_worker = NeuronWorker(
TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker
WoosukKwon commented 2 months ago

@liangfu Could you please take a look at @jianyinglangaws's comment?

areanddee commented 2 months ago

I saw that the patch for #6313 was merged to main, so I did a git pull origin main to update to the latest. I can confirm that the behavior seen in #6269 is still present, namely:

TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

I posted this info on #6313 as well because it was speculated that #6313 would fix #6269. I am beginning to wonder about the status of support for the Neuron architecture in vLLM. Is it vLLM's intention to drop support for Neuron? I ask only because the vLLM documentation suggests that, at least as of v0.3, vLLM did support Neuron. Now, of course, those quickstart examples do not work. Any information regarding the direction vLLM is headed with respect to Neuron would be appreciated.

areanddee commented 2 months ago

One extra question: can you recommend a previous known-working version of vLLM for Neuron, and can you provide a tag for it?

areanddee commented 2 months ago

I tried to clone and install vLLM v0.3.0, but its requirements include pkg_resources, which has been deprecated.

jianyinglangaws commented 2 months ago

The vllm 0.5.0 works for me with the latest Neuron SDK 2.19.0 and transformers-neuronx.

areanddee commented 2 months ago

> The vllm 0.5.0 works for me with the latest Neuron SDK 2.19.0 and transformers-neuronx.

That's very interesting to know. AFAIK, the DLAMI instance one gets by default from AWS is 2.18.2. How are you accessing 2.19.0? How do I install/upgrade to it?

jianyinglangaws commented 2 months ago

The default Neuron DLAMI (Deep Learning AMI Neuron (Ubuntu 22.04) 20240703) was updated to 2.19.0 last week.

areanddee commented 2 months ago

I see, thank you - I will try to use a new instance.

areanddee commented 2 months ago

I installed vllm on DLAMI 2.19.0 following these procedures (steps 2 and 3) VERBATIM: https://docs.vllm.ai/en/latest/getting_started/neuron-installation.html. The Quickstart example still fails with the same error. So I guess we need to understand how your working 2.19.0 instance differs from my broken one. Can you provide your collect_env.py output? Or do you want me to provide mine?

> The vllm 0.5.0 works for me with the latest Neuron SDK 2.19.0 and transformers-neuronx.

jianyinglangaws commented 2 months ago

If you use the Neuron DLAMI, you do not need to do steps 2 and 3. Just activate the pre-built virtual env with `source /opt/aws_neuronx_venv_transformers_neuronx/bin/activate` and install vllm there. My collect_env.py output is:

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] torch-neuronx==2.1.2.2.2.0
[pip3] torch-xla==2.1.3
[pip3] torchvision==0.16.2
[pip3] transformers==4.42.3
[pip3] transformers-neuronx==0.11.351
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
vLLM Version: 0.5.0
areanddee commented 2 months ago

Well, I'm a bit confused. I sourced, per your instructions, /opt/aws_neuronx_venv_transformers_neuronx/bin/activate on a clean 2.19.0 instance. Installing vllm IS step 3, so I'm confused about that, and I'm confused when you say "install vllm there". Where is "there"? What I did was install vllm in /root/vllm with the venv activated: with aws_neuronx_venv_transformers_neuronx activated, I git cloned vllm in /root, entered vllm, ran `pip install -U -r requirements-neuron.txt`, and then ran `pip install .`. The OPT-125M example code still gives the same error. I also can't see any "meaningful" differences between your collect_env.py output and mine. I am running transformers 4.42.4 and vLLM 0.5.1. Would either of those make a difference?

I want to make sure "working" is defined as this code:

from vllm import LLM, SamplingParams

prompts = [ "Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is", ] sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m",device="neuron") outputs = llm.generate(prompts, sampling_params)

Print the outputs.

for output in outputs: prompt = output.prompt generated_text = output.outputs[0].text print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

not producing a TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

jianyinglangaws commented 2 months ago

vLLM 0.5.1 does not work. Downgrade to 0.5.0. Also change to a llama2 model in the example.
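For reference, a minimal variant of the quickstart snippet with the model swapped, as suggested above. This is a sketch: the TinyLlama checkpoint name is used purely as an illustration of a llama2-family model; substitute whichever llama2 checkpoint you have access to.

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Same quickstart flow, with the OPT checkpoint replaced by a
# llama2-family model (TinyLlama here is illustrative only).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="neuron")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")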

areanddee commented 2 months ago

> vLLM 0.5.1 does not work. Downgrade to 0.5.0. Also change to a llama2 model in the example.

OK, I switched to TinyLlama. If that works, I'll try to find a llama2-13b example to test...

areanddee commented 2 months ago

I got TinyLlama to work after applying the patch to NeuronExecutor(ExecutorBase) recommended in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#vllm-v0-5-0-neuron-patch.

I want to thank you for sticking with me on this issue to get this far.

My ultimate goal is to run LLAMA-3 8B and LLAMA-3 70B on Neuron using vllm. Any advice or input regarding the feasibility or steps to take to reach this goal would be appreciated.

servient-ashwin commented 2 months ago
Collecting environment information...
WARNING 07-15 19:13:04 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-1020-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             4
On-line CPU(s) list:                0-3
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R13 Processor
CPU family:                         25
Model:                              1
Thread(s) per core:                 2
Core(s) per socket:                 2
Socket(s):                          1
Stepping:                           1
BogoMIPS:                           5299.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          64 KiB (2 instances)
L1i cache:                          64 KiB (2 instances)
L2 cache:                           1 MiB (2 instances)
L3 cache:                           8 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] torch-neuronx==2.1.2.2.2.0
[pip3] torch-xla==2.1.3
[pip3] torchvision==0.16.2
[pip3] transformers==4.42.4
[pip3] transformers-neuronx==0.11.351
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: (0, 'instance-type: inf2.xlarge\ninstance-id: i-01bbe72eda7217750\n+--------+--------+--------+---------+\n| NEURON | NEURON | NEURON |   PCI   |\n| DEVICE | CORES  | MEMORY |   BDF   |\n+--------+--------+--------+---------+\n| 0      | 2      | 32 GB  | 00:1f.0 |\n+--------+--------+--------+---------+', '')
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

I am getting the same error even now

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 282, in <module>
    run_server(args)
  File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 224, in run_server
    if llm_engine is not None else AsyncLLMEngine.from_engine_args(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
    engine = cls(
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 373, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/root/vllm/vllm/engine/async_llm_engine.py", line 520, in _init_engine
    return engine_class(*args, **kwargs)
  File "/root/vllm/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/root/vllm/vllm/executor/executor_base.py", line 150, in __init__
    super().__init__(model_config, cache_config, parallel_config,
  File "/root/vllm/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/root/vllm/vllm/executor/neuron_executor.py", line 21, in _init_executor
    self._init_worker()
  File "/root/vllm/vllm/executor/neuron_executor.py", line 26, in _init_worker
    self.driver_worker = NeuronWorker(
TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

Is downgrading to 0.5.0 the only working solution right now? I am working with Mistral Instruct v0.3

areanddee commented 2 months ago
> Is downgrading to 0.5.0 the only working solution right now?

Yes, it appears so. Also be advised that you have to apply the patch located here:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html#vllm-v0-5-0-neuron-patch

before building vLLM 0.5.0.

servient-ashwin commented 2 months ago

@areanddee Can you share how you were able to get TinyLlama working on your machine? It seems from your comments that 0.5.0 and the patch should do the job, or did you do something else as well? I keep getting this:

INFO 07-16 21:28:27 config.py:1214] Downcasting torch.float32 to torch.float16.
INFO 07-16 21:28:27 llm_engine.py:161] Initializing an LLM engine (v0.5.0) with config: model='TinyLlama/TinyLlama_v1.1', speculative_config=None, tokenizer='TinyLlama/TinyLlama_v1.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=TinyLlama/TinyLlama_v1.1)
WARNING 07-16 21:28:27 utils.py:456] Pin memory is not supported on Neuron.
2024-07-16 21:29:08.000341:  1213  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-07-16 21:29:08.000346:  1213  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/root/neuroncc_compile_workdir/a3dc91b7-a9c7-41c5-b4c7-9d86f03f370b/model.MODULE_9d35962a695208729704+2c2d707e.hlo_module.pb --output /tmp/root/neuroncc_compile_workdir/a3dc91b7-a9c7-41c5-b4c7-9d86f03f370b/model.MODULE_9d35962a695208729704+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35
2024-07-16 21:29:08.000406:  1214  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-07-16 21:29:08.000408:  1214  ERROR ||NEURON_CC_WRAPPER||: Got a cached failed neff at /var/tmp/neuron-compile-cache/neuronxcc-2.14.213.0+013d129b/MODULE_ffa1e1fbf98119ebffaf+2c2d707e/model.neff. Will skip compilation, please set --retry_failed_compilation for recompilation:
 Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/root/neuroncc_compile_workdir/7e69684d-45d3-42b4-b735-1eb35d4eafa9/model.MODULE_ffa1e1fbf98119ebffaf+2c2d707e.hlo_module.pb', '--output', '/tmp/root/neuroncc_compile_workdir/7e69684d-45d3-42b4-b735-1eb35d4eafa9/model.MODULE_ffa1e1fbf98119ebffaf+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-07-16T21:26:23Z [F137] neuronx-cc was forcibly killed - This most commonly occurs due to insufficient system memory. Using a smaller data type, dimensions, batch size, or a larger instance type may help.
.
2024-07-16 21:29:08.000408:  1214  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache

This tells me about a memory problem, but if full precision and even 16-bit aren't supported, what is supported? fp8 or something else? If this is an issue for a small model like TinyLlama, it may well also be the cause of the issue I opened, #6452. However, vLLM doesn't change the precision while loading Mistral 7B models, which is weird.

liangfu commented 1 month ago

I'm looking at the config:

max_seq_len=2048
tensor_parallel_size=1

It seems like the setup is intended to run TinyLlama on one NeuronCore with a 2048 sequence length.

Before stretching to a 2048-token sequence on 1 NeuronCore_v2, are you able to reproduce the setup here? https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_neuron.py

(The offline inference example demonstrates TinyLlama inference with two NeuronCore_v2, while limiting sequence length to 128.)
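For comparison, this is roughly the shape of that example. Treat it as a sketch, not the canonical file: the parameter values are an approximation of the linked script and may differ slightly from what is in the repo.

from vllm import LLM, SamplingParams

# Approximate setup from examples/offline_inference_neuron.py:
# short sequences (max_model_len/block_size of 128) and
# tensor_parallel_size=2 so the model compiles for two NeuronCore_v2.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    max_num_seqs=8,
    max_model_len=128,
    block_size=128,
    device="neuron",
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
for output in outputs:
    print(output.outputs[0].text)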

omrishiv commented 1 month ago

I think this should be reopened. The issue here is an unimplemented abstract function. I'm working on fixing this, as well as a few other issues that have come up after 0.5.0. cc #6640
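For anyone wondering what the error means mechanically: Python refuses to instantiate any subclass of an abc.ABC that leaves an @abstractmethod unimplemented. Below is a minimal, self-contained sketch of the same failure mode; the class and method names are chosen to mirror the error message, not to reproduce vLLM's actual class layout.

from abc import ABC, abstractmethod

class WorkerBase(ABC):
    @abstractmethod
    def execute_worker(self):
        """Concrete workers must implement this."""

class NeuronWorker(WorkerBase):
    # execute_worker is not overridden, so the class remains abstract.
    pass

NeuronWorker()
# TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

So the fix is for NeuronWorker to implement every abstract method its worker base class declares, which is presumably what the follow-up work referenced above does.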