vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GPTQ Marlin with cpu-offload-gb fails on `0.5.4` #7204

Closed w013nad closed 1 month ago

w013nad commented 1 month ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr  6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-4.18.0-425.19.2.el8_7.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB

Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-255
Thread(s) per core:              2
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7742 64-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         3391.018
CPU max MHz:                     2250.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4491.45
Virtualization:                  AMD-V
L1d cache:                       4 MiB
L1i cache:                       4 MiB
L2 cache:                        64 MiB
L3 cache:                        512 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-255
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es

Versions of relevant libraries:
[pip3] flashinfer==0.0.9+cu121torch2.3
[pip3] numpy==1.26.4
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity
GPU0     X      NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
GPU1    NV12     X      PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3
NIC0    PXB     PXB      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

🐛 Describe the bug

I'm running vLLM 0.5.4 and was trying to run a GPTQ model with CPU offloading. This should have been fixed by #6960, but it appears it was not.
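For reference, the same failure should be reproducible without the OpenAI server, through the offline LLM API; a minimal sketch (the model path is the same local GPTQ checkpoint, and the exact cpu_offload_gb value is not important). The server command I actually ran follows below.

from vllm import LLM

# Sketch only: load a GPTQ checkpoint (auto-upgraded to gptq_marlin) with CPU offload enabled.
# On 0.5.4 this fails during weight offload with "Cannot copy out of meta tensor; no data!".
llm = LLM(
    model="/home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4",
    tensor_parallel_size=4,
    max_model_len=1000,
    cpu_offload_gb=30,
)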

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --cpu-offload-gb 30
root@428f68245052:/vllm-workspace# python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --cpu-offload-gb 30
INFO 08-06 12:48:23 api_server.py:219] vLLM API server version 0.5.3.post1
INFO 08-06 12:48:23 api_server.py:220] args: Namespace(host=None, port=5006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=30.0, gpu_memory_utilization=0.79, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/ndurkee/temp/llama3_70b_fixed/'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=10)
INFO 08-06 12:48:23 gptq_marlin.py:87] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-06 12:48:23 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4', speculative_config=None, tokenizer='/home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/ndurkee/temp/llama3_70b_fixed/, use_v2_block_manager=True, enable_prefix_caching=True)
INFO 08-06 12:48:24 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=83) INFO 08-06 12:48:24 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=84) INFO 08-06 12:48:24 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=85) INFO 08-06 12:48:24 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
INFO 08-06 12:48:26 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=84) INFO 08-06 12:48:26 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=85) INFO 08-06 12:48:26 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=83) INFO 08-06 12:48:26 utils.py:784] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=84) INFO 08-06 12:48:26 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 12:48:26 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=85) INFO 08-06 12:48:26 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=83) INFO 08-06 12:48:26 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 12:48:27 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f198dd419f0>, local_subscribe_port=60591, local_sync_port=35463, remote_subscribe_port=None, remote_sync_port=None)
INFO 08-06 12:48:27 model_runner.py:680] Starting to load model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4...
(VllmWorkerProcess pid=84) INFO 08-06 12:48:27 model_runner.py:680] Starting to load model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4...
(VllmWorkerProcess pid=85) INFO 08-06 12:48:27 model_runner.py:680] Starting to load model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4...
(VllmWorkerProcess pid=83) INFO 08-06 12:48:27 model_runner.py:680] Starting to load model /home/ndurkee/ndurkee/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4...
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Cannot copy out of meta tensor; no data!, Traceback (most recent call last):
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     model = _initialize_model(model_config, self.load_config,
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 111, in _initialize_model
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     return model_class(config=model_config.hf_config,
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in __init__
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     self.model = LlamaModel(config,
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in __init__
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     self.start_layer, self.end_layer, self.layers = make_layers(
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 144, in make_layers
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     [PPMissingLayer() for _ in range(start_layer)] + [
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 145, in <listcomp>
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 102, in maybe_offload_to_cpu
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     cpu_data.copy_(p.data)
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in __torch_function__
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226] NotImplementedError: Cannot copy out of meta tensor; no data!
(VllmWorkerProcess pid=83) ERROR 08-06 12:48:27 multiproc_worker_utils.py:226]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 201, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 124, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 178, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 682, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 280, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 111, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in __init__
[rank0]:     self.model = LlamaModel(config,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in __init__
[rank0]:     self.start_layer, self.end_layer, self.layers = make_layers(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 144, in make_layers
[rank0]:     [PPMissingLayer() for _ in range(start_layer)] + [
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 145, in <listcomp>
[rank0]:     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 102, in maybe_offload_to_cpu
[rank0]:     cpu_data.copy_(p.data)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 78, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: NotImplementedError: Cannot copy out of meta tensor; no data!
ERROR 08-06 12:48:27 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 83 died, exit code: -15
INFO 08-06 12:48:27 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
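For context on what the error means: a tensor on PyTorch's "meta" device carries only shape and dtype metadata with no backing storage, so any attempt to copy data out of it raises exactly this NotImplementedError. A standalone PyTorch illustration of the failure mechanism (not vLLM code):

import torch

meta_param = torch.empty(4, 4, device="meta")               # metadata only, no storage
cpu_buf = torch.empty(4, 4, device="cpu", pin_memory=True)  # pinned CPU buffer for offload
cpu_buf.copy_(meta_param)  # NotImplementedError: Cannot copy out of meta tensor; no data!
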
w013nad commented 1 month ago

It appears to be broken for quantization in general, even without CPU offload.

python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/ndurkee/Meta-Llama-3.1-70B-Instruct/ --max-model-len 90000 -tp 4 --gpu-memory-utilization 0.99 --dtype auto --distributed-executor-backend mp --port 15001 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/  --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --quantization='fp8'
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: shape '[-1, 32]' is invalid for input of size 1, Traceback (most recent call last):
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     quant_method.process_weights_after_loading(module)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     prepare_fp8_layer_for_marlin(layer)
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     marlin_scales = marlin_permute_scales(s=scales,
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] RuntimeError: shape '[-1, 32]' is invalid for input of size 1
(VllmWorkerProcess pid=144) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: shape '[-1, 32]' is invalid for input of size 1, Traceback (most recent call last):
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     self.model_runner.load_model()
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     self.model = get_model(model_config=self.model_config,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     return loader.load_model(model_config=model_config,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     quant_method.process_weights_after_loading(module)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     prepare_fp8_layer_for_marlin(layer)
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     marlin_scales = marlin_permute_scales(s=scales,
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226]     s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
(VllmWorkerProcess pid=143) ERROR 08-06 14:49:20 multiproc_worker_utils.py:226] RuntimeError: shape '[-1, 32]' is invalid for input of size 1
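The shape error itself is easy to reproduce in isolation: dynamic FP8 quantization keeps a single per-tensor scale, and a one-element tensor cannot be reshaped into rows of 32 the way the Marlin scale permutation expects. A standalone illustration (plain PyTorch, not the vLLM code path):

import torch

per_tensor_scale = torch.tensor([1.0])  # dynamic FP8: one scale for the whole weight
per_tensor_scale.reshape((-1, 32))      # RuntimeError: shape '[-1, 32]' is invalid for input of size 1
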
youkaichao commented 1 month ago

cc @mgoin for quantization and cpu offloading. I feel this is a quantization issue, and it might be related to your quantized model.

@w013nad do you have an HF link for the model you are trying to use?

w013nad commented 1 month ago

> cc @mgoin for quantization and cpu offloading. I feel this is a quantization issue, and it might be related to your quantized model.
>
> @w013nad do you have an HF link for the model you are trying to use?

https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct

mgoin commented 1 month ago

I will look into this, but are you sure you are using 0.5.4? In your logs and collect_env output, it mentions 0.5.3.post1:

vLLM Version: 0.5.3.post1

and

INFO 08-06 12:48:23 api_server.py:219] vLLM API server version 0.5.3.post1
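
A quick way to double-check which wheel is actually installed in the environment the server is launched from (vLLM exposes its version string):

import vllm
print(vllm.__version__)  # should print 0.5.4 for the release wheel
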
w013nad commented 1 month ago

Shoot, some of this was with a prerelease wheel. There seem to be two separate issues here:

  1. fp8 doesn't work at all
    
    root@96aed4dedb69:/home/ndurkee# python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/Llama-3-8B-Instruct/ -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --quantization='fp8'
    INFO 08-06 18:47:55 api_server.py:339] vLLM API server version 0.5.4
    INFO 08-06 18:47:55 api_server.py:340] args: Namespace(host=None, port=5006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/ndurkee/Llama-3-8B-Instruct/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.79, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='fp8', rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/ndurkee/temp/llama3_70b_fixed/'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=10)
    WARNING 08-06 18:47:56 config.py:1454] Casting torch.bfloat16 to torch.float16.
    INFO 08-06 18:47:56 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/ndurkee/Llama-3-8B-Instruct/', speculative_config=None, tokenizer='/home/ndurkee/Llama-3-8B-Instruct/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/ndurkee/temp/llama3_70b_fixed/, use_v2_block_manager=True, enable_prefix_caching=True)
    WARNING 08-06 18:47:56 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
    INFO 08-06 18:47:56 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
    (VllmWorkerProcess pid=3133) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    (VllmWorkerProcess pid=3134) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    (VllmWorkerProcess pid=3135) INFO 08-06 18:47:56 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
    INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=3135) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=3134) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=3133) INFO 08-06 18:47:58 utils.py:841] Found nccl from library libnccl.so.2
    (VllmWorkerProcess pid=3135) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
    INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
    (VllmWorkerProcess pid=3134) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
    (VllmWorkerProcess pid=3133) INFO 08-06 18:47:58 pynccl.py:63] vLLM is using nccl==2.20.5
    INFO 08-06 18:47:59 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f9612647eb0>, local_subscribe_port=47097, remote_subscribe_port=None)
    INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
    (VllmWorkerProcess pid=3134) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
    (VllmWorkerProcess pid=3135) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
    (VllmWorkerProcess pid=3133) INFO 08-06 18:47:59 model_runner.py:720] Starting to load model /home/ndurkee/Llama-3-8B-Instruct/...
    Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.62it/s]
    Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.63it/s]
    Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:00<00:00,  3.67it/s]
    Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  3.26it/s]
    Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  3.16it/s]

WARNING 08-06 18:48:01 utils.py:578] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
ERROR 08-06 18:48:01 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 3135 died, exit code: -15
INFO 08-06 18:48:01 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor
    self._run_workers("load_model",
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 344, in load_model
    quant_method.process_weights_after_loading(module)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 212, in process_weights_after_loading
    prepare_fp8_layer_for_marlin(layer)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils_fp8.py", line 80, in prepare_fp8_layer_for_marlin
    marlin_scales = marlin_permute_scales(s=scales,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
    s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
RuntimeError: shape '[-1, 32]' is invalid for input of size 1
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
^CTraceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 370, in <module>
    asyncio.run(run_server(args))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/usr/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt


2. GPTQ cpu offload doesn't work

root@96aed4dedb69:/home/ndurkee# python3 -m vllm.entrypoints.openai.api_server --model /home/ndurkee/temp/llama3_8b_gptq -tp 4 --gpu-memory-utilization 0.79 --dtype auto --distributed-executor-backend mp --port 5006 --served-model-name /home/ndurkee/temp/llama3_70b_fixed/ --max-model-len 1000 --max-log-len 10 --use-v2-block-manager --disable-custom-all-reduce --enable-prefix-caching --cpu-offload-gb 5
INFO 08-06 18:45:29 api_server.py:339] vLLM API server version 0.5.4
INFO 08-06 18:45:29 api_server.py:340] args: Namespace(host=None, port=5006, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/home/ndurkee/temp/llama3_8b_gptq', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=1000, guided_decoding_backend='outlines', distributed_executor_backend='mp', worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=5.0, gpu_memory_utilization=0.79, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=True, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['/home/ndurkee/temp/llama3_70b_fixed/'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=10)
INFO 08-06 18:45:29 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-06 18:45:29 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 08-06 18:45:29 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/home/ndurkee/temp/llama3_8b_gptq', speculative_config=None, tokenizer='/home/ndurkee/temp/llama3_8b_gptq', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/home/ndurkee/temp/llama3_70b_fixed/, use_v2_block_manager=True, enable_prefix_caching=True)
WARNING 08-06 18:45:29 multiproc_gpu_executor.py:59] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 08-06 18:45:29 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:30 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:31 utils.py:841] Found nccl from library libnccl.so.2
INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:31 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 08-06 18:45:32 shm_broadcast.py:235] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fbf8e91a590>, local_subscribe_port=57567, remote_subscribe_port=None)
INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2602) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2603) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2604) INFO 08-06 18:45:32 model_runner.py:720] Starting to load model /home/ndurkee/temp/llama3_8b_gptq...
(VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Cannot copy out of meta tensor; no data!, Traceback (most recent call last): (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] output = executor(args, kwargs) (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model_runner.load_model() (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config, (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config, (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config, (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 154, in _initialize_model (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config, (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in init (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = LlamaModel(config, (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in init (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.start_layer, self.end_layer, self.layers = make_layers( (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 146, in make_layers (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_workerutils.py:226] [PPMissingLayer() for in range(start_layer)] + [ (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 147, in (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File 
"/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 104, in maybe_offload_to_cpu (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] cpudata.copy(p.data) (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function__ (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return func(*args, *kwargs) (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] NotImplementedError: Cannot copy out of meta tensor; no data! (VllmWorkerProcess pid=2604) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method load_model: Cannot copy out of meta tensor; no data!, Traceback (most recent call last): (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] output = executor(args, kwargs) (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model_runner.load_model() (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = get_model(model_config=self.model_config, (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return loader.load_model(model_config=model_config, (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] model = _initialize_model(model_config, self.load_config, (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 154, in _initialize_model (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return model_class(config=model_config.hf_config, (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in init (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.model = LlamaModel(config, (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in init (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] self.start_layer, self.end_layer, self.layers = make_layers( (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 
multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 146, in make_layers (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_workerutils.py:226] [PPMissingLayer() for in range(start_layer)] + [ (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 147, in (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 104, in maybe_offload_to_cpu (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] cpudata.copy(p.data) (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in torch_function__ (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] return func(*args, *kwargs) (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] NotImplementedError: Cannot copy out of meta tensor; no data! (VllmWorkerProcess pid=2603) ERROR 08-06 18:45:33 multiproc_worker_utils.py:226] ERROR 08-06 18:45:33 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 2602 died, exit code: -15 INFO 08-06 18:45:33 multiproc_worker_utils.py:123] Killing local vLLM worker processes Process Process-1: Traceback (most recent call last): File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap self.run() File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run self._target(self._args, **self._kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server server = AsyncEngineRPCServer(async_engine_args, usage_context, port) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in init self.engine = AsyncLLMEngine.from_engine_args(async_engine_args, File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args engine = cls( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in init self.engine = self._init_engine(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine return engine_class(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in init self.model_executor = executor_class( File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in init super().init(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init super().init(*args, *kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in init self._init_executor() File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 138, in _init_executor self._run_workers("load_model", File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers driver_worker_output = driver_worker_method(args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 139, in load_model 
self.model_runner.load_model() File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 722, in load_model self.model = get_model(model_config=self.model_config, File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 21, in get_model return loader.load_model(model_config=model_config, File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 324, in load_model model = _initialize_model(model_config, self.load_config, File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 154, in _initialize_model return model_class(config=model_config.hf_config, File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 384, in init self.model = LlamaModel(config, File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/llama.py", line 285, in init self.start_layer, self.end_layer, self.layers = make_layers( File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 146, in makelayers [PPMissingLayer() for in range(start_layer)] + [ File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 147, in maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}")) File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 104, in maybe_offload_to_cpu cpudata.copy(p.data) File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 79, in __torch_function return func(*args, **kwargs) NotImplementedError: Cannot copy out of meta tensor; no data! /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' ^CTraceback (most recent call last): File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib/python3.10/runpy.py", line 86, in _run_code exec(code, run_globals) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 370, in asyncio.run(run_server(args)) File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main) File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete self.run_forever() File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever self._run_once() File "/usr/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once event_list = self._selector.select(timeout) File "/usr/lib/python3.10/selectors.py", line 469, in select fd_event_list = self._selector.poll(timeout, max_ev) KeyboardInterrupt
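For readers following the traceback: maybe_offload_to_cpu appears to copy each parameter's data into a CPU buffer, which can only work when the parameter actually has storage. Below is a simplified sketch of that pattern; the function name and details are assumptions for illustration, not vLLM's exact implementation.

import torch
import torch.nn as nn

def offload_module_to_cpu(module: nn.Module) -> nn.Module:
    # Sketch only: copy every parameter into a (pinned) CPU buffer and
    # swap it in. A parameter on the "meta" device has no storage, so
    # cpu_data.copy_(p.data) raises the NotImplementedError seen above.
    for p in module.parameters():
        cpu_data = torch.empty_strided(size=p.data.size(),
                                       stride=p.data.stride(),
                                       dtype=p.data.dtype,
                                       device="cpu",
                                       pin_memory=torch.cuda.is_available())
        cpu_data.copy_(p.data)
        p.data = cpu_data
    return module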



GPTQ does work by itself. Note that this is on A100s.
mgoin commented 1 month ago

Okay, I confirmed that dynamic FP8 works fine on H100 but fails on A100. This is an issue with the dynamic FP8 Marlin backend.

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --quantization="fp8" --port 9000 
...
  File "/home/mgoin/venvs/vllm-rel/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 172, in marlin_permute_scales
    s = s.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]
RuntimeError: shape '[-1, 32]' is invalid for input of size 1
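For context, the shape '[-1, 32]' is invalid for input of size 1 error is consistent with a per-tensor FP8 scale hitting a permutation step that expects per-group scales. A minimal, hypothetical illustration in plain PyTorch (the block size of 32 and the variable names are assumptions, not vLLM internals):

import torch

scale_perm_single = list(range(32))   # Marlin permutes scale columns in blocks of 32 (assumed here)
per_tensor_scale = torch.ones(1)      # dynamic FP8 quantization yields a single scale per weight

# Mirrors the failing call in marlin_permute_scales:
# RuntimeError: shape '[-1, 32]' is invalid for input of size 1
s = per_tensor_scale.reshape((-1, len(scale_perm_single)))[:, scale_perm_single]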

It does work fine with models that are already quantized to FP8 on A100:

vllm serve neuralmagic/Meta-Llama-3-8B-Instruct-FP8 --quantization="fp8" --port 9000
...
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)

I opened a tracking issue here: https://github.com/vllm-project/vllm/issues/7216 and am looking into this first.

mgoin commented 1 month ago

@w013nad If you can build and test from source, please try my PR that fixes dynamic FP8 Marlin: https://github.com/vllm-project/vllm/pull/7219. It seems to fix the issue in my reproduction.

I will look into GPTQ CPU offloading now.

mgoin commented 1 month ago

Verified that forcing plain GPTQ (rather than GPTQ Marlin) with CPU offload works:

vllm serve Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4 --cpu-offload-gb 5 --quantization gptq
...
INFO 08-06 21:20:41 gptq_marlin.py:102] Detected that the model can run with gptq_marlin, however you specified quantization=gptq explicitly, so forcing gptq. Use quantization=gptq_marlin for faster inference
...
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

The issue is specifically with GPTQ Marlin:

vllm serve Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4 --cpu-offload-gb 5  
...
INFO 08-06 21:21:46 gptq_marlin.py:98] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
...
  File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 195, in <listcomp>
    maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
  File "/home/mgoin/code/vllm/vllm/model_executor/models/utils.py", line 152, in maybe_offload_to_cpu
    cpu_data.copy_(p.data)
  File "/home/mgoin/venvs/vllm/lib/python3.10/site-packages/torch/utils/_device.py", line 79, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
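One plausible reading of this traceback (an assumption based on the error, not a confirmed root cause) is that with gptq_marlin some layer parameters are still placeholder tensors on the "meta" device when the CPU-offload copy runs, and meta tensors carry no data to copy. A minimal reproduction of just that failure mode in plain PyTorch:

import torch

meta_param = torch.nn.Parameter(torch.empty(4, 4, device="meta"))
cpu_data = torch.empty(4, 4, device="cpu")

# Raises: NotImplementedError: Cannot copy out of meta tensor; no data!
cpu_data.copy_(meta_param.data)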
mgoin commented 1 month ago

@w013nad ditto for the GPTQ Marlin fix linked above ^

Thank you very much for reporting these issues, and my apologies for letting them slip through this release. I added explicit tests for both of these cases so they will be caught by automation going forward.
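For reference, a regression test for the offload path could look roughly like the sketch below; the model choice, parameter values, and test layout are assumptions for illustration, not the actual tests added to vLLM.

import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("quantization", [None, "gptq"])  # None lets gptq_marlin be auto-selected
def test_quantized_model_with_cpu_offload(quantization):
    # Build the engine with CPU offload enabled and check that it can generate.
    llm = LLM(model="Qwen/Qwen2-0.5B-Instruct-GPTQ-Int4",
              quantization=quantization,
              cpu_offload_gb=1)
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=8))
    assert outputs[0].outputs[0].text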

w013nad commented 1 month ago

Sorry, I'm not able to build from source. I'm stuck using your nightly PyPI packages or Docker images because this is a closed environment.

fzyzcjy commented 1 month ago

Looking forward to seeing this fix released! (I am seeing the same problem.)