vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: CUDA_VISIBLE_DEVICES not detected #7890

Closed paolovic closed 2 months ago

paolovic commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 08-27 02:59:41 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.28

Python version: 3.11.9 (main, Jun 19 2024, 10:02:06) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] (64-bit runtime)
Python platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S-48C
GPU 1: NVIDIA L40S-48C
GPU 2: NVIDIA L40S-48C

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.9.3.0
/usr/lib64/libcudnn_adv.so.9.3.0
/usr/lib64/libcudnn_cnn.so.9.3.0
/usr/lib64/libcudnn_engines_precompiled.so.9.3.0
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/lib64/libcudnn_graph.so.9.3.0
/usr/lib64/libcudnn_heuristic.so.9.3.0
/usr/lib64/libcudnn_ops.so.9.3.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  1
Core(s) per socket:  12
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8462Y+
Stepping:            8
CPU MHz:             2799.999
BogoMIPS:            5599.99
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            61440K
NUMA node0 CPU(s):   0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==2.5.1
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@e397b92f84b7771cfd04b8fbb87894e9ec95f873
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     0-11            0               N/A
GPU1    PIX      X      PIX     0-11            0               N/A
GPU2    PIX     PIX      X      0-11            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

Hi,

I am trying to execute the following llm.py from https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html

from typing import Dict, Optional, List
import logging

from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import StreamingResponse, JSONResponse

from ray import serve

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_engine import LoRAModulePath
from vllm.utils import FlexibleArgumentParser

logger = logging.getLogger("ray.serve")

app = FastAPI()

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=10,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        engine_args: AsyncEngineArgs,
        response_role: str,
        lora_modules: Optional[List[LoRAModulePath]] = None,
        chat_template: Optional[str] = None,
    ):
        logger.info(f"Starting with engine args: {engine_args}")
        self.openai_serving_chat = None
        self.engine_args = engine_args
        self.response_role = response_role
        self.lora_modules = lora_modules
        self.chat_template = chat_template
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/v1/chat/completions")
    async def create_chat_completion(
        self, request: ChatCompletionRequest, raw_request: Request
    ):
        """OpenAI-compatible HTTP endpoint.

        API reference:
            - https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
        """
        if not self.openai_serving_chat:
            model_config = await self.engine.get_model_config()
            # Determine the name of the served model for the OpenAI client.
            if self.engine_args.served_model_name is not None:
                served_model_names = self.engine_args.served_model_name
            else:
                served_model_names = [self.engine_args.model]
            self.openai_serving_chat = OpenAIServingChat(
                self.engine,
                model_config,
                served_model_names,
                self.response_role,
                self.lora_modules,
                self.chat_template,
            )
        logger.info(f"Request: {request}")
        generator = await self.openai_serving_chat.create_chat_completion(
            request, raw_request
        )
        if isinstance(generator, ErrorResponse):
            return JSONResponse(
                content=generator.model_dump(), status_code=generator.code
            )
        if request.stream:
            return StreamingResponse(content=generator, media_type="text/event-stream")
        else:
            assert isinstance(generator, ChatCompletionResponse)
            return JSONResponse(content=generator.model_dump())

def parse_vllm_args(cli_args: Dict[str, str]):
    """Parses vLLM args based on CLI inputs.

    Currently uses argparse because vLLM doesn't expose Python models for all of the
    config options we want to support.
    """
    arg_parser = FlexibleArgumentParser(
        description="vLLM OpenAI-Compatible RESTful API server."
    )

    parser = make_arg_parser(arg_parser)
    arg_strings = []
    for key, value in cli_args.items():
        arg_strings.extend([f"--{key}", str(value)])
    logger.info(arg_strings)
    parsed_args = parser.parse_args(args=arg_strings)
    return parsed_args

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    """Builds the Serve app based on CLI arguments.

    See https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server
    for the complete set of arguments.

    Supported engine arguments: https://docs.vllm.ai/en/latest/models/engine_args.html.
    """  # noqa: E501
    parsed_args = parse_vllm_args(cli_args)
    engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
    engine_args.worker_use_ray = True

    tp = engine_args.tensor_parallel_size
    logger.info(f"Tensor parallelism = {tp}")
    pg_resources = []
    pg_resources.append({"CPU": 1})  # for the deployment replica
    for i in range(tp):
        pg_resources.append({"CPU": 1, "GPU": 1})  # for the vLLM actors

    # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
    # the same Ray node.
    return VLLMDeployment.options(
        placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK"
    ).bind(
        engine_args,
        parsed_args.response_role,
        parsed_args.lora_modules,
        parsed_args.chat_template,
    )

I execute it as follows:

serve run llm:build_app model="/u01/data/analytics/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" tensor-parallel-size=2
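
For reference, the key=value pairs given to serve run end up as the cli_args dict that build_app receives, so the same app can also be built directly from Python for debugging (a minimal sketch, assuming the llm.py above is importable as llm):

from ray import serve

from llm import build_app

# serve run passes the key=value arguments to build_app as a Dict[str, str]
cli_args = {
    "model": "/u01/data/analytics/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/",
    "tensor-parallel-size": "2",
}
app = build_app(cli_args)
serve.run(app)  # roughly equivalent to the serve run invocation above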

Unfortunately, it fails when trying to detect CUDA_VISIBLE_DEVICES:

(ServeReplica:default:VLLMDeployment pid=875095) INFO 2024-08-27 02:44:24,616 default_VLLMDeployment nxudbh1m llm.py:44 - Starting with engine args: AsyncEngineArgs(model='/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/', served_model_name=None, tokenizer='/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=None, worker_use_ray=True, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False)
(ServeReplica:default:VLLMDeployment pid=875095) ERROR 2024-08-27 02:44:24,692 default_VLLMDeployment nxudbh1m replica.py:1199 - Exception during graceful shutdown of replica: 'VLLMDeployment' object has no attribute '_serve_asgi_lifespan'
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 1193, in call_destructor
(ServeReplica:default:VLLMDeployment pid=875095)     await self._call_func_or_gen(self._callable.__del__)
(ServeReplica:default:VLLMDeployment pid=875095)     result = await result
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/api.py", line 225, in __del__
(ServeReplica:default:VLLMDeployment pid=875095)     await ASGIAppReplicaWrapper.__del__(self)
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/http_util.py", line 472, in __del__
(ServeReplica:default:VLLMDeployment pid=875095)     with LoggingContext(self._serve_asgi_lifespan.logger, level=logging.WARNING):
(ServeReplica:default:VLLMDeployment pid=875095) AttributeError: 'VLLMDeployment' object has no attribute '_serve_asgi_lifespan'
(ServeController pid=874997) INFO 2024-08-27 02:44:24,798 controller 874997 deployment_state.py:2182 - Replica(id='nxudbh1m', deployment='VLLMDeployment', app='default') is stopped.
^C2024-08-27 02:44:28,414       INFO scripts.py:585 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=874997) INFO 2024-08-27 02:44:28,463 controller 874997 deployment_state.py:1860 - Removing 1 replica from Deployment(name='VLLMDeployment', app='default').
(ServeController pid=874997) INFO 2024-08-27 02:44:28,568 controller 874997 deployment_state.py:2182 - Replica(id='258tpy6w', deployment='VLLMDeployment', app='default') is stopped.
(ServeController pid=874997) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ServeReplica:default:VLLMDeployment.initialize_and_get_metadata() (pid=875205, ip=159.103.253.239, actor_id=ef425b96a2e5be70fd0d8d8001000000, repr=<ray.serve._private.replica.ServeReplica:default:VLLMDeployment object at 0x7f3fd259e550>)
(ServeController pid=874997)   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
(ServeController pid=874997)     return self.__get_result()
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(ServeController pid=874997)     raise self._exception
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 631, in initialize_and_get_metadata
(ServeController pid=874997)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=874997) RuntimeError: Traceback (most recent call last):
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 609, in initialize_and_get_metadata
(ServeController pid=874997)     await self._user_callable_wrapper.initialize_callable()
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 901, in initialize_callable
(ServeController pid=874997)     await self._call_func_or_gen(
(ServeController pid=874997)     result = callable(*args, **kwargs)
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/api.py", line 219, in __init__
(ServeController pid=874997)     cls.__init__(self, *args, **kwargs)
(ServeController pid=874997)   File "/projects/llm-apis/llm.py", line 50, in __init__
(ServeController pid=874997)     self.engine = AsyncLLMEngine.from_engine_args(engine_args)
(ServeController pid=874997)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/engine/async_llm_engine.py", line 661, in from_engine_args
(ServeController pid=874997)     engine_config = engine_args.create_engine_config()
(ServeController pid=874997)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/engine/arg_utils.py", line 771, in create_engine_config
(ServeController pid=874997)     model_config = ModelConfig(
(ServeController pid=874997)   File "/paolovic/vllm/vllm/config.py", line 227, in __init__
(ServeController pid=874997)     self._verify_quantization()
(ServeController pid=874997)   File "/paolovic/vllm/vllm/config.py", line 285, in _verify_quantization
(ServeController pid=874997)     quantization_override = method.override_quantization_method(
(ServeController pid=874997)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 94, in override_quantization_method
(ServeController pid=874997)     can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
(ServeController pid=874997)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 142, in is_gptq_marlin_compatible
(ServeController pid=874997)     return check_marlin_supported(quant_type=cls.TYPE_MAP[(num_bits, sym)],
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
(ServeController pid=874997)     cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
(ServeController pid=874997)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
(ServeController pid=874997)     major, minor = current_platform.get_device_capability()
(ServeController pid=874997)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/platforms/cuda.py", line 96, in get_device_capability
(ServeController pid=874997)     physical_device_id = device_id_to_physical_device_id(device_id)
(ServeController pid=874997)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id
(ServeController pid=874997)     return int(physical_device_id)
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997) ValueError: invalid literal for int() with base 10: ''

Although I have set export CUDA_VISIBLE_DEVICES=0,1,2:

(venv) [paolovic@abcd project]$ echo $CUDA_VISIBLE_DEVICES
0,1,2

Thank you very much for any help!


paolovic commented 2 months ago

How can I report / ban @jia6214876?

danielhanchen commented 2 months ago

Definitely malware - it's also happening on other repos like Unsloth: https://github.com/unslothai/unsloth/issues/960

This seems to be spreading a bit. I'm guessing the vLLM maintainers are working overtime to report and block them.

youkaichao commented 2 months ago

first of all, notice that:

WARNING 08-27 02:59:41 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.

please uninstall pynvml.
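
nvidia-ml-py installs under the same pynvml module name, so a quick sanity check after swapping the packages could look like this (a sketch, assuming an NVIDIA driver is available on the node):

import pynvml  # now provided by nvidia-ml-py instead of the deprecated pynvml package

pynvml.nvmlInit()
print(pynvml.nvmlSystemGetDriverVersion())  # e.g. 535.129.03
print(pynvml.nvmlDeviceGetCount())          # should report the 3 L40S GPUs above
pynvml.nvmlShutdown()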

second, you can try to place some debug print statements in:

(ServeController pid=874997) File "/paolovic/vllm/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id

it looks very strange that you get this error.
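
for example, a debug statement like this at the top of device_id_to_physical_device_id would show whether the variable is missing or set to an empty string (a sketch only, not the actual vLLM source):

import logging
import os

logger = logging.getLogger(__name__)

# %r makes the difference between an unset variable (None) and an empty string ('') visible in the logs
logger.info("CUDA_VISIBLE_DEVICES=%r", os.environ.get("CUDA_VISIBLE_DEVICES"))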

paolovic commented 2 months ago

Hi @youkaichao, thank you very much for your support again! FYI: I have built vLLM from source.

I removed pynvml but it didn't help.

So I am logging logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") in /paolovic/vllm/vllm/platforms/cuda.py, and it shows an empty environment variable:

(ServeReplica:default:VLLMDeployment pid=1016724) INFO 08-27 13:30:03 cuda.py:83] CUDA_VISIBLE_DEVICES:

In fact, I adapted the code in cuda.py like so

def device_id_to_physical_device_id(device_id: int) -> int:
    # Debug: log what this replica process actually sees (it comes back empty).
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
    import ipdb; ipdb.set_trace()  # drop into a debugger to inspect the environment
    # Temporary workaround: hardcode the devices so the lookup below succeeds.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        device_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
        physical_device_id = device_ids[device_id]
        return int(physical_device_id)
    else:
        return device_id

To be able to continue for now, I hardcoded os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2". Hardcoding gets me past the error for this development phase, but it clearly cannot be used in production, so I have to find the root cause.

Furthermore, I run out of CUDA memory with the current hardcoded CUDA_VISIBLE_DEVICES setup. Accordingly, I wanted to set enforce_eager to reduce memory consumption:

serve run llm:build_app model="/models/llama-3-70b-instruct-awq-main-4bit/" tensor-parallel-size=2 quantization=awq enforce_eager=True

FYI, I had to strip the "True" value from the argument list with arg_strings = [x for x in arg_strings if x != "True"] in parse_vllm_args(cli_args: Dict[str, str]) in llm.py.
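
Something like this in parse_vllm_args would probably be cleaner than filtering the list afterwards (just a sketch, assuming serve run passes all values as strings; flag-style options are then emitted without a value):

arg_strings = []
for key, value in cli_args.items():
    if str(value).lower() == "true":
        # boolean flags such as enforce-eager take no value on the vLLM CLI
        arg_strings.append(f"--{key}")
    else:
        arg_strings.extend([f"--{key}", str(value)])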

But without the hardcoded os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2" in cuda.py it still fails. I have to fix this; hardcoding is not an option.

youkaichao commented 2 months ago

for enforce_eager, it is a flag. please use --enforce_eager.

So, I am logging out logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") in /paolovic/vllm/vllm/platforms/cuda.py and it returns an empty environment variable (ServeReplica:default:VLLMDeployment pid=1016724) INFO 08-27 13:30:03 cuda.py:83] CUDA_VISIBLE_DEVICES:

your problem arises from the fact that CUDA_VISIBLE_DEVICES is set to an empty string, which means CUDA devices are disabled.

you need to check which part of the code changes this.
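
one way to narrow it down is to log the variable at each stage of the replica's startup and see where it flips from '0,1,2' to an empty string (a sketch with a hypothetical helper, not part of vllm or ray):

import os

def log_cvd(stage: str) -> None:
    # repr() shows whether the value is None (unset) or '' (cleared)
    print(f"[{stage}] CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r}")

# e.g. call log_cvd("driver script"), log_cvd("VLLMDeployment.__init__"),
# and log_cvd("before AsyncLLMEngine.from_engine_args"), then compare the outputs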

youkaichao commented 2 months ago

https://github.com/vllm-project/vllm/pull/7924 will give you a clear error message.

but still, I think the problem does not come from the vLLM side. you can keep investigating which part of your code leads to this problem.

paolovic commented 2 months ago

for enforce_eager, it is a flag. please use --enforce_eager .

Using it as a flag leads to an error, hence my workaround.

thank you very much @youkaichao

mcd01 commented 5 days ago

Another workaround to the problem is described here; it might be of relevance to others.