vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: CUDA_VISIBLE_DEVICES not detected #7890

Closed paolovic closed 2 months ago

paolovic commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 08-27 02:59:41 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.10 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.28

Python version: 3.11.9 (main, Jun 19 2024, 10:02:06) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] (64-bit runtime)
Python platform: Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L40S-48C
GPU 1: NVIDIA L40S-48C
GPU 2: NVIDIA L40S-48C

Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.9.3.0
/usr/lib64/libcudnn_adv.so.9.3.0
/usr/lib64/libcudnn_cnn.so.9.3.0
/usr/lib64/libcudnn_engines_precompiled.so.9.3.0
/usr/lib64/libcudnn_engines_runtime_compiled.so.9.3.0
/usr/lib64/libcudnn_graph.so.9.3.0
/usr/lib64/libcudnn_heuristic.so.9.3.0
/usr/lib64/libcudnn_ops.so.9.3.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  1
Core(s) per socket:  12
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8462Y+
Stepping:            8
CPU MHz:             2799.999
BogoMIPS:            5599.99
Hypervisor vendor:   VMware
Virtualization type: full
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            61440K
NUMA node0 CPU(s):   0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.555.43
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] sentence-transformers==2.5.1
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@e397b92f84b7771cfd04b8fbb87894e9ec95f873
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PIX     0-11            0               N/A
GPU1    PIX      X      PIX     0-11            0               N/A
GPU2    PIX     PIX      X      0-11            0               N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

Hi,

I am trying to execute the following llm.py from https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html

from typing import Dict, Optional, List
import logging

from fastapi import FastAPI
from starlette.requests import Request
from starlette.responses import StreamingResponse, JSONResponse

from ray import serve

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    ErrorResponse,
)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_engine import LoRAModulePath
from vllm.utils import FlexibleArgumentParser

logger = logging.getLogger("ray.serve")

app = FastAPI()

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 5,
    },
    max_ongoing_requests=10,
)
@serve.ingress(app)
class VLLMDeployment:
    def __init__(
        self,
        engine_args: AsyncEngineArgs,
        response_role: str,
        lora_modules: Optional[List[LoRAModulePath]] = None,
        chat_template: Optional[str] = None,
    ):
        logger.info(f"Starting with engine args: {engine_args}")
        self.openai_serving_chat = None
        self.engine_args = engine_args
        self.response_role = response_role
        self.lora_modules = lora_modules
        self.chat_template = chat_template
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    @app.post("/v1/chat/completions")
    async def create_chat_completion(
        self, request: ChatCompletionRequest, raw_request: Request
    ):
        """OpenAI-compatible HTTP endpoint.

        API reference:
            - https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
        """
        if not self.openai_serving_chat:
            model_config = await self.engine.get_model_config()
            # Determine the name of the served model for the OpenAI client.
            if self.engine_args.served_model_name is not None:
                served_model_names = self.engine_args.served_model_name
            else:
                served_model_names = [self.engine_args.model]
            self.openai_serving_chat = OpenAIServingChat(
                self.engine,
                model_config,
                served_model_names,
                self.response_role,
                self.lora_modules,
                self.chat_template,
            )
        logger.info(f"Request: {request}")
        generator = await self.openai_serving_chat.create_chat_completion(
            request, raw_request
        )
        if isinstance(generator, ErrorResponse):
            return JSONResponse(
                content=generator.model_dump(), status_code=generator.code
            )
        if request.stream:
            return StreamingResponse(content=generator, media_type="text/event-stream")
        else:
            assert isinstance(generator, ChatCompletionResponse)
            return JSONResponse(content=generator.model_dump())

def parse_vllm_args(cli_args: Dict[str, str]):
    """Parses vLLM args based on CLI inputs.

    Currently uses argparse because vLLM doesn't expose Python models for all of the
    config options we want to support.
    """
    arg_parser = FlexibleArgumentParser(
        description="vLLM OpenAI-Compatible RESTful API server."
    )

    parser = make_arg_parser(arg_parser)
    arg_strings = []
    for key, value in cli_args.items():
        arg_strings.extend([f"--{key}", str(value)])
    logger.info(arg_strings)
    parsed_args = parser.parse_args(args=arg_strings)
    return parsed_args

def build_app(cli_args: Dict[str, str]) -> serve.Application:
    """Builds the Serve app based on CLI arguments.

    See https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#command-line-arguments-for-the-server
    for the complete set of arguments.

    Supported engine arguments: https://docs.vllm.ai/en/latest/models/engine_args.html.
    """  # noqa: E501
    parsed_args = parse_vllm_args(cli_args)
    engine_args = AsyncEngineArgs.from_cli_args(parsed_args)
    engine_args.worker_use_ray = True

    tp = engine_args.tensor_parallel_size
    logger.info(f"Tensor parallelism = {tp}")
    pg_resources = []
    pg_resources.append({"CPU": 1})  # for the deployment replica
    for i in range(tp):
        pg_resources.append({"CPU": 1, "GPU": 1})  # for the vLLM actors

    # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
    # the same Ray node.
    return VLLMDeployment.options(
        placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK"
    ).bind(
        engine_args,
        parsed_args.response_role,
        parsed_args.lora_modules,
        parsed_args.chat_template,
    )

I execute it as follows:

serve run llm:build_app model="/u01/data/analytics/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/" tensor-parallel-size=2
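
For reference, the key=value pairs given to serve run end up as the cli_args dict that build_app receives, so the same app can also be built directly from Python for debugging (a minimal sketch, assuming the llm.py above is importable as llm):

from ray import serve

from llm import build_app

# serve run passes the key=value arguments to build_app as a Dict[str, str]
cli_args = {
    "model": "/u01/data/analytics/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/",
    "tensor-parallel-size": "2",
}
app = build_app(cli_args)
serve.run(app)  # roughly equivalent to the serve run invocation above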

Unfortunately, it fails when trying to detect CUDA_VISIBLE_DEVICES:

(ServeReplica:default:VLLMDeployment pid=875095) INFO 2024-08-27 02:44:24,616 default_VLLMDeployment nxudbh1m llm.py:44 - Starting with engine args: AsyncEngineArgs(model='/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/', served_model_name=None, tokenizer='/models/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4/', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, seed=0, max_model_len=None, worker_use_ray=True, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, enable_lora=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='outlines', speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, qlora_adapter_name_or_path=None, disable_logprobs_during_spec_decoding=None, otlp_traces_endpoint=None, collect_detailed_traces=None, engine_use_ray=False, disable_log_requests=False)
(ServeReplica:default:VLLMDeployment pid=875095) ERROR 2024-08-27 02:44:24,692 default_VLLMDeployment nxudbh1m replica.py:1199 - Exception during graceful shutdown of replica: 'VLLMDeployment' object has no attribute '_serve_asgi_lifespan'
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 1193, in call_destructor
(ServeReplica:default:VLLMDeployment pid=875095)     await self._call_func_or_gen(self._callable.__del__)
(ServeReplica:default:VLLMDeployment pid=875095)     result = await result
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/api.py", line 225, in __del__
(ServeReplica:default:VLLMDeployment pid=875095)     await ASGIAppReplicaWrapper.__del__(self)
(ServeReplica:default:VLLMDeployment pid=875095)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/http_util.py", line 472, in __del__
(ServeReplica:default:VLLMDeployment pid=875095)     with LoggingContext(self._serve_asgi_lifespan.logger, level=logging.WARNING):
(ServeReplica:default:VLLMDeployment pid=875095) AttributeError: 'VLLMDeployment' object has no attribute '_serve_asgi_lifespan'
(ServeController pid=874997) INFO 2024-08-27 02:44:24,798 controller 874997 deployment_state.py:2182 - Replica(id='nxudbh1m', deployment='VLLMDeployment', app='default') is stopped.
^C2024-08-27 02:44:28,414       INFO scripts.py:585 -- Got KeyboardInterrupt, shutting down...
(ServeController pid=874997) INFO 2024-08-27 02:44:28,463 controller 874997 deployment_state.py:1860 - Removing 1 replica from Deployment(name='VLLMDeployment', app='default').
(ServeController pid=874997) INFO 2024-08-27 02:44:28,568 controller 874997 deployment_state.py:2182 - Replica(id='258tpy6w', deployment='VLLMDeployment', app='default') is stopped.
(ServeController pid=874997) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ServeReplica:default:VLLMDeployment.initialize_and_get_metadata() (pid=875205, ip=159.103.253.239, actor_id=ef425b96a2e5be70fd0d8d8001000000, repr=<ray.serve._private.replica.ServeReplica:default:VLLMDeployment object at 0x7f3fd259e550>)
(ServeController pid=874997)   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 449, in result
(ServeController pid=874997)     return self.__get_result()
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
(ServeController pid=874997)     raise self._exception
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 631, in initialize_and_get_metadata
(ServeController pid=874997)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=874997) RuntimeError: Traceback (most recent call last):
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 609, in initialize_and_get_metadata
(ServeController pid=874997)     await self._user_callable_wrapper.initialize_callable()
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/_private/replica.py", line 901, in initialize_callable
(ServeController pid=874997)     await self._call_func_or_gen(
(ServeController pid=874997)     result = callable(*args, **kwargs)
(ServeController pid=874997)   File "/venv/lib64/python3.11/site-packages/ray/serve/api.py", line 219, in __init__
(ServeController pid=874997)     cls.__init__(self, *args, **kwargs)
(ServeController pid=874997)   File "/projects/llm-apis/llm.py", line 50, in __init__
(ServeController pid=874997)     self.engine = AsyncLLMEngine.from_engine_args(engine_args)
(ServeController pid=874997)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/engine/async_llm_engine.py", line 661, in from_engine_args
(ServeController pid=874997)     engine_config = engine_args.create_engine_config()
(ServeController pid=874997)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/engine/arg_utils.py", line 771, in create_engine_config
(ServeController pid=874997)     model_config = ModelConfig(
(ServeController pid=874997)   File "/paolovic/vllm/vllm/config.py", line 227, in __init__
(ServeController pid=874997)     self._verify_quantization()
(ServeController pid=874997)   File "/paolovic/vllm/vllm/config.py", line 285, in _verify_quantization
(ServeController pid=874997)     quantization_override = method.override_quantization_method(
(ServeController pid=874997)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 94, in override_quantization_method
(ServeController pid=874997)     can_convert = cls.is_gptq_marlin_compatible(hf_quant_cfg)
(ServeController pid=874997)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/gptq_marlin.py", line 142, in is_gptq_marlin_compatible
(ServeController pid=874997)     return check_marlin_supported(quant_type=cls.TYPE_MAP[(num_bits, sym)],
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
(ServeController pid=874997)     cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
(ServeController pid=874997)               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
(ServeController pid=874997)     major, minor = current_platform.get_device_capability()
(ServeController pid=874997)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/platforms/cuda.py", line 96, in get_device_capability
(ServeController pid=874997)     physical_device_id = device_id_to_physical_device_id(device_id)
(ServeController pid=874997)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997)   File "/paolovic/vllm/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id
(ServeController pid=874997)     return int(physical_device_id)
(ServeController pid=874997)            ^^^^^^^^^^^^^^^^^^^^^^^
(ServeController pid=874997) ValueError: invalid literal for int() with base 10: ''

Although I have set export CUDA_VISIBLE_DEVICES=0,1,2:

(venv) [paolovic@abcd project]$ echo $CUDA_VISIBLE_DEVICES
0,1,2

Thank you very much for any help!


paolovic commented 2 months ago

How can I report / ban @jia6214876?

danielhanchen commented 2 months ago

Definitely malware - it's also happening on other repos like Unsloth: https://github.com/unslothai/unsloth/issues/960

This seems to be spreading a bit. I'm guessing the vLLM maintainers are working overtime to report and block them.

youkaichao commented 2 months ago

first of all, notice that:

WARNING 08-27 02:59:41 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.

please uninstall pynvml.
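
nvidia-ml-py installs under the same pynvml module name, so a quick sanity check after swapping the packages could look like this (a sketch, assuming an NVIDIA driver is available on the node):

import pynvml  # now provided by nvidia-ml-py instead of the deprecated pynvml package

pynvml.nvmlInit()
print(pynvml.nvmlSystemGetDriverVersion())  # e.g. 535.129.03
print(pynvml.nvmlDeviceGetCount())          # should report the 3 L40S GPUs above
pynvml.nvmlShutdown()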

second, you can try to place some debug print statements in:

(ServeController pid=874997) File "/paolovic/vllm/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id

it looks very strange that you get this error.
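
for example, a debug statement like this at the top of device_id_to_physical_device_id would show whether the variable is missing or set to an empty string (a sketch only, not the actual vLLM source):

import logging
import os

logger = logging.getLogger(__name__)

# %r makes the difference between an unset variable (None) and an empty string ('') visible in the logs
logger.info("CUDA_VISIBLE_DEVICES=%r", os.environ.get("CUDA_VISIBLE_DEVICES"))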

paolovic commented 2 months ago

Hi @youkaichao, thank you very much for your support again! FYI: I have built vLLM from source.

I removed pynvml but it didn't help.

So I am logging logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") in /paolovic/vllm/vllm/platforms/cuda.py, and it shows an empty environment variable:

(ServeReplica:default:VLLMDeployment pid=1016724) INFO 08-27 13:30:03 cuda.py:83] CUDA_VISIBLE_DEVICES:

In fact, I adapted the code in cuda.py like so

def device_id_to_physical_device_id(device_id: int) -> int:
    # Debug: log what this replica process actually sees (it comes back empty).
    logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}")
    import ipdb; ipdb.set_trace()  # drop into a debugger to inspect the environment
    # Temporary workaround: hardcode the devices so the lookup below succeeds.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        device_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
        physical_device_id = device_ids[device_id]
        return int(physical_device_id)
    else:
        return device_id

To be able to continue for now, I hardcoded os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2". Hardcoding gets me past the error for this development phase, but it clearly cannot be used in production, so I have to find the root cause.

Furthermore, I run out of CUDA memory with the current hardcoded CUDA_VISIBLE_DEVICES setup. Accordingly, I wanted to set enforce_eager to reduce memory consumption:

serve run llm:build_app model="/models/llama-3-70b-instruct-awq-main-4bit/" tensor-parallel-size=2 quantization=awq enforce_eager=True

FYI, I had to strip the "True" value from the argument list with arg_strings = [x for x in arg_strings if x != "True"] in parse_vllm_args(cli_args: Dict[str, str]) in llm.py.
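
Something like this in parse_vllm_args would probably be cleaner than filtering the list afterwards (just a sketch, assuming serve run passes all values as strings; flag-style options are then emitted without a value):

arg_strings = []
for key, value in cli_args.items():
    if str(value).lower() == "true":
        # boolean flags such as enforce-eager take no value on the vLLM CLI
        arg_strings.append(f"--{key}")
    else:
        arg_strings.extend([f"--{key}", str(value)])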

But without the hardcoded os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2" in cuda.py it still fails. I have to fix this; hardcoding is not an option.

youkaichao commented 2 months ago

for enforce_eager, it is a flag. please use --enforce_eager.

So, I am logging out logger.info(f"CUDA_VISIBLE_DEVICES: {os.environ['CUDA_VISIBLE_DEVICES']}") in /paolovic/vllm/vllm/platforms/cuda.py and it returns an empty environment variable (ServeReplica:default:VLLMDeployment pid=1016724) INFO 08-27 13:30:03 cuda.py:83] CUDA_VISIBLE_DEVICES:

your problem arises from the fact that CUDA_VISIBLE_DEVICES is set to an empty string, which means CUDA devices are disabled.

you need to check which part of the code changes this.
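
one way to narrow it down is to log the variable at each stage of the replica's startup and see where it flips from '0,1,2' to an empty string (a sketch with a hypothetical helper, not part of vllm or ray):

import os

def log_cvd(stage: str) -> None:
    # repr() shows whether the value is None (unset) or '' (cleared)
    print(f"[{stage}] CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')!r}")

# e.g. call log_cvd("driver script"), log_cvd("VLLMDeployment.__init__"),
# and log_cvd("before AsyncLLMEngine.from_engine_args"), then compare the outputs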

youkaichao commented 2 months ago

https://github.com/vllm-project/vllm/pull/7924 will give you a clear error message.

but still, I think the problem does not come from the vLLM side. you can keep investigating which part of your code leads to this problem.

paolovic commented 2 months ago

for enforce_eager, it is a flag. please use --enforce_eager .

Using it as a flag leads to an error, hence my workaround.

thank you very much @youkaichao

mcd01 commented 5 days ago

Another workaround to the problem is described here; it might be of relevance to others.