Adevils opened this issue 3 months ago
Can you try starting an API server (https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) and querying it with the OpenAI client? Manually driving the async LLM engine can be error-prone.
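For example, something like the following should work once the server is up. This is a minimal sketch, assuming the server was started as described in the linked docs with `google/gemma-2b` on the default port 8000 and without API-key auth; adjust the base URL, model name, and sampling values to your setup.

```python
from openai import OpenAI

# Point the official OpenAI client at the vLLM OpenAI-compatible server.
# vLLM does not validate the API key by default, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="google/gemma-2b",   # must match the model the server was launched with
    prompt="What is 5+3?",
    max_tokens=150,
    temperature=0.01,
)
print(completion.choices[0].text)
```

The server batches and schedules requests internally, so concurrency is then just a matter of issuing multiple requests, e.g. with the async OpenAI client or from several FastAPI handlers.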
I partitioned the GPU into two MIG devices:

MIG 3g.40gb Device 0:
MIG 3g.40gb Device 1:

I want to run a different model on each one. How can I specify which device ID to use when running with Docker?
Same Error
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.6
Libc version: glibc-2.31

Python version: 3.10.14 (main, Apr 6 2024, 18:45:05) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1068-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
  MIG 3g.40gb  Device 0:
  MIG 3g.40gb  Device 1:

Nvidia driver version: 535.183.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Byte Order:              Little Endian
Address sizes:           48 bits physical, 48 bits virtual
CPU(s):                  24
On-line CPU(s) list:     0-23
Thread(s) per core:      1
Core(s) per socket:      24
Socket(s):               1
NUMA node(s):            1
Vendor ID:               AuthenticAMD
CPU family:              25
Model:                   1
Model name:              AMD EPYC 7V13 64-Core Processor
Stepping:                1
CPU MHz:                 2445.437
BogoMIPS:                4890.87
Hypervisor vendor:       Microsoft
Virtualization type:     full
L1d cache:               768 KiB
L1i cache:               768 KiB
L2 cache:                12 MiB
L3 cache:                96 MiB
NUMA node0 CPU(s):       0-23
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET, no microcode
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
### 🐛 Describe the bug
```python
import asyncio
import os
from uuid import uuid4

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Pin this process to a specific MIG device.
device_id = "MIG-0f67f02c-98bf-5250-9b0d-530252d6817f"
os.environ["CUDA_VISIBLE_DEVICES"] = device_id


def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="google/gemma-2b",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.2,
            max_model_len=1024,
            dtype="bfloat16",
        )
    )


if __name__ == "__main__":
    main()
```
Below is the terminal output when the code is run:
```text
INFO 07-19 10:36:52 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=google/gemma-2b, use_v2_block_manager=False, enable_prefix_caching=False)
WARNING 07-19 10:36:53 gemma.py:56] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (`gelu_pytorch_tanh`). If you want to use the legacy `gelu`, edit the config JSON to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
INFO 07-19 10:36:53 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 07-19 10:36:54 model_runner.py:255] Loading model weights took 4.7384 GB
INFO 07-19 10:36:55 gpu_executor.py:84] # GPU blocks: 3834, # CPU blocks: 14563
INFO 07-19 10:36:57 model_runner.py:924] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-19 10:36:57 model_runner.py:928] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 07-19 10:37:02 model_runner.py:1117] Graph capturing finished in 5 secs.
INFO 07-19 10:37:02 async_llm_engine.py:646] Received request a0f2ca29-5058-4ad0-9a17-59ed431ab860: prompt: 'What is 5+3?', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.1, temperature=0.01, top_p=0.9, top_k=10, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=150, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 07-19 10:37:02 async_llm_engine.py:168] Aborted request a0f2ca29-5058-4ad0-9a17-59ed431ab860.
INFO 07-19 10:37:02 async_llm_engine.py:50] Engine is gracefully shutting down.
[RequestOutput(request_id=a0f2ca29-5058-4ad0-9a17-59ed431ab860, prompt='What is 5+3?', prompt_token_ids=[2, 1841, 603, 235248, 235308, 235340, 235304, 235336], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='\n\n', token_ids=(109,), cumulative_logprob=0.0, logprobs=None, finish_reason=None, stop_reason=None)], finished=False, metrics=RequestMetrics(arrival_time=1721385422.8926418, last_token_time=1721385422.9454794, first_scheduled_time=1721385422.8951688, first_token_time=1721385422.9452248, time_in_queue=0.002526998519897461, finished_time=None), lora_request=None)]
```

Why am I facing this error? I want to handle concurrent requests using vLLM and FastAPI. Is there any basic source code available?
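In case it helps anyone hitting the same thing: the `Aborted request ... / Engine is gracefully shutting down` lines, together with `finished=False` in the printed `RequestOutput`, are usually what you see when the async generator returned by `engine.generate(...)` is not iterated to completion before the program (or request handler) exits. Below is a minimal sketch of driving `AsyncLLMEngine` from an async function, written against the vLLM 0.5.x API; the prompts and sampling values are illustrative, and for production serving the OpenAI-compatible server suggested above remains the simpler path.

```python
import asyncio
from uuid import uuid4

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams


async def generate_one(engine: AsyncLLMEngine, prompt: str) -> str:
    """Submit one prompt and consume the stream until the request finishes."""
    params = SamplingParams(temperature=0.01, top_p=0.9, max_tokens=150)
    final_output = None
    # engine.generate(...) yields incremental RequestOutput objects; the request
    # is only complete once an output with finished=True has been received, so
    # the loop must run to the end (breaking early aborts the request).
    async for output in engine.generate(prompt, params, request_id=str(uuid4())):
        final_output = output
    return final_output.outputs[0].text


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="google/gemma-2b",
            tensor_parallel_size=1,
            gpu_memory_utilization=0.2,
            max_model_len=1024,
            dtype="bfloat16",
        )
    )
    # Concurrent requests are just concurrent coroutines; the engine batches them.
    prompts = ["What is 5+3?", "Name three colors."]
    answers = await asyncio.gather(*(generate_one(engine, p) for p in prompts))
    for prompt, answer in zip(prompts, answers):
        print(f"{prompt!r} -> {answer!r}")


if __name__ == "__main__":
    asyncio.run(main())
```

For a FastAPI integration, the same async generator can be consumed inside an endpoint (or streamed via `StreamingResponse`), which is essentially what the example server in `vllm/entrypoints/api_server.py` in the vLLM repository does.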