Closed: ferrybaltimore closed this issue 2 months ago
Why do you use torchrun here?
For tensor parallel inference, did you check https://docs.vllm.ai/en/stable/serving/distributed_serving.html?
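(For reference, the distributed serving page linked above launches the OpenAI-compatible server with a `--tensor-parallel-size` flag. A minimal sketch along those lines, reusing the model name and port from this issue rather than anything prescribed by the docs:)

```bash
# Sketch of tensor-parallel serving per the linked distributed serving docs.
# The model name and port below are taken from this issue, not from the docs.
python3 -m vllm.entrypoints.openai.api_server \
    --model Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --port 8010
```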
Hi @youkaichao, I used torchrun because it seems to be the default GPU executor on the ROCm branch. But if I try to force Ray with --worker-use-ray, it fails too, with a different error ...
We don't use torchrun. Sorry, I don't know about the status of the ROCm fork.
cc @hongxiayang if you can help.
The command line for running the server on ROCm/vllm should be:

```
python3 -m vllm.entrypoints.openai.api_server --distributed-executor-backend mp <<your other parameters such as --model -tp etc>>
```
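(A hypothetical filled-in version of that template for the setup in this issue might look like the following; the model, GPU count, and port are carried over from the original torchrun command, and the only difference from a standard tensor-parallel launch is the explicit `--distributed-executor-backend mp` flag:)

```bash
# Hypothetical filled-in invocation of the template above, using the model,
# tensor-parallel size (8 GPUs), and port from the original torchrun command.
python3 -m vllm.entrypoints.openai.api_server \
    --distributed-executor-backend mp \
    --model Meta-Llama-3.1-405B-Instruct \
    -tp 8 \
    --port 8010
```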
Hi @gshtras, it works perfectly! Thanks!
Your current environment
The output of `python collect_env.py`
```
Collecting environment information...
Failed to import from vllm._C with No module named 'vllm._C'
PyTorch version: 2.3.0a0+git2e4abc8
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40093-bd86f1708

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40093
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 1
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9654 96-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2400.000
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4793.00
Virtualization: AMD-V
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache: 192 MiB
L3 cache: 768 MiB
NUMA node0 CPU(s): 0-95
NUMA node1 CPU(s): 96-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] optree==0.9.1
[pip3] torch==2.3.0a0+git2e4abc8
[pip3] torchvision==0.18.0a0+6f0deb9
[pip3] triton==3.0.0
[conda] No relevant packages
ROCM Version: 6.1.40093-bd86f1708
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect
```

🐛 Describe the bug
When I run the following with the latest version of the ROCm vLLM fork (https://github.com/ROCm/vllm):
```
torchrun --standalone --nnodes=1 --nproc_per_node=8 ./vllm/vllm/entrypoints/openai/api_server.py --model Meta-Llama-3.1-405B-Instruct --port 8010
```
I got:
```
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 3.25 GiB. GPU 0 has a total capacity of 191.98 GiB of which 1.32 GiB is free. Of the allocated memory 5.01 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 08-28 11:24:09 selector.py:56] Using ROCmFlashAttention backend.
```
I used the Docker image from yesterday.