Closed: ferrybaltimore closed this issue 2 months ago
Why do you use torchrun here?
For tensor parallel inference, did you check https://docs.vllm.ai/en/stable/serving/distributed_serving.html?
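(For reference, the distributed serving page linked above launches the OpenAI-compatible server with a `--tensor-parallel-size` flag. A minimal sketch along those lines, reusing the model name and port from this issue rather than anything prescribed by the docs:)

```bash
# Sketch of tensor-parallel serving per the linked distributed serving docs.
# The model name and port below are taken from this issue, not from the docs.
python3 -m vllm.entrypoints.openai.api_server \
    --model Meta-Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --port 8010
```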
Hi @youkaichao, I used torchrun because it seems to be the default GPU executor on the ROCm branch. But if I try to force Ray with --worker-use-ray, it fails too, with a different error ...
We don't use torchrun. Sorry, I don't know about the status of the ROCm fork.
cc @hongxiayang if you can help.
The command line for running the server on ROCm/vllm should be:

```
python3 -m vllm.entrypoints.openai.api_server --distributed-executor-backend mp <<your other parameters such as --model -tp etc>>
```
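(A hypothetical filled-in version of that template for the setup in this issue might look like the following; the model, GPU count, and port are carried over from the original torchrun command, and the only difference from a standard tensor-parallel launch is the explicit `--distributed-executor-backend mp` flag:)

```bash
# Hypothetical filled-in invocation of the template above, using the model,
# tensor-parallel size (8 GPUs), and port from the original torchrun command.
python3 -m vllm.entrypoints.openai.api_server \
    --distributed-executor-backend mp \
    --model Meta-Llama-3.1-405B-Instruct \
    -tp 8 \
    --port 8010
```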
Hi @gshtras, it works perfectly! Thanks!
Your current environment
The output of `python collect_env.py`
```
Collecting environment information...
Failed to import from vllm._C with No module named 'vllm._C'
PyTorch version: 2.3.0a0+git2e4abc8
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.1.40093-bd86f1708

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.1.2 24193 669db884972e769450470020c06a6f132a8a065b)
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.1.40093
MIOpen runtime version: 3.1.0
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 1
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9654 96-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 2400.000
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4793.00
Virtualization: AMD-V
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache: 192 MiB
L3 cache: 768 MiB
NUMA node0 CPU(s): 0-95
NUMA node1 CPU(s): 96-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] mypy==1.7.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] optree==0.9.1
[pip3] torch==2.3.0a0+git2e4abc8
[pip3] torchvision==0.18.0a0+6f0deb9
[pip3] triton==3.0.0
[conda] No relevant packages
ROCM Version: 6.1.40093-bd86f1708
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology: Could not collect
```

🐛 Describe the bug
When I run the following with the latest version of the ROCm vLLM fork (https://github.com/ROCm/vllm):
```
torchrun --standalone --nnodes=1 --nproc_per_node=8 ./vllm/vllm/entrypoints/openai/api_server.py --model Meta-Llama-3.1-405B-Instruct --port 8010
```
I got:
```
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_device.py", line 77, in __torch_function__
    return func(*args, **kwargs)
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 3.25 GiB. GPU 0 has a total capacity of 191.98 GiB of which 1.32 GiB is free. Of the allocated memory 5.01 GiB is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
INFO 08-28 11:24:09 selector.py:56] Using ROCmFlashAttention backend.
```
I used the Docker image from yesterday.