Open AjayKadoula opened 4 months ago
Same problem on the same GPU. Any progress?
I am facing the same issue on Ubuntu as well. Here is the log with AMD_LOG_LEVEL=3:
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 252kB/s]
INFO 04-19 04:44:48 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 287kB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.19MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 20.5MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 646kB/s]
:3:rocdevice.cpp :445 : 3852326123 us: [pid:9 tid:0x7fcd5fa0c4c0] Initializing HSA stack.
:3:comgrctx.cpp :33 : 3852378915 us: [pid:9 tid:0x7fcd5fa0c4c0] Loading COMGR library.
:3:rocdevice.cpp :211 : 3852378983 us: [pid:9 tid:0x7fcd5fa0c4c0] Numa selects cpu agent[0]=0x859e1f0(fine=0x7c1f0a0,coarse=0x96cc5f0) for gpu agent=0x96cb260 CPU<->GPU XGMI=0
:3:rocdevice.cpp :1715: 3852379594 us: [pid:9 tid:0x7fcd5fa0c4c0] Gfx Major/Minor/Stepping: 9/0/10
:3:rocdevice.cpp :1717: 3852379601 us: [pid:9 tid:0x7fcd5fa0c4c0] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp :1719: 3852379605 us: [pid:9 tid:0x7fcd5fa0c4c0] Max SDMA Read Mask: 0x1e, Max SDMA Write Mask: 0x1f
:3:hip_context.cpp :48 : 3852380443 us: [pid:9 tid:0x7fcd5fa0c4c0] Direct Dispatch: 1
:3:hip_device_runtime.cpp :637 : 3852919412 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6160 )
:3:hip_device_runtime.cpp :639 : 3852919436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919489 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7fccaafe1f14 )
:3:hip_device_runtime.cpp :639 : 3852919494 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device.cpp :463 : 3852919500 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600 ( 0x7ffc2e1c5bd8, 0 )
:3:hip_device.cpp :465 : 3852919507 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919622 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6198 )
:3:hip_device_runtime.cpp :639 : 3852919626 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852919647 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f04 )
:3:hip_device_runtime.cpp :630 : 3852919652 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919658 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c5c80 )
:3:hip_device_runtime.cpp :639 : 3852919662 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920392 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d18, 0x7ffc2e1c5d1c )
:3:hip_context.cpp :358 : 3852920400 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920405 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f64 )
:3:hip_device_runtime.cpp :630 : 3852920409 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920414 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d78, 0x7ffc2e1c5d7c )
:3:hip_context.cpp :358 : 3852920418 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920425 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5ef4 )
:3:hip_device_runtime.cpp :630 : 3852920429 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920432 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d08, 0x7ffc2e1c5d0c )
:3:hip_context.cpp :358 : 3852920436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921568 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c6644 )
:3:hip_device_runtime.cpp :630 : 3852921575 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921698 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c64b4 )
:3:hip_device_runtime.cpp :630 : 3852921701 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921726 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c62c0 )
:3:hip_device_runtime.cpp :630 : 3852921730 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp :764 : 3852921741 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo ( 0x7ffc2e1c6298, 0x7ffc2e1c62a0 )
:1:rocdevice.cpp :1824: 3852921762 us: [pid:9 tid:0x7fcd5fa0c4c0] HSA_AMD_AGENT_INFO_MEMORY_AVAIL query failed.
:3:hip_memory.cpp :777 : 3852921767 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo: Returned hipErrorInvalidValue :
:3:hip_error.cpp :35 : 3852921769 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetLastError ( )
:3:hip_device_runtime.cpp :652 : 3852922327 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp :656 : 3852922332 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice: Returned hipSuccess :
Traceback (most recent call last):
File "/app/model/vllm_example.py", line 11, in TORCH_USE_HIP_DSA
to enable device-side assertions.
:1:hip_fatbin.cpp :83 : 3853425875 us: [pid:9 tid:0x7fcd5fa0c4c0] All Unique FDs are closed
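For anyone trying to narrow down a trace like the one above, here is a minimal sketch that filters an AMD_LOG_LEVEL=3 log down to the HIP calls that did not return hipSuccess, plus any level-1 lines. The helper name `find_failures` and the regular expressions are my own assumptions based only on the line format visible in this log, not part of ROCm or vLLM, and I am assuming that level 1 marks error messages.

```python
import re

# Assumed line format, taken from the log above, e.g.
# ":3:hip_memory.cpp :777 : 3852921767 us: [pid:9 tid:0x...] hipMemGetInfo: Returned hipErrorInvalidValue :"
LINE_RE = re.compile(
    r"^:(?P<level>\d):(?P<src>\S+)\s*:(?P<line>\d+)\s*:\s*(?P<ts>\d+) us: "
    r"\[[^\]]*\]\s*(?P<msg>.*)$"
)
# "<api>: Returned <status>" messages emitted when a HIP call finishes
RETURN_RE = re.compile(r"^(?P<api>\w+): Returned (?P<status>\w+)")


def find_failures(log_text):
    """Return (source_file, api_or_None, detail) for each suspicious line.

    Hypothetical helper: a line is reported if a HIP call returned anything
    other than hipSuccess, or if it was logged at level 1.
    """
    failures = []
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a runtime trace line (progress bars, Python output, ...)
        msg = m.group("msg").strip()
        ret = RETURN_RE.match(msg)
        if ret and ret.group("status") != "hipSuccess":
            failures.append((m.group("src"), ret.group("api"), ret.group("status")))
        elif m.group("level") == "1":
            failures.append((m.group("src"), None, msg))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < captured_log.txt
    for src, api, detail in find_failures(sys.stdin.read()):
        print(f"{src}: {api or '(runtime)'}: {detail}")
```

Run against the log in this comment, it should point at the `HSA_AMD_AGENT_INFO_MEMORY_AVAIL query failed.` line and the `hipMemGetInfo` call returning `hipErrorInvalidValue`. It also reports the level-1 `All Unique FDs are closed` line, which may just be shutdown noise.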
Is it solved?
System config (from hostnamectl):
Operating System: Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel: Linux 4.18.0-425.3.1.el8.x86_64
Architecture: x86-64

ROCm driver: 5.7.0
AMD driver: 5.7.0
vLLM container version: embeddedllminfo/vllm-rocm vllm-v0.2.4
OS: RHEL 8.7
GPU: MI210
Also, the same configuration works on RHEL 8.8, but it does not work on RHEL 8.7.