AjayKadoula opened this issue 8 months ago
Same problem on the same GPU... any progress?
Same issue on Ubuntu as well; the log below was captured with AMD_LOG_LEVEL=3:
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 651/651 [00:00<00:00, 252kB/s]
INFO 04-19 04:44:48 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 685/685 [00:00<00:00, 287kB/s]
vocab.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 899k/899k [00:00<00:00, 1.19MB/s]
merges.txt: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 20.5MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 441/441 [00:00<00:00, 646kB/s]
:3:rocdevice.cpp :445 : 3852326123 us: [pid:9 tid:0x7fcd5fa0c4c0] Initializing HSA stack.
:3:comgrctx.cpp :33 : 3852378915 us: [pid:9 tid:0x7fcd5fa0c4c0] Loading COMGR library.
:3:rocdevice.cpp :211 : 3852378983 us: [pid:9 tid:0x7fcd5fa0c4c0] Numa selects cpu agent[0]=0x859e1f0(fine=0x7c1f0a0,coarse=0x96cc5f0) for gpu agent=0x96cb260 CPU<->GPU XGMI=0
:3:rocdevice.cpp :1715: 3852379594 us: [pid:9 tid:0x7fcd5fa0c4c0] Gfx Major/Minor/Stepping: 9/0/10
:3:rocdevice.cpp :1717: 3852379601 us: [pid:9 tid:0x7fcd5fa0c4c0] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp :1719: 3852379605 us: [pid:9 tid:0x7fcd5fa0c4c0] Max SDMA Read Mask: 0x1e, Max SDMA Write Mask: 0x1f
:3:hip_context.cpp :48 : 3852380443 us: [pid:9 tid:0x7fcd5fa0c4c0] Direct Dispatch: 1
:3:hip_device_runtime.cpp :637 : 3852919412 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6160 )
:3:hip_device_runtime.cpp :639 : 3852919436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919489 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7fccaafe1f14 )
:3:hip_device_runtime.cpp :639 : 3852919494 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device.cpp :463 : 3852919500 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600 ( 0x7ffc2e1c5bd8, 0 )
:3:hip_device.cpp :465 : 3852919507 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919622 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6198 )
:3:hip_device_runtime.cpp :639 : 3852919626 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852919647 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f04 )
:3:hip_device_runtime.cpp :630 : 3852919652 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919658 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c5c80 )
:3:hip_device_runtime.cpp :639 : 3852919662 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920392 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d18, 0x7ffc2e1c5d1c )
:3:hip_context.cpp :358 : 3852920400 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920405 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f64 )
:3:hip_device_runtime.cpp :630 : 3852920409 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920414 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d78, 0x7ffc2e1c5d7c )
:3:hip_context.cpp :358 : 3852920418 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920425 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5ef4 )
:3:hip_device_runtime.cpp :630 : 3852920429 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920432 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d08, 0x7ffc2e1c5d0c )
:3:hip_context.cpp :358 : 3852920436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921568 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c6644 )
:3:hip_device_runtime.cpp :630 : 3852921575 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921698 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c64b4 )
:3:hip_device_runtime.cpp :630 : 3852921701 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921726 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c62c0 )
:3:hip_device_runtime.cpp :630 : 3852921730 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp :764 : 3852921741 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo ( 0x7ffc2e1c6298, 0x7ffc2e1c62a0 )
:1:rocdevice.cpp :1824: 3852921762 us: [pid:9 tid:0x7fcd5fa0c4c0] HSA_AMD_AGENT_INFO_MEMORY_AVAIL query failed.
:3:hip_memory.cpp :777 : 3852921767 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo: Returned hipErrorInvalidValue :
:3:hip_error.cpp :35 : 3852921769 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetLastError ( )
:3:hip_device_runtime.cpp :652 : 3852922327 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp :656 : 3852922332 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice: Returned hipSuccess :
Traceback (most recent call last):
File "/app/model/vllm_example.py", line 11, in TORCH_USE_HIP_DSA
to enable device-side assertions.
:1:hip_fatbin.cpp :83 : 3853425875 us: [pid:9 tid:0x7fcd5fa0c4c0] All Unique FDs are closed
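The traceback above is truncated in the paste (the exception text between the File line and the TORCH_USE_HIP_DSA hint is missing), but the engine-init line shows model='facebook/opt-125m', so /app/model/vllm_example.py is presumably close to the stock vLLM quickstart. A sketch of such a script, with the prompts and sampling parameters assumed:

```python
from vllm import LLM, SamplingParams

# Assumed contents of /app/model/vllm_example.py; the real script is not shown
# in the thread, only the model name from the engine-init log above.
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# The hipMemGetInfo failure in the AMD_LOG_LEVEL=3 trace happens during
# engine initialization, i.e. around this call, before any generation runs.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```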
Is it solved?
Same problem on the same GPU.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
System config (hostnamectl): Operating System: Red Hat Enterprise Linux 8.7 (Ootpa); Kernel: Linux 4.18.0-425.3.1.el8.x86_64; Architecture: x86-64
ROCm driver: 5.7.0; AMD driver: 5.7.0; vLLM container: embeddedllminfo/vllm-rocm vllm-v0.2.4; OS: RHEL 8.7; GPU: MI210
The same configuration works with RHEL 8.8, but with 8.7 it does not.
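Since the identical container works on RHEL 8.8 but fails on 8.7, one way to confirm that the problem sits in the ROCm runtime / kernel driver combination rather than in vLLM is to issue the same memory query directly from PyTorch inside the container. A minimal check, assuming the ROCm build of PyTorch shipped in the vLLM image (where torch.cuda.mem_get_info is backed by hipMemGetInfo):

```python
import torch

# On ROCm builds of PyTorch the torch.cuda.* APIs are backed by HIP.
print("HIP runtime:", torch.version.hip)
print("Device:", torch.cuda.get_device_name(0))

# This goes through hipMemGetInfo, the call that returns hipErrorInvalidValue
# in the AMD_LOG_LEVEL=3 trace above. On the affected RHEL 8.7 host it is
# expected to raise a RuntimeError; on RHEL 8.8 it should print sane numbers.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free={free_bytes / 2**30:.2f} GiB, total={total_bytes / 2**30:.2f} GiB")
```

If this minimal check fails the same way outside vLLM, the RHEL 8.7 kernel/driver stack cannot satisfy the HSA_AMD_AGENT_INFO_MEMORY_AVAIL query that the log shows failing, which would explain why the same container behaves differently on 8.8.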