triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Failed to run on H100 GPU with tensor para=8 #166

Open sfc-gh-zhwang opened 12 months ago

sfc-gh-zhwang commented 12 months ago

The same setup works fine on A100x8, but on H100x8 I see the errors below.

Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid:     30) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000001677b uct_iface_mp_chunk_alloc_inner()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:467
 2 0x000000000001677b uct_iface_mp_chunk_alloc()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/base/uct_mem.c:443
 3 0x0000000000052c4b ucs_mpool_grow()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:266
 4 0x0000000000052ec9 ucs_mpool_get_grow()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/ucs/datastruct/mpool.c:316
 5 0x000000000001b418 uct_mm_iface_t_init()  /build-result/src/hpcx-v2.14-gcc-inbox-ubuntu22.04-cuda11-gdrcopy2-nccl2.16-x86_64/ucx-5e8621f95002cf2ad7135987c2a7dc32d4fc72fb/src/uct/sm/mm/base/mm_iface.c:821
bddppq commented 2 months ago

We have run into the same issue. Does anyone have any clue?

Wenhan-Tan commented 2 months ago

Hi @sfc-gh-zhwang , have you found a solution yet? I'm having the same issue here with running it on Kubernetes.

sphish commented 2 months ago

@Wenhan-Tan I just encountered the same issue. The reason I ran into this problem was that I had enabled hugepages on the physical machine, and UCX triggered a SIGBUS when trying to allocate memory using hugepages. Everything worked fine after I disabled hugepages.
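For anyone who wants to confirm whether their node is in the same state before touching anything, here is a small diagnostic sketch (not from the original report) that only reads the standard Linux procfs/sysfs entries for hugepages. It assumes a Linux host; it does not change any settings, it just shows whether static hugepages or transparent hugepages are enabled, which is the condition sphish describes as triggering the UCX SIGBUS.

```python
#!/usr/bin/env python3
"""Diagnostic sketch: report the hugepage configuration UCX may try to use.

Read-only; based on standard Linux procfs/sysfs paths. The actual fix
described above is disabling hugepages on the host (for example setting
vm.nr_hugepages back to 0 and removing any hugetlbfs reservations), which
requires root and a restart of the affected services.
"""
from pathlib import Path


def read(path: str) -> str:
    p = Path(path)
    return p.read_text().strip() if p.exists() else "<not present>"


# Static (pre-allocated) hugepages: a non-zero HugePages_Total means
# hugetlbfs-backed memory is available and UCX may attempt to use it.
for line in read("/proc/meminfo").splitlines():
    if "Huge" in line:
        print(line)

# Transparent hugepages are a separate mechanism; printed for completeness.
print("THP:", read("/sys/kernel/mm/transparent_hugepage/enabled"))
```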

Wenhan-Tan commented 2 months ago

@sphish Thank you! I saw another similar issue here (https://github.com/NVIDIA/TensorRT-LLM/issues/674) which uses TRT-LLM instead of FT. But in that issue, huge pages needed to be enabled. I'll try disabling huge pages first.

sphish commented 2 months ago

> @sphish Thank you! I saw another similar issue here (NVIDIA/TensorRT-LLM#674) which uses TRT-LLM instead of FT. But in that issue, huge pages needed to be enabled. I'll try disabling huge pages first.

I think the key is that the container and the bare-metal host need to have the same hugepages configuration.
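As a rough way to check that, one could run the same read-only probe on the bare-metal host and again inside the Triton container and diff the output. This is only a sketch under the assumption that both environments expose the standard procfs files; the function names are mine, not part of any UCX or Triton tooling.

```python
#!/usr/bin/env python3
"""Sketch: dump the hugepage-related view so the host and the container
output can be compared side by side. Run once on bare metal and once
inside the container."""
from pathlib import Path


def hugepage_lines(meminfo: str = "/proc/meminfo") -> list[str]:
    # The HugePages_* counters come from the host kernel, so a difference
    # between host and container usually points at extra hugetlbfs mounts
    # or runtime limits rather than different kernel settings.
    return [l for l in Path(meminfo).read_text().splitlines() if "Huge" in l]


def hugetlbfs_mounts(mounts: str = "/proc/mounts") -> list[str]:
    return [l for l in Path(mounts).read_text().splitlines() if "hugetlbfs" in l]


if __name__ == "__main__":
    print("\n".join(hugepage_lines()))
    print("\n".join(hugetlbfs_mounts()) or "no hugetlbfs mounts")
```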