Open guanyonglai opened 1 year ago
[8d6bf97c3bf4:3039 :0:3095] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 3095) ====
 0 0x0000000000043090 killpg() ???:0
 1 0x000000000018bb41 __nss_database_lookup() ???:0
 2 0x000000000007587d ncclGroupEnd() ???:0
 3 0x000000000007b0ef ncclGroupEnd() ???:0
 4 0x0000000000059e97 ncclGetUniqueId() ???:0
 5 0x00000000000489b1 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 6 0x000000000004a655 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
 7 0x0000000000063dcc ncclRedOpDestroy() ???:0
 8 0x0000000000008609 start_thread() ???:0
 9 0x000000000011f133 clone() ???:0
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
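This happens because the shared memory that docker run allocates by default is too small: only 64M. You can pass --shm-size="6g" to docker run to allocate more shared memory to the container.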
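To confirm that shared memory is the limiting factor before relaunching, it may help to check the size of the shared-memory mount from inside the container. The following is a minimal sketch, not part of the original report: the helper name `shm_report`, the `/dev/shm` mount path, and the 1 GiB threshold are all assumptions chosen for illustration.

```python
import os
import shutil


def shm_report(path="/dev/shm", min_bytes=1 << 30):
    """Return (total_bytes, free_bytes, ok) for the filesystem at `path`.

    `ok` is True when the total size is at least `min_bytes`.
    The 1 GiB default is an illustrative threshold, not an NCCL requirement.
    """
    usage = shutil.disk_usage(path)
    return usage.total, usage.free, usage.total >= min_bytes


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    total, free, ok = shm_report()
    print(f"/dev/shm total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
    if not ok:
        # Hypothetical advice string; adjust the size to your workload.
        print("Shared memory looks small; consider a larger --shm-size on docker run.")
```

In a container started with Docker's default settings, this would be expected to report a total of about 64 MiB, matching the diagnosis above. After restarting with a larger `--shm-size`, the reported total should grow accordingly.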