mcrl / tccl

Thunder Research Group's Collective Communication Library

pathfinder error #1

Open zfy3000163 opened 3 months ago

zfy3000163 commented 3 months ago

When I execute the program pathfinder, I get the error shown in the screenshot below. What is the reason for this? Thanks!

zfy3000163 commented 3 months ago

Do you have a prepared container image file?

csehydrogen commented 3 months ago

Hello, TCCL currently only considers single-NIC systems, as it is specialized to find a single ring path. As for the "No rank available" error, the number of spawned MPI processes seems too small. If a system has 8 GPUs, you need to spawn at least 19 (= 3 + 8 * 2) processes per node, on all 3 nodes.
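
For readers with a different GPU count, the arithmetic in the reply above can be sketched as a small helper. This is only an illustration: the function name and the `fixed` and `per_gpu` parameters are hypothetical, and it assumes the quoted figure generalizes as 3 + 2 * (GPUs per node), which the thread confirms only for the 8-GPU case.

```python
def min_processes_per_node(num_gpus: int, fixed: int = 3, per_gpu: int = 2) -> int:
    """Lower bound on MPI processes to spawn on each node.

    Assumption: the "3 + 8 * 2" figure from the reply generalizes to
    fixed + per_gpu * num_gpus. Only the 8-GPU case is stated in the thread.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return fixed + per_gpu * num_gpus


if __name__ == "__main__":
    gpus_per_node = 8  # the case discussed in the thread
    num_nodes = 3      # the reporter's setup, per the reply

    per_node = min_processes_per_node(gpus_per_node)
    print(f"processes per node: {per_node}")
    print(f"total across {num_nodes} nodes: {per_node * num_nodes}")
```

With 8 GPUs per node this gives the 19 processes per node quoted above; the total is simply that figure multiplied by the number of nodes.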

zfy3000163 commented 3 months ago

https://github.com/mcrl/tccl/issues/1#issuecomment-2099760936 Thank you very much for your answer, it is very useful!

Kelvin-Ng commented 3 months ago

Is there a fundamental reason multi-NIC systems are not supported?

Do you have any suggestions on where to start if I would like to extend it to support multi-NIC systems?