microsoft / msccl-tools

Synthesizer for optimal collective communication algorithms
MIT License
98 stars 25 forks

Set sizes for algorithm registrations #13

Closed olsaarik closed 2 years ago

olsaarik commented 2 years ago

For the NDV2 algorithm, the size is based on numbers for 2 nodes. For NDV4, the numbers are based on 16 nodes. As these numbers may actually depend on the number of machines, in the future we should allow sizes to be sensitive to the number of machines.
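As a rough illustration of what machine-count-sensitive sizes could look like, here is a minimal sketch. It is not the msccl-tools registration API: `SizeRange`, `_RANGES`, and `size_range_for` are hypothetical names, and the byte values are placeholders rather than the numbers set in this PR. The only details taken from the description above are that the NDV2 sizes come from 2 nodes and the NDV4 sizes from 16 nodes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SizeRange:
    """Inclusive range of buffer sizes (in bytes) an algorithm is registered for."""
    min_bytes: int
    max_bytes: Optional[int] = None  # None means no upper bound

    def contains(self, nbytes: int) -> bool:
        if nbytes < self.min_bytes:
            return False
        return self.max_bytes is None or nbytes <= self.max_bytes


# Hypothetical table: machine type -> {number of machines -> size range}.
# The byte values are placeholders, NOT the values used in msccl-tools.
_RANGES: Dict[str, Dict[int, SizeRange]] = {
    "ndv2": {2: SizeRange(1 << 20, None)},      # sized on 2 nodes
    "ndv4": {16: SizeRange(1 << 10, 1 << 26)},  # sized on 16 nodes
}


def size_range_for(machine_type: str, num_machines: int) -> SizeRange:
    """Return the range for this machine count, or the nearest measured count."""
    by_count = _RANGES[machine_type]
    if num_machines in by_count:
        return by_count[num_machines]
    # Assumption: fall back to the closest machine count we have numbers for.
    nearest = min(by_count, key=lambda n: abs(n - num_machines))
    return by_count[nearest]
```

With a table like this, adding numbers for another cluster size would only mean adding one more entry under the machine type, for example `_RANGES["ndv4"][8] = SizeRange(...)`, without changing the lookup.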

Also includes a miscellaneous fix for making the NCCL_ALGOS logic more robust.