ZhiyiHu1999 closed 6 months ago
Hi, NPKit requires code modifications in the CCL library. Currently it’s officially supported by: 1) Azure MSCCL. This is the MSCCL for NVIDIA maintained by Azure. It’s a superset of NCCL and periodically syncs up with the latest NCCL. 2) AMD RCCL. Both NPKit and MSCCL are already merged upstream there.
Thanks for the answer! But I think I meant something different: can we use NPKit to profile a real workload that is not generated by msccl-tests?
Yes, just make sure that the workload uses the communication library binary built with the NPKit modifications.
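One way to check which communication library a running workload actually loaded is to look at its memory mappings on Linux. The snippet below is only a rough sketch, not part of NPKit: the function names and the file-name pattern are my own assumptions, and it only sees dynamically loaded libraries. If the framework links the library statically, nothing will show up and the framework itself would need to be rebuilt against the modified library.

```python
import re
from pathlib import Path

# Assumed naming pattern for CCL shared objects (libnccl, librccl, libmsccl).
# Adjust it if your build produces a different file name.
_CCL_PATTERN = re.compile(r"lib(nccl|rccl|msccl)[^/]*\.so[^/]*$")


def ccl_libraries_in_maps(maps_text: str) -> list[str]:
    """Return the distinct CCL shared-object paths found in /proc/<pid>/maps text."""
    found = []
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping without a backing file
        path = fields[5].strip()
        if _CCL_PATTERN.search(path) and path not in found:
            found.append(path)
    return found


def ccl_libraries_of_process(pid: str = "self") -> list[str]:
    """Read /proc/<pid>/maps (Linux only) and list the CCL libraries mapped there."""
    return ccl_libraries_in_maps(Path(f"/proc/{pid}/maps").read_text())
```

Calling `ccl_libraries_of_process("<pid of the workload>")` while the job is running should print the path of the loaded library. If that path points at your NPKit-enabled build rather than a system-wide install, the workload is using the right binary.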
OK, thanks!
I found that in the MSCCL example the workload is launched by the MSCCL tests. Does that mean NPKit can only trace workloads generated by the MSCCL tests, or can NPKit trace arbitrary workloads that involve GPU communication?