microsoft / NPKit

NCCL Profiling Kit
MIT License

Can NPKit only trace workloads launched by MSCCL tests or more than that #27

Closed ZhiyiHu1999 closed 6 months ago

ZhiyiHu1999 commented 6 months ago

In the MSCCL example, I see that the workload is launched by the MSCCL tests. Does that mean NPKit can only trace workloads generated by the MSCCL tests? Or can NPKit trace arbitrary workloads that involve GPU communication?

yzygitzh commented 6 months ago

Hi, NPKit requires code modifications in the CCL library. Currently it’s officially supported by:

1) Azure MSCCL. This is MSCCL for NVIDIA production maintained by Azure. It’s a superset of NCCL and periodically syncs up with the latest NCCL.
2) AMD RCCL. Both NPKit and MSCCL are already merged upstream.

ZhiyiHu1999 commented 6 months ago

Thanks for the answer! But I think I meant something different: can we use NPKit to profile a real workload that is not generated by msccl-tests?

yzygitzh commented 6 months ago

Yes, just make sure that the workload uses a communication library binary that includes the NPKit modifications.
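One way to check which communication library a running workload has actually loaded is to inspect its memory maps. The following is a minimal sketch, not part of NPKit. It assumes Linux, a dynamically loaded library, and a shared-object name beginning with `libnccl`, `librccl`, or `libmsccl`. The helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the CCL shared object's file name starts with one of these stems.
LIB_PATTERN = re.compile(r"^lib(nccl|rccl|msccl)[^/]*\.so")


def loaded_ccl_libraries(maps_text):
    """Return distinct paths of NCCL/RCCL-like libraries found in a /proc/<pid>/maps dump."""
    paths = []
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        basename = path.rsplit("/", 1)[-1]
        if LIB_PATTERN.search(basename) and path not in paths:
            paths.append(path)
    return paths


def read_maps(pid="self"):
    """Read the memory map of a process (Linux only)."""
    return Path(f"/proc/{pid}/maps").read_text()
```

For a running job, `loaded_ccl_libraries(read_maps("<pid>"))` lists the library paths in use. If the path does not point at the build containing the NPKit modifications, the workload is not using the instrumented binary.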

ZhiyiHu1999 commented 6 months ago

OK, thanks!