volcengine / veScale

A PyTorch Native LLM Training Framework
http://vescale.xyz
Apache License 2.0
572 stars 28 forks

[QUESTION] implementation of `get_p2p_cuda_stream_id` and `get_coll_cuda_stream_id` #46

Closed nooblyh closed 2 weeks ago

nooblyh commented 1 month ago

In the README of ndtimeline, you mentioned implementing interfaces to obtain the streams for NCCL communication, specifically get_p2p_cuda_stream_id and get_coll_cuda_stream_id. However, these interfaces seem not present in the patches directory. Do you plan to release the specific implementation of these interfaces?

vocaltract commented 1 month ago

Thanks a lot for your comments. We will supplement these two functions soon.

XLzed commented 3 weeks ago

> Thanks a lot for your comments. We will supplement these two functions soon.

Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API in PyTorch and used ndtimeline in Megatron-LM to try to capture the timeline of async NCCL communication, but once I enabled cudaEvent recording, the async NCCL communication started executing sequentially with the computation kernels. Is this normal, and how can I fix it?

XLzed commented 3 weeks ago

> Thanks a lot for your comments. We will supplement these two functions soon.
>
> Does ndtimeline support async NCCL streams, such as grad_reduce/params_gather overlap and interleaved p2p overlap? I added a get_stream_id API in PyTorch and used ndtimeline in Megatron-LM to try to capture the timeline of async NCCL communication, but once I enabled cudaEvent recording, the async NCCL communication started executing sequentially with the computation kernels. Is this normal, and how can I fix it?

I found that the problem is caused by the environment variable CUDA_DEVICE_MAX_CONNECTIONS=1, which is required for TP/SP communication overlap. But I don't understand why it makes the async NCCL ops instrumented with ndtimer execute sequentially.
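For context, CUDA_DEVICE_MAX_CONNECTIONS controls how many hardware work queues the CUDA driver allocates per device, which is general CUDA behavior rather than anything veScale-specific. A minimal sketch of the setting and the likely mechanism behind the serialization seen above:

```python
import os

# Megatron-LM requires CUDA_DEVICE_MAX_CONNECTIONS=1 so that TP/SP
# communication kernels are issued to the device in a deterministic
# order relative to compute kernels; its overlap scheme relies on
# this ordering. (Must be set before the CUDA context is created.)
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"

# With only one connection, work submitted from *different* CUDA
# streams is funneled through a single hardware work queue. Extra
# operations injected on a stream -- such as the cudaEventRecord
# calls a timer adds around NCCL kernels -- can therefore create
# false ordering dependencies and serialize work that would overlap
# when more connections are available.
max_connections = int(os.environ["CUDA_DEVICE_MAX_CONNECTIONS"])
print(max_connections)
```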

vocaltract commented 3 weeks ago

> In the README of ndtimeline, you mentioned implementing interfaces to obtain the streams for NCCL communication, specifically get_p2p_cuda_stream_id and get_coll_cuda_stream_id. However, these interfaces seem not present in the patches directory. Do you plan to release the specific implementation of these interfaces?

You can check here

vocaltract commented 2 weeks ago

The PR is now merged.