microsoft / NPKit

NCCL Profiling Kit
MIT License

Time Synchronization Problem between GPU and CPU in NPKit #32

Open ZhiyiHu1999 opened 2 weeks ago

ZhiyiHu1999 commented 2 weeks ago

Hello! In NPKit, the thread NpKit::CpuTimestampUpdateThread() loops to update the CPU timestamp, and the updated value is written through the pointer cpu_timestamp_. To synchronize the CPU and GPU timelines, NPKit records a CPU SYNC event and a GPU SYNC event at almost the same time; these log the values read from cpu_timestamp_ and from clock64().
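To make the role of the sync pair concrete, here is a minimal sketch of how one (CPU timestamp, clock64()) pair could be used to place GPU events on the CPU timeline. It is not NPKit's actual post-processing code. The names SyncPoint and GpuTicksToCpuTicks, and the assumption that both clocks tick at constant known rates (cpu_hz, gpu_hz), are illustrative only.

```cpp
#include <cstdint>

// One CPU/GPU sync pair. Both values are assumed to be sampled at
// (almost) the same instant. Hypothetical type, not part of NPKit.
struct SyncPoint {
  uint64_t cpu_ticks;  // value read through cpu_timestamp_
  uint64_t gpu_ticks;  // value returned by clock64()
};

// Map a GPU clock64() reading onto the CPU timeline using one sync pair.
// cpu_hz and gpu_hz are the tick rates of the two clocks (assumed constant).
inline double GpuTicksToCpuTicks(uint64_t gpu_ticks, const SyncPoint& sync,
                                 double cpu_hz, double gpu_hz) {
  // Signed difference so that events recorded before the sync point
  // also convert correctly.
  const double gpu_delta = static_cast<double>(
      static_cast<int64_t>(gpu_ticks - sync.gpu_ticks));
  return static_cast<double>(sync.cpu_ticks) + gpu_delta * (cpu_hz / gpu_hz);
}
```

Under this model, if the CPU value captured at the sync point is stale by some amount d, every converted GPU event is shifted by the same d, which is why the freshness of cpu_timestamp_ matters.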

However, from my experiments, I think the CPU timestamp captured in the CPU SYNC event is not correct. Cache coherence in the system may not be strong enough to ensure that every update made in NpKit::CpuTimestampUpdateThread() reaches memory, so the CPU SYNC event may not see the most up-to-date value even though the code uses volatile throughout. Has your team noticed this problem, and do you have a way to resolve it? Thanks a lot!

yzygitzh commented 2 weeks ago

Hi,

We didn’t encounter the issue you mentioned when collecting CPU timestamps from the GPU. However, we did notice that a bad NUMA binding can make the CPU timestamp less up-to-date.

According to this code, which updates the latest FIFO pointer from CPU to GPU (https://github.com/Azure/msccl-executor-nccl/blob/main/src/transport/net.cc#L1102), writing through a volatile pointer allocated by ncclCudaHostCalloc should be a feasible way to make CPU memory writes visible to the GPU. Note, however, that this code uses two extra memory buffers, which enforce memory operation ordering but should not change whether a CPU write eventually becomes visible to the GPU.
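One way to check how stale a published timestamp is before involving the GPU is a CPU-only experiment. The sketch below is an analog, not NPKit code: TimestampPublisher and StalenessNs are hypothetical names, the shared slot is a std::atomic in ordinary host memory rather than a volatile pointer from ncclCudaHostCalloc, and the reader is another CPU thread rather than a GPU kernel. It only shows how the lag between "now" and the last published value can be measured.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a timestamp updater thread; not NPKit's code.
class TimestampPublisher {
 public:
  TimestampPublisher() : worker_([this] { Run(); }) {}
  ~TimestampPublisher() {
    stop_.store(true, std::memory_order_relaxed);
    worker_.join();
  }

  // Latest timestamp published by the updater thread, in nanoseconds.
  uint64_t Read() const { return latest_ns_.load(std::memory_order_acquire); }

  // Current monotonic time in nanoseconds.
  static uint64_t NowNs() {
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch())
            .count());
  }

 private:
  // Busy loop that republishes the clock; it occupies one core while running.
  void Run() {
    while (!stop_.load(std::memory_order_relaxed)) {
      latest_ns_.store(NowNs(), std::memory_order_release);
    }
  }

  // Declared before worker_ so they are initialized before the thread starts.
  std::atomic<uint64_t> latest_ns_{NowNs()};
  std::atomic<bool> stop_{false};
  std::thread worker_;
};

// How far the published value lags behind the reader's own clock reading.
inline uint64_t StalenessNs(const TimestampPublisher& pub) {
  const uint64_t seen = pub.Read();
  const uint64_t now = TimestampPublisher::NowNs();
  return now - seen;
}
```

Sampling StalenessNs repeatedly, with the updater and reader threads placed on different cores or NUMA nodes, gives a rough picture of how placement affects freshness on the CPU side. It says nothing by itself about visibility of host writes to the GPU.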

ZhiyiHu1999 commented 2 weeks ago

Thanks for the reply! Could you please elaborate on "some bad NUMA binding might make CPU timestamp less up-to-date"? I suspect my problem has the same cause. Also, is there an effective way to tackle the synchronization problem in such a system? Thanks a lot!