ZhiyiHu1999 opened 2 weeks ago
Hi,
Hi,
We didn't encounter the issue you mentioned when collecting CPU timestamps from the GPU. We did, however, notice that poor NUMA binding can make the CPU timestamp less up-to-date.
According to this code piece, which updates the latest FIFO pointer from CPU to GPU (https://github.com/Azure/msccl-executor-nccl/blob/main/src/transport/net.cc#L1102), writing through a `volatile` pointer allocated by `ncclCudaHostCalloc` should be a feasible way to make CPU memory writes visible to the GPU. Note, however, that the code uses two extra memory buffers; these enforce memory-operation ordering, but they should not change whether a given CPU write eventually becomes visible to the GPU.
Thanks for the reply! Could you please elaborate on "some bad NUMA binding might make the CPU timestamp less up-to-date"? I suspect my problem has the same cause. By the way, is there an effective way to tackle the synchronization problem in such a system? Thanks a lot!
Hello! In NPKit, we have a thread, `NpKit::CpuTimestampUpdateThread()`, that loops to update the CPU timestamp; the updated value is written through a pointer `cpu_timestamp_`. To synchronize between CPU and GPU, NPKit records a CPU SYNC event and a GPU SYNC event at almost the same time, documenting the values read from the pointer `cpu_timestamp_` and from `clock64()`. However, from my experiments, I think the CPU timestamp obtained in the CPU SYNC event is not the correct value: cache coherence in the system may not be strong enough to ensure that every update in `NpKit::CpuTimestampUpdateThread()` reaches memory, so we may not get the most up-to-date value in the CPU SYNC event even if we always use `volatile` in the code. Could I ask whether your team has noticed this problem, and do you have any way to settle it? Thanks a lot!