garrett361 opened 5 months ago
CC @jingxu10 @tye1, thank you!
Hello, thanks for reporting this issue. I will look into this issue and get back to you.
Thank you @YuningQiu , greatly appreciated!
Hello @garrett361, regarding the specific script mentioned in the GitHub issue: the overlap it relies on does not currently function on PVC.
How it operates on the A100 GPU:
Reasons for incompatibility with PVC:
Hi @YuningQiu , thank you for the very detailed response! I have a few follow-ups.
By default, the initiation of the second allreduce is implicitly delayed until the first allreduce is complete. At this point, several compute tasks but only one collective have been sent to the PVC.
1) Ah, you mean even the launch of the second allreduce kernel is delayed?
the destruction of the event at the end of the collective submission code snippet triggers an artificial wait for the collective to complete within the event destructor. This wait blocks the host thread from continuing.
2) And this means that the collective blocks any additional kernels being launched, irrespective of what Stream they were sent to?
non-dependent kernels from multiple streams are executed in the order they were submitted.
3) This means that kernels are executed in launch order regardless of what stream they are put into? If so, I don't understand the utility of Streams.
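To make question 3 concrete, this is the kind of independence I would expect Streams to express (a minimal sketch, assuming the torch.xpu stream API mirrors torch.cuda; the tensor names and sizes are purely illustrative):

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (provides the torch.xpu namespace)

a = torch.randn(4096, 4096, device="xpu")
b = torch.randn(4096, 4096, device="xpu")

s1 = torch.xpu.Stream()
s2 = torch.xpu.Stream()

# Two independent matmuls submitted to two different streams: the expectation
# is that the second submission does not wait for the first kernel to finish,
# leaving the device free to execute them concurrently.
with torch.xpu.stream(s1):
    out1 = a @ a
with torch.xpu.stream(s2):
    out2 = b @ b

torch.xpu.synchronize()
```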
Note: Even though oneCCL might use the copy command for data transfer by default, the copy and reduction operations are still interdependent. Therefore, the possibility of overlapping is restricted to the last compute task and a portion of the first allreduce.
4) I didn't quite understand this. What is the importance of the copy operation here with respect to overlapping?
Finally: I am a little confused about where in the stack the issue lies. Is there an obstruction to overlapping compute and comms at the hardware level? Or is it something in ipex, torch-ccl, or elsewhere?
And for more color, all of the above seems consistent with what I have seen from the pytorch profiler.
These are traces of a very similar workload where I attempted to overlap comms and compute for two iterations on cuda (A100) and xpu (1550).
cuda: both compute and comms operations launch kernels and return immediately on the host, as seen in the minuscule vertical lines preceding the cudaDeviceSynchronize.
xpu: compute launches kernels and returns immediately, but collectives block and span a long time period until the collective finishes.
I also isolated the xpu cases where I perform only the compute or the comms individually. The same effects can be seen.
Compute only:
Comms only:
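For reference, the traces above come from wrapping the loop in torch.profiler, roughly as follows (a minimal sketch rather than the exact script; the workload, tensor sizes, and file name are illustrative, and process-group initialization via torch-ccl is assumed to have already happened):

```python
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity
import intel_extension_for_pytorch as ipex  # noqa: F401  (provides torch.xpu)

# Assumes dist.init_process_group(backend="ccl") has been called beforehand (torch-ccl).
x = torch.randn(8192, 8192, device="xpu")
bucket = torch.randn(2 ** 26, device="xpu")
comm_stream = torch.xpu.Stream()

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(2):
        # Collective submitted on a side stream, compute on the default stream.
        with torch.xpu.stream(comm_stream):
            handle = dist.all_reduce(bucket, async_op=True)
        y = x @ x
        handle.wait()
    torch.xpu.synchronize()

prof.export_chrome_trace("xpu_overlap_trace.json")
```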
Hello @garrett361, thanks for providing more details. We will take them back and discuss internally. We will keep you posted with any updates.
Also, could you please share with us the PyTorch profiling file that you are showing above? Thanks a lot!
@YuningQiu hi, could you tell me why this was closed please?
I also see I never followed up with the profiling script, my apologies. I can do that next week.
Hi @garrett361, I heard that one of our teams from Intel has been directly in touch with you on this issue, and you also created an issue on the intel/torch-ccl GitHub repo. Do you want to keep this issue open? Thanks a lot!
Hi @YuningQiu yes, I had a very helpful chat with members of the team. We also said we’d track progress through these GitHub issues, so could you please reopen it?
I cross posted to torch-ccl since I wasn’t sure how that team and ipex interact, nor if they also track ipex issues.
Thanks!
Adding more traces of attempted overlap with other collectives, per Intel's request via direct communication. The results are all qualitatively similar:
All traces taken on Sunspot with versions: torch.__version__='2.1.0a0+cxx11.abi', ipex.__version__='2.1.10+xpu', torch_ccl.__version__='2.1.100+xpu'
All profiles created using the profile_comms_compute_overlap.py script here with different --collective args, and otherwise default values, on a single Sunspot (1550) node.
torch.distributed.all_gather
torch.distributed.all_gather_into_tensor
torch.distributed.reduce_scatter_tensor
torch.distributed.all_reduce
(I'm not sure why this one uses more streams than the above all_reduce trace. The previous trace was taken on a different machine from Sunspot.)
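For context, the --collective flag selects which of the calls above is submitted; a hypothetical sketch of that dispatch (not the actual script, and the helper name is made up):

```python
import torch
import torch.distributed as dist

def launch_collective(name: str, tensor: torch.Tensor, world_size: int):
    """Hypothetical dispatch from a --collective string to an async collective."""
    if name == "all_reduce":
        return dist.all_reduce(tensor, async_op=True)
    if name == "all_gather":
        outs = [torch.empty_like(tensor) for _ in range(world_size)]
        return dist.all_gather(outs, tensor, async_op=True)
    if name == "all_gather_into_tensor":
        out = torch.empty(world_size * tensor.numel(), dtype=tensor.dtype, device=tensor.device)
        return dist.all_gather_into_tensor(out, tensor, async_op=True)
    if name == "reduce_scatter_tensor":
        out = torch.empty(tensor.numel() // world_size, dtype=tensor.dtype, device=tensor.device)
        return dist.reduce_scatter_tensor(out, tensor, async_op=True)
    raise ValueError(f"unknown collective: {name}")
```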
Describe the bug
Communication and computation do not appear to overlap when launching kernels in different xpu.Streams (on Intel GPU Max 1550s). Being able to overlap communication and computation is crucial for efficiency. DeepSpeed and FSDP both use Stream objects for this purpose, for instance.

To test this, I am launching communication and compute in various permutations of using Streams or not. Driver code which operates on both xpu and cuda:
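The full driver script is not reproduced inline; the idea is roughly the following (a minimal sketch under stated assumptions: the torch.xpu API mirrors torch.cuda, the process group is initialized externally with the appropriate backend, and the tensor sizes are illustrative):

```python
import time

import torch
import torch.distributed as dist

try:
    import intel_extension_for_pytorch as ipex  # noqa: F401  (provides torch.xpu)
except ImportError:
    ipex = None

# Assumes dist.init_process_group has been called with backend="nccl" (cuda)
# or backend="ccl" via torch-ccl (xpu), e.g. by an mpirun/torchrun launcher.
device_type = "xpu" if ipex is not None and torch.xpu.is_available() else "cuda"
dev = torch.xpu if device_type == "xpu" else torch.cuda

x = torch.randn(8192, 8192, device=device_type)
bucket = torch.randn(2 ** 26, device=device_type)

def run(use_side_stream: bool) -> float:
    """Time several matmuls plus an all_reduce, optionally submitting the collective on a side stream."""
    comm_stream = dev.Stream()
    dev.synchronize()
    start = time.perf_counter()
    if use_side_stream:
        with dev.stream(comm_stream):
            handle = dist.all_reduce(bucket, async_op=True)
    else:
        handle = dist.all_reduce(bucket, async_op=True)
    y = x
    for _ in range(8):
        y = y @ x  # compute submitted to the default stream
    handle.wait()
    dev.synchronize()
    return time.perf_counter() - start

for use_side_stream in (False, True):
    elapsed = run(use_side_stream)
    print(f"use_side_stream={use_side_stream}: {elapsed:.4f} s")
```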
Running the above on two A100s, I get:
Running on two Intel GPU Max 1550s, I get:
A clear speed-up can be seen when using Streams in their various permutations on A100s, while no speedup is visible on xpu. Absolute timings are not included above, but I have verified that the individual compute and comms times are comparable to each other in all cases.

Is this expected? Is there anything clearly wrong with the test code? The SYCL docs seem to imply that overlap should be possible. Are there any relevant environment variables that I might need to set?
Versions