laserkelvin opened this issue 1 year ago (status: Open)
@jingxu10 @tye1 pls help.
Thanks, @laserkelvin. This is documented on the IPEX side; see https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html. Note: "Please import intel_extension_for_pytorch right after import torch, prior to importing other packages."
We will update the torch-ccl README to emphasize this requirement too.
Problem
When using `oneccl_bindings_for_pytorch` with `intel_extension_for_pytorch` (including Intel GPU support), the order of the import statements matters for functionality, and this does not seem to be documented in the repository or anywhere else I have found. `intel_extension_for_pytorch` needs to be imported before `oneccl_bindings_for_pytorch`; otherwise the collectives for GPU will not be recognized.
Minimum example to reproduce
Below is a minimum working example that demonstrates the error: `oneccl_bindings_for_pytorch` is imported before IPEX, and the script throws an error saying that `allgather` is not implemented on `[xpu]`. The script is launched with `mpirun -n 4 -genvall -bootstrap ssh python ccl_test.py`.
The error:
This will also trigger for other collectives (e.g. `allgather`). The code will run successfully if you import IPEX first, followed by oneCCL.
Proposed solution
Please add documentation regarding this behavior. It is actually expected, since IPEX and oneCCL act on `torch` dynamically, but it is not documented and may confuse users.
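Until the README is updated, a script could guard against the wrong order itself. The following is only a minimal sketch, not part of IPEX or oneCCL: the helper names `import_order_ok` and `warn_on_bad_import_order` are hypothetical, and the check assumes that `sys.modules` keeps modules in the order they were first imported (true for CPython's insertion-ordered dict, but it can be misleading if a package imports the other one internally or if modules are reloaded).

```python
import sys
import warnings

IPEX = "intel_extension_for_pytorch"
ONECCL = "oneccl_bindings_for_pytorch"


def import_order_ok(modules=None):
    """Check whether IPEX was imported before the oneCCL bindings.

    Returns True or False when both modules are loaded, and None when
    either one is missing (in which case the order cannot be judged).
    `modules` defaults to sys.modules; any mapping keyed by module name
    in import order can be passed instead (an assumption of this sketch).
    """
    names = list(sys.modules if modules is None else modules)
    if IPEX not in names or ONECCL not in names:
        return None
    return names.index(IPEX) < names.index(ONECCL)


def warn_on_bad_import_order(modules=None):
    """Emit a warning if oneCCL appears to have been imported before IPEX."""
    if import_order_ok(modules) is False:
        warnings.warn(
            f"{ONECCL} was imported before {IPEX}; "
            "XPU collectives may not be registered. "
            f"Import torch, then {IPEX}, then {ONECCL}."
        )


# Intended order at the top of a script such as ccl_test.py:
#   import torch
#   import intel_extension_for_pytorch
#   import oneccl_bindings_for_pytorch
#   warn_on_bad_import_order()
```

Calling `warn_on_bad_import_order()` right after the imports gives an early hint instead of a later "not implemented on `[xpu]`" failure inside a collective.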