intel / torch-ccl

oneCCL Bindings for Pytorch*

Ordering of Intel extension imports not documented #44

Open laserkelvin opened 1 year ago

laserkelvin commented 1 year ago

Problem

When using oneccl_bindings_for_pytorch together with intel_extension_for_pytorch with Intel GPU support, the ordering of the import statements matters for functionality, and this does not appear to be documented in the repository or anywhere else I have found.

intel_extension_for_pytorch must be imported before oneccl_bindings_for_pytorch, otherwise the GPU collectives will not be recognized.

Minimum example to reproduce

Below is a minimum working example that demonstrates the error: oneccl_bindings_for_pytorch is imported before IPEX, and an error is thrown saying that allreduce is not implemented on backend [xpu]. The script is launched with mpirun -n 4 -genvall -bootstrap ssh python ccl_test.py.

import os

import torch
import torch.distributed as dist

# Deliberately wrong order: oneCCL bindings imported before IPEX
import oneccl_bindings_for_pytorch
import intel_extension_for_pytorch as ipex

rank = int(os.environ["PMI_RANK"])
world_size = int(os.environ["PMI_SIZE"])

torch.manual_seed(rank)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "21616"

group = dist.init_process_group(backend="ccl")

# generate random data on XPU
data = torch.rand(16, 8, device=f"xpu:{rank}")
if dist.get_rank() == 0:
    print(f"Initializing XPU data for rank {rank}")
    print(data)
    print(f"Performing all reduce for {world_size} ranks")

dist.all_reduce(data)
dist.barrier()
if dist.get_rank() == 0:
    print(f"All reduce done")
    print(data)

The error:

Performing all reduce for 4 ranks
Traceback (most recent call last):
  File "ccl_test.py", line 42, in <module>
    dist.all_reduce(data)
  File ".../lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: oneccl_bindings_for_pytorch: allreduce isn't implementd on backend [xpu].

This also triggers for other collectives (e.g. allgather). The code runs successfully if IPEX is imported first, followed by oneCCL, as in the sketch below.
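For reference, a minimal sketch of the working preamble (the rest of ccl_test.py, the environment variables, and the mpirun invocation stay the same as above):

import os

import torch
import intel_extension_for_pytorch as ipex  # IPEX first, right after torch
import oneccl_bindings_for_pytorch          # oneCCL bindings afterwards
import torch.distributed as dist

# remainder unchanged: init_process_group(backend="ccl"),
# torch.rand(..., device=f"xpu:{rank}"), dist.all_reduce(data), ...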

Proposed solution

Please add documentation regarding this behavior: it is presumably expected, since both IPEX and the oneCCL bindings register themselves with torch dynamically at import time, but it is not documented and may confuse users.

gujinghui commented 1 year ago

@jingxu10 @tye1 pls help.

tye1 commented 1 year ago

Thanks, @laserkelvin. This has been documented on the IPEX side; see https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/getting_started.html, which notes: "Please import intel_extension_for_pytorch right after import torch, prior to importing other packages."
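In terms of the snippet in this issue, that recommendation amounts to an import order like the following (a sketch based on the quoted note):

import torch
import intel_extension_for_pytorch   # immediately after torch
import oneccl_bindings_for_pytorch   # other packages afterwards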

We will update torch-ccl README to emphasize this requirement too.