Flaky Seg Faults with AllReduce

intel / torch-ccl

oneCCL Bindings for Pytorch*

BSD 3-Clause "New" or "Revised" License

#!/usr/bin/env python
import os
import sys
import torch
import torch.distributed as dist
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch as torch_ccl

def get_device():
    return 'xpu:%s' % (dist.get_rank() % torch.xpu.device_count(),)

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
os.environ["RANK"] = os.environ.get("PMI_RANK")
os.environ["WORLD_SIZE"] = os.environ.get("PMI_SIZE")

dist.init_process_group(backend="ccl", init_method="env://")

n = 1024*1024
tensor = torch.zeros(n, dtype=torch.float32, device=get_device())

# Perform an all_reduce to initialize communicators and such.
dist.all_reduce(tensor)
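One thing to watch in the script: assigning os.environ.get("PMI_RANK") straight into os.environ raises a TypeError when the script is started without a launcher that sets PMI_RANK and PMI_SIZE, because environment values must be strings. Below is a minimal sketch of a more forgiving lookup. The helper name resolve_rank_and_world_size is hypothetical, and falling back to a single-process run (rank 0, world size 1) is my assumption, not something the bindings require.

```python
import os


def resolve_rank_and_world_size(env=None):
    """Return (rank, world_size) as strings read from the PMI_* variables.

    Falls back to ("0", "1") when the variables are absent. That fallback
    is an assumption for illustration (single-process run).
    """
    env = os.environ if env is None else env
    rank = env.get("PMI_RANK", "0")
    world_size = env.get("PMI_SIZE", "1")
    return rank, world_size


# Populate the variables that init_method="env://" reads.
rank, world_size = resolve_rank_and_world_size()
os.environ["RANK"] = rank
os.environ["WORLD_SIZE"] = world_size
```

With this in place the same script can be run under mpirun or directly with python for a quick single-rank smoke test.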

I was never able to get a functional setup using the Conda instructions from the AI Tools Selector. I had better luck creating a Conda environment, then installing the wheels distributed by Intel directly with pip.

# Create a new Conda environment.
conda create -n ipex python=3.10
conda activate ipex

# Install binary distributions of PyTorch, IPEX, and oneCCL.
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.3.1%2Bcxx11.abi-cp310-cp310-linux_x86_64.whl
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.3.110%2Bxpu-cp310-cp310-linux_x86_64.whl
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.3.100%2Bxpu-cp310-cp310-linux_x86_64.whl

# Source the oneCCL and Intel MPI environments, which should already be installed at the system level.
# The oneCCL/Intel MPI versions should be ones validated against the corresponding version of IPEX/Torch CCL.
# I'm not sure where that compatibility information is published, but 2021.13 *should* work with PyTorch/IPEX/Torch CCL 2.3.110.
source /opt/intel/oneapi/ccl/2021.13/env/vars.sh
source /opt/intel/oneapi/mpi/2021.13/env/vars.sh
# Your OpenCL vendors environment may have been overwritten by Conda. Reset it to the system-level OpenCL vendors directory.
export OCL_ICD_VENDORS=/etc/OpenCL/vendors
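
Before launching anything distributed, it can help to confirm that the environment actually matches the steps above. The following is a rough sketch using only the Python standard library. The distribution names and versions are inferred from the wheel filenames in the pip commands (an assumption on my part), and the helper names are hypothetical.

```python
import os
from importlib import metadata

# Versions taken from the wheel URLs above; distribution names are
# inferred from the wheel filenames (an assumption).
EXPECTED = {
    "torch": "2.3.1",
    "intel_extension_for_pytorch": "2.3.110",
    "oneccl_bind_pt": "2.3.100",
}


def installed_version(dist_name):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def public_version(version):
    """Strip a local suffix such as '+xpu' or '+cxx11.abi'."""
    return version.split("+", 1)[0]


def check_environment():
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for name, expected in EXPECTED.items():
        found = installed_version(name)
        if found is None:
            problems.append(f"{name} is not installed")
        elif public_version(found) != expected:
            problems.append(f"{name} is {found}, expected {expected}")
    vendors = os.environ.get("OCL_ICD_VENDORS")
    if not vendors or not os.path.isdir(vendors):
        problems.append("OCL_ICD_VENDORS is unset or not a directory")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated Conda environment after sourcing the vars.sh scripts. No output means none of these particular checks failed. It does not prove that the oneCCL and IPEX versions are compatible.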

Simple examples like the above should now work. Along the way I hit errors about the transformers package being missing (resolved with pip install transformers) and a warning about CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK. Setting CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0 seemed to resolve the warning, and I then got roughly the bandwidth I would expect from Xe Link.
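
For anyone who wants to reproduce the bandwidth check, here is a small sketch of how the variable can be set from Python and how a timed all_reduce can be turned into a bandwidth figure. I am assuming that the variable is read when the communicator is created, so it has to be set before dist.init_process_group. The bus-bandwidth factor 2*(n-1)/n is the convention used by the nccl-tests benchmarks for all-reduce, borrowed here as an assumption. The helper name is hypothetical.

```python
import os

# Assumption: oneCCL reads this when communicators are created, so set it
# before calling dist.init_process_group(backend="ccl", ...).
os.environ.setdefault("CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK", "0")


def allreduce_bandwidth_gbps(num_bytes, seconds, world_size):
    """Return (algorithm_bw, bus_bw) in GB/s for one all_reduce.

    algorithm_bw is simply bytes / time. bus_bw scales it by
    2 * (n - 1) / n, the all-reduce factor used by nccl-tests.
    """
    algorithm_bw = num_bytes / seconds / 1e9
    bus_bw = algorithm_bw * 2 * (world_size - 1) / world_size
    return algorithm_bw, bus_bw


# Usage sketch: time dist.all_reduce(tensor) with a device synchronize
# before and after, then pass tensor.numel() * tensor.element_size(),
# the elapsed seconds, and dist.get_world_size() to the helper.
```

Average over several iterations after a warm-up call, since the first all_reduce also pays for communicator setup.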

Flaky Seg Faults with AllReduce #73