BenBrock closed this issue 1 month ago
I was never able to get a functional setup using the Conda instructions from the AI Tools Selector. I had better luck creating a Conda environment, then installing the wheels distributed by Intel directly with pip.
# Create a new Conda environment.
conda create -n ipex python=3.10
conda activate ipex
# Install binary distributions of PyTorch, IPEX, and oneCCL.
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.3.1%2Bcxx11.abi-cp310-cp310-linux_x86_64.whl
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.3.110%2Bxpu-cp310-cp310-linux_x86_64.whl
pip install https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.3.100%2Bxpu-cp310-cp310-linux_x86_64.whl
# Source oneCCL and Intel MPI, which should have been previously installed at the system level.
# oneCCL/Intel MPI versions should be validated to work with corresponding version of IPEX/Torch CCL.
# Where you find this information, I'm not sure, but 2021.13 *should* work with PyTorch/IPEX/Torch CCL 2.3.110.
source /opt/intel/oneapi/ccl/2021.13/env/vars.sh
source /opt/intel/oneapi/mpi/2021.13/env/vars.sh
# Your OpenCL vendors environment may have been overwritten by Conda. Reset it to the system-level OpenCL vendors directory.
export OCL_ICD_VENDORS=/etc/OpenCL/vendors
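The last few steps fail quietly if a path is wrong, so a guarded variant may help. This is a minimal sketch, not part of Intel's instructions: `source_if_present` and `count_icd_files` are hypothetical helper names, and the paths are the ones used above, which assume a system-level oneAPI install under /opt/intel/oneapi.

```shell
# Hypothetical helper: source a vars.sh script only if it exists,
# and report a missing one on stderr instead of failing silently.
source_if_present() {
    script="$1"
    set --   # clear the function's arguments so the sourced script does not inherit them
    if [ -f "$script" ]; then
        . "$script"
    else
        echo "missing: $script" >&2
        return 1
    fi
}

# Hypothetical helper: print how many *.icd files a vendors directory holds
# (prints 0 if the directory is absent or empty).
count_icd_files() {
    n=0
    for f in "$1"/*.icd; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Same paths as in the steps above; "|| true" lets both checks run even if one path is missing.
source_if_present /opt/intel/oneapi/ccl/2021.13/env/vars.sh || true
source_if_present /opt/intel/oneapi/mpi/2021.13/env/vars.sh || true

export OCL_ICD_VENDORS=/etc/OpenCL/vendors
echo "ICD files visible in $OCL_ICD_VENDORS: $(count_icd_files "$OCL_ICD_VENDORS")"
```

A count of 0 would suggest the OpenCL vendors directory is not where this sketch assumes it is.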
Simple examples like the one in the original report should now work. Along the way I hit errors about the transformers package being missing (which I resolved with pip install transformers) as well as the warning about CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK. Setting CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0 seemed to resolve the issue, and I got roughly the bandwidth I would expect from Xe Link.
I am encountering flaky seg faults (about once every 10 executions) for a very simple PyTorch example that I believe should work correctly. I am running on a system with 4 Intel GPU Max 1550 GPUs (8 PVC tiles).
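To put a number on a failure rate like "about once every 10 executions", the command can be run in a loop and the non-zero exits counted. This is a minimal sketch: `run_repeatedly` is a hypothetical helper name, and the launch command in the final comment is a placeholder, not the command used in this report. It counts every non-zero exit, which includes segmentation faults but is not limited to them.

```shell
# Hypothetical helper: run_repeatedly N CMD [ARGS...]
# Runs CMD N times and prints how many runs exited with a non-zero status.
# Output of CMD is discarded here; redirect to a log file instead to keep crash messages.
run_repeatedly() {
    runs="$1"
    shift
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Placeholder usage (example.py and the rank count are assumptions):
# run_repeatedly 50 mpirun -n 8 python example.py
```

Reporting the count alongside the total number of runs makes it easier to tell whether a change to the environment actually affects the failure rate.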
I installed Torch CCL using Conda, as advised by the Intel AI Tools Selector. (AI Tools, Customize, Conda, Python 3.10, Intel Extension for PyTorch (GPU))
My environment is default except for the following changes:
- ZE_AFFINITY_MASK=0,1,2,3,4,5,6,7 so that all 8 PVC tiles are visible to every process.
- CCL_ZE_IPC_EXCHANGE, which is set to sockets by default.
- CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0 to avoid using PCIe, as prompted by CCL warnings.

As mentioned, I'm running on a system with 8 PVC tiles. The oneAPI runtime, Intel MPI, etc. are all installed by Conda.
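The environment changes listed above could be captured in a short script, so the exact settings are easy to reproduce and to paste into a report. This is a sketch under one assumption: CCL_ZE_IPC_EXCHANGE is deliberately not exported, so the script only prints whatever value, if any, is already in effect.

```shell
# Settings taken from the report: expose all 8 PVC tiles to every process
# and disable the fabric vertex connection check.
export ZE_AFFINITY_MASK=0,1,2,3,4,5,6,7
export CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0

# Print every Level Zero (ZE_*) and oneCCL (CCL_*) variable currently set,
# sorted, so the output can be pasted into a bug report.
env | grep -E '^(ZE_|CCL_)' | sort
```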
Please let me know what I should do to troubleshoot this issue.