oneapi-src / oneAPI-samples

Samples for Intel® oneAPI Toolkits
https://oneapi-src.github.io/oneAPI-samples/
MIT License

RuntimeError: could not create an engine #2379

Open aruhela opened 1 week ago

aruhela commented 1 week ago

Hi Intel Team,

I am seeing a "could not create an engine" error when running the demo.py example from the "oneCCL Bindings for PyTorch Getting Started Sample". The code is run on a Sapphire node with 4 PVCs on a TACC system. Any suggestions for identifying the cause and fixing it?

(base) c551-003pvc$ mpirun -n 2 -l python demo.py -dev xpu
[0] Runing Iteration: 0 on device xpu:0
[0] Runing forward: 0 on device xpu:0
[0] Traceback (most recent call last):
[0]   File "/scratch/05231/aruhela/demo.py", line 67, in <module>
[1] Runing Iteration: 0 on device xpu:1
[1] Runing forward: 0 on device xpu:1
[1] Traceback (most recent call last):
[1]   File "/scratch/05231/aruhela/demo.py", line 67, in <module>
[0]     res = model(input)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1]     res = model(input)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[0]     return self._call_impl(*args, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1]     return self._call_impl(*args, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[0]     return forward_call(*args, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
[1]     return forward_call(*args, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
[0]     else self._run_ddp_forward(*inputs, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
[1]     else self._run_ddp_forward(*inputs, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
[0]     return self.module(*inputs, **kwargs)  # type: ignore[index]
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1]     return self.module(*inputs, **kwargs)  # type: ignore[index]
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[0]     return self._call_impl(*args, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1]     return self._call_impl(*args, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[0]     return forward_call(*args, **kwargs)
[0]   File "/scratch/05231/aruhela/demo.py", line 26, in forward
[1]     return forward_call(*args, **kwargs)
[1]   File "/scratch/05231/aruhela/demo.py", line 26, in forward
[0]     return self.linear(input)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[1]     return self.linear(input)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
[0]     return self._call_impl(*args, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[1]     return self._call_impl(*args, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
[0]     return forward_call(*args, **kwargs)
[0]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
[1]     return forward_call(*args, **kwargs)
[1]   File "/scratch/projects/compilers/intel24.2/oneapi/intelpython/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
[0]     return F.linear(input, self.weight, self.bias)
[0] RuntimeError: could not create an engine
[1]     return F.linear(input, self.weight, self.bias)
[1] RuntimeError: could not create an engine
(base) c551-003pvc$

Notes:
oneAPI release is 2024.2
Install command (AI Selector Tool):
conda install -c intel -c conda-forge --override-channels intel/label/oneapi::intel-extension-for-pytorch=2.1.20 intel/label/oneapi::pytorch=2.1.0 intel/label/oneapi::oneccl_bind_pt=2.1.200 intel/label/oneapi::torchvision=0.16.0 intel/label/oneapi::torchaudio=2.1.0 conda-forge::deepspeed=0.14.0 python=3.9
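
To record which versions are actually present in the active environment (as opposed to what the install command requested), a minimal sketch like the one below could be run on the node. It only reads installed package metadata with the standard-library `importlib.metadata`; it does not import PyTorch or touch the GPUs, and it does not diagnose the error itself. The distribution names in `PACKAGES` are assumptions derived from the conda package names above (for example, the conda package `pytorch` is assumed to install the distribution `torch`), and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Assumed distribution names, derived from the conda packages in the
# install command; adjust if the environment registers them differently.
PACKAGES = [
    "torch",
    "intel_extension_for_pytorch",
    "oneccl_bind_pt",
    "torchvision",
    "torchaudio",
    "deepspeed",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not registered in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this once per MPI rank (for example under the same `mpirun -n 2 -l python ...` launch) would show whether both ranks resolve the same environment.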

Thanks,
Amit Ruhela