intel / torch-ccl

oneCCL Bindings for Pytorch*
BSD 3-Clause "New" or "Revised" License

demo.py segment fault #37

Open mycprotein opened 1 year ago

mycprotein commented 1 year ago

I set up the environment like this:

sudo apt install openmpi-common openmpi-bin
pip install torch==1.12
pip install oneccl_bind_pt==1.12.0 -f https://developer.intel.com/ipex-whl-stable
source <oneccl_bindings_for_pytorch_path>/env/setvars.sh

Then I ran demo.py, both with python demo.py and with mpirun, and in both cases I got a segmentation fault. I may be misusing something, but I cannot find detailed documentation. How can I make it work?
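For anyone trying to narrow down where the crash happens, one option is Python's standard `faulthandler` module, which prints the Python stack when the process receives a fatal signal such as SIGSEGV. The sketch below is only a debugging aid, not a fix; the helper name `enable_crash_traces` is made up for illustration, and the contents of demo.py are not reproduced here.

```python
import faulthandler
import sys


def enable_crash_traces() -> bool:
    """Ask Python to print its stack on fatal signals such as SIGSEGV."""
    try:
        faulthandler.enable(file=sys.stderr, all_threads=True)
    except (AttributeError, OSError, RuntimeError, ValueError):
        # stderr has no usable file descriptor (for example, it is redirected
        # to an in-memory buffer), so traces cannot be written.
        return False
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_traces()
    # The body of demo.py would follow here (hypothetical placement).
```

The same effect is available without editing the script by running python -X faulthandler demo.py.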

mycprotein commented 1 year ago

I have checked that I use torch==1.12.0.

zhouyu5 commented 1 year ago

I had the exact same problem. I hope it will be fixed soon, thanks.

zxd1997066 commented 1 year ago

For release 1.12.0, PyTorch comes in two builds: one with the +cpu suffix and one without. You can download torch-1.12.0+cpu from https://download.pytorch.org/whl/cpu/torch/, and it works with that build. [screenshot]
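To see which build is actually installed, a small check along these lines may help. It is a sketch that reads package metadata only and does not import torch; the helper names `local_build_tag` and `describe_installed` are made up for illustration, and the package names are the ones used in the pip commands in this thread.

```python
from importlib import metadata


def local_build_tag(version: str) -> str:
    """Return the local version label after '+', or '' if there is none."""
    _, _, tag = version.partition("+")
    return tag


def describe_installed(package: str) -> str:
    """Report the installed version of a package, or note that it is missing."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    tag = local_build_tag(version) or "no local tag"
    return f"{package}: {version} ({tag})"


if __name__ == "__main__":
    # Package names taken from the pip commands earlier in this thread.
    for name in ("torch", "oneccl_bind_pt"):
        print(describe_installed(name))
```

A version string such as 1.12.0+cpu yields the tag cpu, while a plain 1.12.0 yields no tag.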

zhouyu5 commented 1 year ago

Thanks for replying @zxd1997066

Actually, we are using the CPU build of PyTorch from the oneAPI Docker images, for example:

docker pull intel/oneapi-aikit:2022.3.0-devel-ubuntu18.04

So I can guarantee that we are using a CPU build of PyTorch.

The pictures above show that it works on a single node, but I suggest you try a distributed run, that is, launching it with the MPI tool set as illustrated on the GitHub homepage, with a command like:

mpirun -n <N> -ppn <PPN> -f <hostfile> python example.py

My colleagues and I got the same segmentation fault error. Could you please give it a try?

Thank you very much!
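When comparing setups, it may also be worth checking what each process sees when launched through mpirun. As far as I know, Hydra-style launchers (Intel MPI, MPICH) export PMI_RANK and PMI_SIZE, while Open MPI's mpirun exports OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE. The sketch below only reads those variables; the helper name `rank_and_world_size` is made up for illustration, and nothing here is established as the cause of the crash.

```python
import os
from typing import Mapping, Tuple


def rank_and_world_size(env: Mapping[str, str]) -> Tuple[int, int]:
    """Read rank and world size from launcher-provided environment variables."""
    candidates = (
        ("PMI_RANK", "PMI_SIZE"),  # Hydra-style launchers (Intel MPI, MPICH)
        ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI
    )
    for rank_key, size_key in candidates:
        if rank_key in env and size_key in env:
            return int(env[rank_key]), int(env[size_key])
    # No launcher variables found: treat the process as a single-rank run.
    return 0, 1


if __name__ == "__main__":
    rank, world_size = rank_and_world_size(os.environ)
    print(f"rank={rank} world_size={world_size}")
```

Running this under the same mpirun command as the failing script shows whether every process reports a distinct rank and the expected world size.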

zxd1997066 commented 1 year ago

I am sorry, I cannot reproduce it. [two screenshots]

zigzagcai commented 1 year ago

Hello, I have tested with docker pull intel/oneapi-aikit:2022.3.1-devel-ubuntu18.04. The same error occurred.

zigzagcai commented 1 year ago

> I am sorry, I cannot reproduce it. [two screenshots]

Hi bro, which Docker base image are you using? I tried the oneapi-aikit base image, but it reported an error.

zxd1997066 commented 1 year ago

> > I am sorry, I cannot reproduce it. [two screenshots]
>
> Hi bro, which Docker base image are you using? I tried the oneapi-aikit base image, but it reported an error.

I just ran it in conda, so I guess there may be something wrong with that Docker image. This needs further investigation.