pytorch / ELF

ELF: a platform for game research with AlphaGoZero/AlphaZero reimplementation

Illegal instruction (core dumped) happens when importing rlpytorch in container #141

Closed: Breath123 closed this issue 5 years ago

Breath123 commented 5 years ago

Hello. First, I built the image with the Dockerfile in the project root and ran it. Second, I followed the "Training a Go bot" section of the README. When I executed step 5, the script start_server.sh called python (2.7) rather than python3.7, so I changed start_server.sh to use python3.7. When I re-executed step 5, I got the error "Illegal instruction (core dumped)" when train.py imports rlpytorch.
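Since start_server.sh silently ran Python 2.7, one option is a guard at the top of the training script that fails fast with a readable message. This is a minimal sketch, not part of ELF: `require_python` is a hypothetical helper, and the (3, 7) minimum is an assumption taken from the python3.7 used in this thread. The code deliberately avoids f-strings and annotations so that a Python 2.7 interpreter can still parse the file and reach the check.

```python
import sys

# Hypothetical guard, not part of ELF; (3, 7) mirrors the python3.7 used in this thread.
REQUIRED = (3, 7)


def require_python(required=REQUIRED, version_info=None):
    """Exit with a readable message if the interpreter is older than `required`.

    Uses str.format instead of f-strings so Python 2.7 can parse this file
    and print the message instead of failing with a SyntaxError.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < tuple(required):
        sys.exit(
            "This script needs Python {0}.{1}+, but {2} is Python {3}.{4}".format(
                required[0], required[1], sys.executable, current[0], current[1]
            )
        )
    return current


if __name__ == "__main__":
    print("Running under Python {0}.{1}".format(*require_python()))
```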

The host environment I use: a Dell server with 20 cores, host OS version Red Hat 4.8.3-9, and one Tesla V100 GPU.

Could you help investigate this issue? Thanks a lot.

qucheng commented 5 years ago

You need to compile the Python library with Python 3.7 too.
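The point here is that the compiled extension has to be built against the same interpreter that imports it. As a sketch (using the standard-library sysconfig module, not ELF's build scripts), the snippet below prints the header directory, Python library and extension-module filename suffix of whichever python runs it, so they can be compared with what the build actually used; `python_build_info` is a hypothetical helper name.

```python
import sys
import sysconfig


def python_build_info():
    """Collect the paths a native extension build should point at.

    Every value comes from the running interpreter, so run this with the
    same python (e.g. python3.7 from the conda `base` env) that will import
    the compiled module.
    """
    return {
        "executable": sys.executable,
        "version": "{0}.{1}".format(*sys.version_info[:2]),
        "include_dir": sysconfig.get_paths()["include"],  # Python.h lives here
        "lib_dir": sysconfig.get_config_var("LIBDIR"),  # may be None on some platforms
        "library": sysconfig.get_config_var("LDLIBRARY"),  # e.g. a libpython file name
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),  # extension filename suffix
    }


if __name__ == "__main__":
    for key, value in python_build_info().items():
        print(f"{key:12} {value}")
```

Running it once under python2.7 and once under python3.7 shows how different the two targets are.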

Breath123 commented 5 years ago

@qucheng Thank you for answering. I found that the Python version was not 3.7 because the line RUN bash -c "source activate base && make -j4" in the Dockerfile does not take effect when I run "docker run -it elf". So I executed "source activate base && make -j4" after starting the elf container, and python now links to Python 3.7. But the same issue described above still happens. How do I compile the Python library? I have already tried both "make" and "make -j4".

qucheng commented 5 years ago

Can you paste the full error? You can specify pythonpath and pythonlib when running make. Also, gcc/g++ need to be 7.x.
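The requirement that gcc/g++ be 7.x can be checked before building. This is a sketch under the assumption that the compilers are invoked as `gcc` and `g++` from PATH; the ELF build may select a different compiler through its own settings, which this does not inspect. The helper names are hypothetical.

```python
import shutil
import subprocess


def parse_major(version_text):
    """Parse '7', '7.5.0' or '4.8.3' (as printed by -dumpversion) into an int."""
    return int(version_text.strip().split(".")[0])


def compiler_major_version(compiler="gcc"):
    """Return the major version of `compiler`, or None if it is not on PATH."""
    if shutil.which(compiler) is None:
        return None
    out = subprocess.run(
        [compiler, "-dumpversion"], capture_output=True, text=True, check=True
    ).stdout
    return parse_major(out)


def check_toolchain(required_major=7):
    """Map each tool to (major version or None, whether it matches required_major)."""
    report = {}
    for tool in ("gcc", "g++"):
        major = compiler_major_version(tool)
        report[tool] = (major, major == required_major)
    return report


if __name__ == "__main__":
    for tool, (major, ok) in check_toolchain().items():
        print(f"{tool}: major={major} {'OK' if ok else 'needs 7.x'}")
```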

Breath123 commented 5 years ago

@qucheng Here are the commands I used and their outputs:

  1. Run the container: docker run --runtime=nvidia --name elf -it elf bash
  2. Set PYTHONPATH in the container: source scripts/devmode_set_pythonpath.sh
  3. Use Python 3.7: source activate base
  4. Check the environment:
     (base) root@fbcb15f0af15:/go-elf/ELF/scripts/elfgames/go# echo $PYTHONPATH
     /go-elf/ELF/src_py/:/go-elf/ELF/build/elf/:/go-elf/ELF/build/elfgames/go/:
     (base) root@fbcb15f0af15:/go-elf/ELF/scripts/elfgames/go# which python
     /root/miniconda3/bin/python
     (base) root@fbcb15f0af15:/go-elf/ELF/scripts/elfgames/go# python --version
     Python 3.7.1
  5. Build: make
  6. Try to import rlpytorch: cd scripts/elfgames/go/, then python -c "import rlpytorch"

The "Illegal instruction" error then occurs, with this log:

     (base) root@fbcb15f0af15:/go-elf/ELF/scripts/elfgames/go# python -c "import rlpytorch"
     Illegal instruction (core dumped)

It also generated a core dump file named core.30197, which I uploaded to https://pan.baidu.com/s/1JWs9F30yhMGRhs1dotzwZw. In addition, you can pull the Docker image I used from here.
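"Illegal instruction (core dumped)" means the interpreter was killed by SIGILL: the CPU reached an instruction it cannot execute, which points at compiled code rather than Python source. A sketch for isolating which import triggers it is to run each import in a fresh child process with Python's built-in faulthandler enabled (-X faulthandler), which dumps the Python stack when SIGILL arrives. `probe_import` and `describe_exit` are hypothetical helpers, and the module names to try are passed on the command line.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short diagnosis."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        name = signal.Signals(-returncode).name
        if name == "SIGILL":
            return "killed by SIGILL (Illegal instruction)"
        return f"killed by {name}"
    return f"failed (exit code {returncode})"


def probe_import(module, python=sys.executable):
    """Import `module` in a child interpreter so a crash cannot kill this process.

    Returns (diagnosis, stderr); with -X faulthandler, stderr holds the
    Python traceback printed when a fatal signal such as SIGILL arrives.
    """
    proc = subprocess.run(
        [python, "-X", "faulthandler", "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return describe_exit(proc.returncode), proc.stderr


if __name__ == "__main__":
    for mod in sys.argv[1:] or ["rlpytorch"]:
        diagnosis, stderr = probe_import(mod)
        print(f"{mod}: {diagnosis}")
        if diagnosis != "ok":
            print(stderr)
```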

Breath123 commented 5 years ago

@qucheng Thanks a lot for your kind help. I switched to another server and ran the container with the host network, and the issue disappeared, though I don't know what made it work. I will close this issue.
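The thread never establishes why switching servers helped. One common cause of a SIGILL that appears on one machine but not another is a binary compiled to use CPU instructions (for example AVX2 or AVX-512) that the older CPU lacks; this is only a hypothesis, not something confirmed here. A sketch for comparing the two servers lists which of a few such flags a Linux host advertises in /proc/cpuinfo (a container sees the host CPU there); the flag list and helper names are illustrative assumptions.

```python
from pathlib import Path

# Illustrative subset of x86 feature flags that compiled numeric code often relies on.
FLAGS_OF_INTEREST = ("sse4_2", "avx", "avx2", "fma", "avx512f")


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=FLAGS_OF_INTEREST):
    """List the wanted flags this CPU does not advertise, in the order given."""
    have = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in have]


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        print("missing:", missing_flags(path.read_text()) or "none")
    else:
        print("no /proc/cpuinfo on this system")
```

Running it on both servers and diffing the output is a quick way to test the hypothesis.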