hardmaru / WorldModelsExperiments

World Models Experiments

python train.py gives a CalledProcessError #5

Closed kessler-frost closed 6 years ago

kessler-frost commented 6 years ago

When I run python train.py on the specified CPU system, I get a very long error message ending with:

Traceback (most recent call last):
  File "train.py", line 450, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os.exit()
  File "train.py", line 424, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134

I searched for the exit status for mpirun but wasn't able to debug the issue.
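On the exit status itself: by common shell convention, a status above 128 means the process was killed by signal number (status - 128). Assuming mpirun follows that convention here, 134 would correspond to signal 6, which is SIGABRT on Linux, i.e. one of the launched processes aborted. A minimal sketch for decoding such a status (the helper name describe_exit_status is made up for illustration):

```python
import signal


def describe_exit_status(status):
    """Return a rough, human-readable reading of a shell-style exit status."""
    if status == 0:
        return "success"
    if status > 128:
        # Shell convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            return "exit status {} (no signal numbered {})".format(status, signum)
        return "terminated by {} (signal {})".format(name, signum)
    return "exited with error code {}".format(status)


print(describe_exit_status(134))
```

This only narrows things down to "something aborted"; the actual cause is usually further up in the long error output, before the final traceback.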

hardmaru commented 6 years ago

Have you tried running ESTool with a simple experiment to see if MPI is installed OK?

Also, I think it is configured for a 64-core machine. If you are using fewer cores, pass in a flag to specify the number of workers (instructions are in ESTool or the blog posts).


kessler-frost commented 6 years ago

Yes, I've tried running ESTool with a simple experiment from your estool repo using

python train.py bullet_racecar -n 8 -t 4

and it ran without any issue or error. After installing pybullet I even tried

python train.py bullet_ant -e 16 -n 64 -t 4

and it too ran successfully. But I was still unable to do the same with doom. And yes, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.

davidADSP commented 6 years ago

Could be related to this: https://github.com/AppliedDataSciencePartners/WorldModels/issues/3

Ensure you've only got one MPI library on your machine (e.g. try running this if you're on Linux):

sudo apt-get remove openmpi-bin

If you have multiple MPIs, then comm.Get_size() returns 1, so the following assert statement fails:

num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
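To make that failure easier to diagnose, the bare assert could be swapped for a check that explains itself. This is only a sketch: check_world_size is a hypothetical helper, not part of either repository, and it takes plain integers so it can be tried without MPI installed. In the real script the arguments would be comm.Get_size() and len(packet_list).

```python
def check_world_size(world_size, num_packets):
    """Raise a descriptive error when the MPI world size cannot match the packets.

    world_size  -- what comm.Get_size() returned
    num_packets -- len(packet_list), one packet per worker
    """
    if world_size <= 1:
        # Symptom described above: every process believes it is alone.
        raise RuntimeError(
            "MPI reports only {} process(es). The mpirun that launched this "
            "script may not match the MPI library mpi4py was built "
            "against.".format(world_size)
        )
    if num_packets != world_size - 1:
        raise RuntimeError(
            "Expected {} packets (one per worker) but got {}.".format(
                world_size - 1, num_packets
            )
        )


# 65 processes means one master plus 64 workers, so 64 packets pass the check.
check_world_size(65, 64)
```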

kessler-frost commented 6 years ago

Tried that, but it opens up a new box of errors, such as FileNotFoundError: [Errno 2] No such file or directory: 'mpirun', or a message that lib12.so was not found, and similar.

Interestingly though, when I tried changing the number of cores from 64 to 32 or 24 by executing

python train.py -n 32

it started giving me the right thing:

('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))

I guess the issue only appears when we use 64 cores (which is odd).

davidADSP commented 6 years ago

The number of workers has to be less than the number of cores. How many cores have you got?

Try uninstalling Open MPI and installing mpich instead:

sudo apt-get install mpich
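Taking the "workers must be fewer than cores" rule at face value, a value for -n can be picked from the machine's core count. Note that the traceback at the top of the thread shows mpirun being started with -np 65 for 64 workers (train.py calls mpi_fork(args.num_worker+1)), so 64 workers on 64 cores means 65 processes. The sketch below is an assumption-laden illustration; clamp_workers is a made-up helper name.

```python
import os


def clamp_workers(requested, cores=None):
    """Limit the worker count so that workers + 1 master fit on the cores."""
    if cores is None:
        # os.cpu_count() can return None on unusual platforms.
        cores = os.cpu_count() or 1
    # Leave one core for the master process; never go below one worker.
    return max(1, min(requested, cores - 1))


print(clamp_workers(64, cores=64))
print(clamp_workers(24, cores=64))
```

Later in this thread a 24-core machine is reported to run -n 24 successfully, so treat this as a conservative default rather than a hard requirement.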

kessler-frost commented 6 years ago

I. I've tried the following combinations, which seemed to work (without uninstalling openmpi):

  1. 64 Core proc, python train.py -n 24 or python train.py -n 32
  2. 24 Core proc, python train.py -n 24

II. Combinations which did not work:

with openmpi -

  1. 64 Core proc, python train.py

with mpich -

  1. 64 Core proc, python train.py
  2. 64 Core proc, python train.py -n 32

Also, I'm using Anaconda 4.2 in all of my experiments because Python 3.6 was causing issues with the boost libraries. If someone can perform a clean installation of all the project dependencies on a 64-core machine, I'd suggest they try the solution from @davidADSP, as I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.

davidADSP commented 6 years ago

Do you get the same problems with the car racing task or is it just doom?

kessler-frost commented 6 years ago

I don't know about a 64-core proc, but on the 24-core one python train.py -n 24 executes successfully for the car racing task. For a while this issue was also present on the 24-core processor, but I was able to work around it by installing things in this particular order:

pip install tensorflow==1.8 gym==0.9.4 cma==2.2

conda install libgcc

apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git

pip install mpi4py==2

pip install ppaquette-gym-doom

Were you able to reproduce this error?

I think this issue is caused by an error in any one of the threads while they execute. When I observed this carefully, I found that there were different reasons. For example, one time I got the same AssertionError you referenced:

Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError

Another time I got this in the middle of a whole screen of text:

ImportError: libXft.so.2: cannot open shared object file: No such file or directory

So I guess this is caused by dependency issues (the same one across all the threads).
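A quick way to test the dependency theory before launching dozens of workers is to ask the system linker which shared libraries it can locate. The sketch below uses ctypes.util.find_library from the standard library; missing_libraries is a hypothetical helper, and "Xft" is used only because libXft.so.2 appears in the ImportError above. It checks the system search path, which may differ from what a conda environment actually loads.

```python
from ctypes.util import find_library


def missing_libraries(names):
    """Return the library names the system linker cannot locate.

    Names are given without the 'lib' prefix or '.so' suffix,
    e.g. 'Xft' for libXft.so.2.
    """
    return [name for name in names if find_library(name) is None]


# An empty list means every named library was found.
print(missing_libraries(["Xft"]))
```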

kessler-frost commented 6 years ago

Now I've come across a new error. I created a completely new instance, did the installation as described above, then executed python train.py, and this occurred:

RuntimeError: can't start new thread

I guess all of the other errors were resolved by doing a clean installation in that order.
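One possible cause of "can't start new thread", not confirmed anywhere in this thread, is the per-user limit on processes and threads (what ulimit -u reports on Linux), which is easier to hit when 65 Python processes each start their own threads. A small sketch for inspecting that limit with the standard resource module (Unix only; thread_limits is a made-up helper name):

```python
import resource
import threading


def thread_limits():
    """Report the per-user process/thread limit and this process's thread count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        # resource.RLIM_INFINITY means "no limit".
        "soft_limit": soft,
        "hard_limit": hard,
        "threads_in_this_process": threading.active_count(),
    }


print(thread_limits())
```

If the soft limit turns out to be low relative to the number of workers, raising it or running fewer workers would be the things to try; available memory is another common culprit.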

hardmaru commented 6 years ago

Hi @kessler-frost

I'm not sure how to resolve this to be honest. The only diff I see is the python version I used (3.5.2)

I ran train.py today on a fresh machine (to check another issue on another thread) for ~ half a day and it seemed to work on my machine:

https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt

kessler-frost commented 6 years ago

@hardmaru thank you. I don't understand why this is happening either; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.

Antonio-git-lab commented 4 years ago

Hello, while running train.py I got this error. Can someone help me, please?

  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in <module>
    if "parent" == mpi_fork(args.num_worker+1): os._exit()
  File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 419, in mpi_fork
    subprocess.check_call(["mpirun", "-np", str(n), sys.executable] +['-u']+ sys.argv, env=env)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 266, in check_call
    retcode = call(*popenargs, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 247, in call
    with Popen(*popenargs, **kwargs) as p:
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 676, in __init__
    restore_signals, start_new_session)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35\lib\subprocess.py", line 957, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
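The FileNotFoundError above is raised by subprocess when the program it was asked to start, here mpirun, cannot be found on the PATH. A pre-flight check makes that explicit before train.py tries to fork. This is a sketch with a hypothetical helper, find_launcher; the assumption that an MPI installation might provide mpiexec rather than mpirun (as some Windows MPI distributions do) should be verified for your setup, and train.py itself hard-codes "mpirun".

```python
import shutil


def find_launcher(candidates=("mpirun", "mpiexec")):
    """Return the full path of the first MPI launcher found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


launcher = find_launcher()
if launcher is None:
    print("No MPI launcher found on PATH; install MPI or fix PATH first.")
else:
    print("Found MPI launcher at", launcher)
```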