Closed: kessler-frost closed this issue 6 years ago.
Have you tried running ESTool with a simple experiment to see if MPI is installed OK?
Also, I think it is configured for a 64-core machine. If you are using fewer cores, pass in a flag to specify the number (instructions are in ESTool or the blog posts).
On Tue, Jul 17, 2018 at 2:01 PM Sankalp Sanand notifications@github.com wrote:

> When I run python train.py on the specified CPU system I get a very long error message ending with:
>
> Traceback (most recent call last):
>   File "train.py", line 450, in <module>
>     if "parent" == mpi_fork(args.num_worker+1): os.exit()
>   File "train.py", line 424, in mpi_fork
>     subprocess.check_call(["mpirun", "-np", str(n), sys.executable] + ['-u'] + sys.argv, env=env)
>   File "/home/neptune/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['mpirun', '-np', '65', '/home/neptune/anaconda3/bin/python', '-u', 'train.py']' returned non-zero exit status 134
>
> I searched for the exit status for mpirun but wasn't able to debug the issue.
Yes, I've tried running ESTool with a simple experiment from your estool repo using

python train.py bullet_racecar -n 8 -t 4

and it ran without any issue or error. I even tried

python train.py bullet_ant -e 16 -n 64 -t 4

after installing pybullet, and it too ran successfully. But I was still unable to do the same on Doom. And yeah, I am using a 64-core machine with 200GB RAM on gcloud for all of the experiments, just as you mentioned in the blog post.
Could be related to this: https://github.com/AppliedDataSciencePartners/WorldModels/issues/3
Ensure you've only got one MPI library on your machine (i.e. try running this if you're on Linux):
sudo apt-get remove openmpi-bin
If you have multiple MPI installations, comm.Get_size() returns 1, so the following assert statement fails:
num_worker = comm.Get_size()
assert len(packet_list) == num_worker-1
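A quick way to test which MPI mpi4py is actually linked against (a minimal sketch; check_mpi.py is just a hypothetical filename):

# check_mpi.py (hypothetical filename): if mpi4py and mpirun come from
# the same MPI installation, `mpirun -np 4 python check_mpi.py` prints
# size 4 on each rank; four lines of "size 1" means mismatched installs.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank", comm.Get_rank(), "size", comm.Get_size())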
I tried that, but it opens up a new set of errors, such as
FileNotFoundError: [Errno 2] No such file or directory: 'mpirun'
or
a message along the lines of "lib12.so was not found".
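For the FileNotFoundError above, a minimal check that an mpirun binary is still on the PATH after removing packages:

# shutil.which returns the full path of the mpirun that subprocess
# would invoke, or None if the uninstall left no mpirun on PATH
import shutil

print(shutil.which("mpirun"))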
Interestingly though, when I tried changing the worker count from 64 to 32 or 24 by executing, for example,
python train.py -n 32
it started giving me the expected output:
('doomrnn', (1, 35, 269.67, 149.75, 480.81, 69.88, 0.09914, 269.67, 480))
I guess the issue only comes up when we use 64 cores (which is odd).
The number of workers has to be less than the number of cores. How many cores have you got?
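One thing worth checking (a guess from the numbers above, not a confirmed cause): train.py launches num_worker + 1 ranks, so -n 64 asks mpirun for 65 ranks on a 64-core machine. A quick way to see the core count Python reports:

# quick check of how many cores are visible to Python; note that
# -n 64 means 65 MPI ranks (num_worker + 1), one more than a
# 64-core machine has (a guess at the failure, not confirmed)
import multiprocessing

print("cores:", multiprocessing.cpu_count())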
Try uninstalling Open MPI and installing MPICH instead:
sudo apt-get install mpich
I. I've tried the following combinations, which seemed to work (without uninstalling openmpi):

with openmpi - python train.py -n 24 or python train.py -n 32
with mpich - python train.py -n 24

II. The ones which did not work include:

with openmpi - python train.py
with mpich - python train.py and python train.py -n 32
Also, I'm using Anaconda 4.2 in all of my experiments, because Python 3.6 was causing issues with the boost libraries. If possible, someone should do a clean installation of all the project dependencies on a 64-core machine and try the solution by @davidADSP, as I've exhausted all of my gcloud credits and am stuck with a 24-core machine on a new account.
Do you get the same problems with the car racing task or is it just doom?
I don't know about a 64-core processor, but on 24 cores

python train.py -n 24

executes successfully for the car racing task. For a while this issue was also present on the 24-core processor, but I was able to work around it by installing things in this particular order (a version sanity check follows the list):
pip install tensorflow==1.8 gym==0.9.4 cma==2.2
conda install libgcc
apt-get install -y python-numpy cmake zlib1g-dev libjpeg-dev libboost-all-dev gcc libsdl2-dev wget unzip git
pip install mpi4py==2
pip install ppaquette-gym-doom
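As noted above, here is a small sanity check to run after installing in that order; it just confirms the pinned versions resolved (the "expect" values come from the pins in the commands, and gym exposes its version via gym.version):

# confirm the pinned versions from the install commands above
# actually resolved in the active environment
import cma
import mpi4py
import tensorflow
from gym import version as gym_version

print("tensorflow", tensorflow.__version__)  # expect 1.8.x
print("gym", gym_version.VERSION)            # expect 0.9.4
print("cma", cma.__version__)                # expect 2.2.x
print("mpi4py", mpi4py.__version__)          # expect 2.x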
Were you able to reproduce this error?
I think this issue is caused by an error in any one of the threads during execution. When I observed it carefully, I found that there were different reasons. For example, one time I got the same AssertionError you referenced:
Traceback (most recent call last):
  File "05_train_controller.py", line 461, in <module>
    main(args)
  File "05_train_controller.py", line 410, in main
    master()
  File "05_train_controller.py", line 319, in master
    send_packets_to_slaves(packet_list)
  File "05_train_controller.py", line 233, in send_packets_to_slaves
    assert len(packet_list) == num_worker-1
AssertionError
Then, another time, I got this in the middle of a whole screen of text:
ImportError: libXft.so.2: cannot open shared object file: No such file or directory
So I guess this is being caused by dependency issues (the same one across all the threads).
Now I've come across a new error: I created a completely new instance, did the installation as mentioned above, executed python train.py, and this occurred:
RuntimeError: can't start new thread
I guess all of the other errors were resolved by doing a clean installation in that order.
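For that last RuntimeError, one possible cause (an assumption, not confirmed here) is hitting the OS limit on processes/threads; on Linux it can be inspected like this:

# "can't start new thread" can mean the RLIMIT_NPROC ceiling was hit,
# since each thread counts against it on Linux (an assumption for
# this case, not a confirmed diagnosis)
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("process/thread limit (soft, hard):", soft, hard)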
Hi @kessler-frost
I'm not sure how to resolve this, to be honest. The only difference I see is the Python version I used (3.5.2).
I ran train.py today on a fresh machine (to check another issue in another thread) for about half a day, and it seemed to work on my machine:
https://github.com/hardmaru/WorldModelsExperiments/blob/master/doomrnn/trainlog/train.log.txt
@hardmaru thank you. Even I don't understand why this is happening; we are both using the same Anaconda distribution (Python 3.5.2). I guess I'll close this issue until someone comes across it again.
Hello, while I am running train.py I got this error. Can someone help me please?

File "c:\Users\User\Desktop\GIT\WorldModelsExperiments-master\carracing\train.py", line 445, in