rll / rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

Failed cluster_demo.py on EC2 #234

Open brucewayne1248 opened 6 years ago

brucewayne1248 commented 6 years ago

I tried to run cluster_demo.py on EC2. The instance starts fine but is terminated shortly afterwards. stdout.log contains the following output and traceback:

sync initiated
log sync initiated
Running in docker
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:126] Couldn't open CUDA library libcuda.so.1. LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 83526cf8e682
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: Permission denied: could not open driver version path for reading: /proc/driver/nvidia/version
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1065] LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1066] failed to find libcuda.so on this system: Failed precondition: could not dlopen DSO: libcuda.so.1; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
using seed 1
2018-05-31 09:18:27.844271 UTC | Setting seed to 1
using seed 1
/opt/conda/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/root/code/rllab/scripts/run_experiment_lite.py", line 120, in run_experiment
    method_call = cloudpickle.loads(base64.b64decode(args.args_data))
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 800, in _make_skel_func
    closure = _reconstruct_closure(closures) if closures else None
  File "/opt/conda/envs/rllab3/lib/python3.5/site-packages/cloudpickle/cloudpickle.py", line 792, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

Any help? If additional information is necessary, I am ready to provide it.
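
The traceback fails inside cloudpickle.loads while rebuilding a function closure. One thing that may be worth ruling out is a version mismatch between the cloudpickle that serialized the experiment on the local machine and the one that deserializes it inside the Docker image. That is an assumption, not something the log confirms. A minimal sketch for collecting comparable version information on both sides follows; the helper names installed_version and environment_report are hypothetical and not part of rllab.

```python
import importlib
import sys


def installed_version(module_name):
    """Return module.__version__ as a string, or None if it cannot be determined."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is not installed in this environment.
        return None
    version = getattr(module, "__version__", None)
    return None if version is None else str(version)


def environment_report(module_names=("cloudpickle",)):
    """Collect the interpreter version and the versions of the given modules."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in module_names:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    # Run once locally and once inside the container, then compare the output.
    for key, value in sorted(environment_report().items()):
        print("{}: {}".format(key, value))
```

Running the same script on the submitting machine and inside the container gives two reports that can be compared line by line. If the cloudpickle versions differ, aligning them would be a reasonable first experiment.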