Multiple Errors on CPU and GPU during Training

Hi Guillaume,

First of all, thanks for the enlightening work on PRIMAL.

I cloned the code and attempted to train a new model by the Jupyter notebook on a NVIDIA DGX Station. Model inference works fine, so I proceed to attempting to train my own model. As per NVIDIA's instructions, I created a new Docker image and run the code inside a Docker container. This is followed by installation of all dependencies and compilation of cpp_mstar. This doesn't work and I got different errors almost every time when I run the notebook.

CUBLAS_STATUS_NOT_INITIALIZED

2019-06-21 07:32:02.022871: E tensorflow/stream_executor/cuda/cuda_blas.cc:524] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
CUDA OOM Error

2019-06-21 06:13:38.667886: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 33554432 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
On rare occasions when it can be run - std::system_error: Resource temporarily not available
On rare occasions when it can be run - IndexError: too many indices for array

Exception in thread Thread-2: Traceback (most recent call last): File "/lib/python3.6/threading.py", line 916, in _bootstrap_inner self.run() File "/lib/python3.6/threading.py", line 864, in run self._target(*self._args, **self._kwargs) File "DRLMAPF_A3C_RNN.py", line 593, in worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver) File "DRLMAPF_A3C_RNN.py", line 262, in work i_l=self.train(rollouts[self.metaAgentID][self.agentID-1], sess, gamma, None,imitation=True) File "DRLMAPF_A3C_RNN.py", line 103, in train self.local_AC.inputs:np.stack(rollout[:,0]), IndexError: too many indices for array

There are several changes I've made over the past week to try to make it run properly, including decreasing number of threads and/or number of meta-agents but these doesn't help. Converting the notebook to .py script also fails.

CPU training fails too with ResourceExhausedError:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1024,2048] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [[node gradients_5/worker_3/qvalues/rnn/while/basic_lstm_cell/MatMul_grad/MatMul_1 (defined at /distributedRL/ACNet.py:67) ]]

which doesn't make much sense with 256GB RAM equipped in the server.

It would be great if you can provide some assistance to tackle the issues.

Best, Herman

gsartoretti / PRIMAL

Multiple Errors on CPU and GPU during Training #3