gsartoretti / PRIMAL

PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning -- Distributed RL/IL code for Multi-Agent Path Finding (MAPF)
MIT License

Multiple Errors on CPU and GPU during Training #3

Closed: hmhyau closed this issue 5 years ago

hmhyau commented 5 years ago

Hi Guillaume,

First of all, thanks for the enlightening work on PRIMAL.

I cloned the code and attempted to train a new model via the Jupyter notebook on an NVIDIA DGX Station. Model inference works fine, so I proceeded to train my own model. As per NVIDIA's instructions, I created a new Docker image and ran the code inside a Docker container, then installed all dependencies and compiled cpp_mstar. Training doesn't work: I get a different error almost every time I run the notebook.

  1. CUBLAS_STATUS_NOT_INITIALIZED

    2019-06-21 07:32:02.022871: E tensorflow/stream_executor/cuda/cuda_blas.cc:524] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

  2. CUDA OOM Error

    2019-06-21 06:13:38.667886: E tensorflow/stream_executor/cuda/cuda_driver.cc:868] failed to alloc 33554432 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory

  3. On the rare occasions when training does start - std::system_error: Resource temporarily unavailable

  4. On the rare occasions when training does start - IndexError: too many indices for array

    Exception in thread Thread-2:
    Traceback (most recent call last):
      File "/lib/python3.6/threading.py", line 916, in _bootstrap_inner
        self.run()
      File "/lib/python3.6/threading.py", line 864, in run
        self._target(*self._args, **self._kwargs)
      File "DRLMAPF_A3C_RNN.py", line 593, in <lambda>
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
      File "DRLMAPF_A3C_RNN.py", line 262, in work
        i_l=self.train(rollouts[self.metaAgentID][self.agentID-1], sess, gamma, None,imitation=True)
      File "DRLMAPF_A3C_RNN.py", line 103, in train
        self.local_AC.inputs:np.stack(rollout[:,0]),
    IndexError: too many indices for array
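
In case it helps, here is a minimal, hypothetical reproduction of error 4 (not code from the repo): assuming the rollout list somehow ends up empty, np.array() of it is one-dimensional, so indexing it with two axes raises exactly this IndexError:

    import numpy as np

    # Hypothetical reproduction, not from the PRIMAL code: an empty rollout
    # becomes a 1-D array, so two-axis indexing fails the same way.
    rollout = np.array([])        # e.g. no transitions were collected
    try:
        np.stack(rollout[:, 0])   # mirrors rollout[:,0] in the train() call
    except IndexError as e:
        print(e)                  # -> "too many indices for array"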

I've made several changes over the past week to try to make it run properly, including decreasing the number of threads and/or the number of meta-agents, but these don't help. Converting the notebook to a .py script also fails.

CPU training fails too, with a ResourceExhaustedError:

    ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1024,2048] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
      [[node gradients_5/worker_3/qvalues/rnn/while/basic_lstm_cell/MatMul_grad/MatMul_1 (defined at /distributedRL/ACNet.py:67) ]]

which doesn't make much sense given the 256 GB of RAM in the server.
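
For reference, the only generic TensorFlow 1.x mitigation I am aware of for GPU errors like 1 and 2 is to let the session grow GPU memory on demand instead of pre-allocating it; this is just a sketch, not code from the PRIMAL notebook:

    import tensorflow as tf

    # Generic TF 1.x session setup (sketch, not from the PRIMAL notebook):
    # allow_growth stops TF from grabbing all GPU memory up front, which
    # often avoids CUBLAS_STATUS_NOT_INITIALIZED / CUDA_ERROR_OUT_OF_MEMORY
    # when several workers share one GPU.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)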

It would be great if you could provide some assistance with these issues.

Best, Herman

hmhyau commented 5 years ago

After a few days of digging I've located the origin of these problems: the hard 8 GB address-space limit set for ODrM* is too small. Once I removed these two lines in cython_od_mstar.pyx and recompiled, everything works fine as long as there is enough RAM.

    # Comment out or remove these two lines:
    # import resource
    # resource.setrlimit(resource.RLIMIT_AS, (2**33, 2**33))  # 8 GB
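
If you would rather keep some cap in place, an alternative (just a sketch, not something in the repo) is to raise the limit instead of dropping it entirely, e.g. to 32 GB:

    import resource

    # Hypothetical alternative: keep an address-space cap, but a larger one,
    # so a pathological ODrM* search still cannot exhaust the whole machine.
    # 2**35 bytes == 32 GiB.
    resource.setrlimit(resource.RLIMIT_AS, (2**35, 2**35))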

Closing the issue, as it is now solved.