gsartoretti / PRIMAL

PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning -- Distributed RL/IL code for Multi-Agent Path Finding (MAPF)
MIT License

Unexpected Suspension occurs while training #6

Open cloud-tifa opened 4 years ago

cloud-tifa commented 4 years ago

Hi Guillaume,

First of all, thanks for the enlightening work on PRIMAL.

I cloned the code and am trying to train a new model using a .py file converted from the .ipynb notebook. Model inference works fine, so after installing all dependencies and compiling cpp_mstar, I went on to train my own model. Training starts, but then hangs: GPU and memory stay occupied while the CPU is idle. The training program reports no error or exception. This happens in almost every training run, after a random number of episodes.
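One way to narrow down where a silent hang like this occurs is a watchdog timer that dumps the stack of every thread when no episode finishes within a given time. The following is only a minimal sketch using Python's standard `faulthandler` module. The helper names `arm_watchdog` and `disarm_watchdog` and the 600-second default are hypothetical and not part of PRIMAL; the timeout would need to be longer than the slowest normal episode.

```python
import faulthandler
import sys


def arm_watchdog(timeout_s=600.0, log_file=sys.stderr):
    """(Re)start the hang watchdog.

    If this function is not called again within timeout_s seconds,
    faulthandler writes the traceback of every thread to log_file.
    Calling it again replaces the pending timer, so it can serve as a
    heartbeat at the end of each episode. log_file must stay open and
    have a real file descriptor (fileno()).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=log_file)


def disarm_watchdog():
    """Cancel the pending dump, e.g. when training ends normally."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_watchdog()` once before training starts and again after every finished episode would leave a traceback in the log only when the workers stop making progress, showing which call each thread is blocked in.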

What I have modified is:

import tensorflow as tf
import keras.backend.tensorflow_backend as KTF

# Allocate GPU memory on demand, capped at 80% of the device.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.8
sess = tf.Session(config=config)
KTF.set_session(sess)  # make Keras use this session

EXPERIENCE_BUFFER_SIZE = 64  # default is 128
NUM_META_AGENTS        = 2   # default is 3

I have already created a conda environment for PRIMAL with cuda=10.0, cudnn=7.6.5, and tensorflow-gpu=1.14.

It would be great if you could provide some assistance with this issue.

Best Wishes, Hongjun