I'm working with @mrektor on a very similar project, using a different machine, and I am experiencing the same kind of problems.
Can you specify your Keras, TensorFlow, CUDA and cuDNN versions?
@ParikhKadam As is already shown in the post (see the output of tf_env.txt):
Keras 2.2.4, TF 1.11.0, CUDA 9.2, cuDNN 7.2.1
We fixed the problem with a fresh installation of CUDA, cuDNN & TensorFlow.
EDIT: the fresh installation has been done on Ubuntu 18.04
I'm using Keras to train a convolutional neural network with the fit_generator function, since the whole dataset of images is stored in .npy files and doesn't fit in memory. With fit() I didn't have any problems (using a small subset of the entire dataset), but after some experiments with fit_generator my scripts started showing strange behaviour: usually I'm not able to train the model because it gets stuck in the middle of the first epoch, or it crashes with 'GPU sync failed', but most of the time with 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below).
The training using the CPUs works well but of course it is slower.
To implement the custom generator I followed the best practice described in https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly,
as shown below:
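For reference, this is a minimal sketch of the generator, essentially the pattern from that blog post (the data path, dimensions and default parameters here are placeholders rather than my exact code):

```python
import numpy as np
import keras


class DataGenerator(keras.utils.Sequence):
    """Loads batches of samples stored as individual .npy files on disk."""

    def __init__(self, list_IDs, labels, batch_size=32, dim=(128, 128),
                 n_channels=1, n_classes=2, shuffle=True):
        self.list_IDs = list_IDs      # list of sample identifiers
        self.labels = labels          # dict: ID -> class label
        self.batch_size = batch_size
        self.dim = dim
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # number of batches per epoch
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        # pick the IDs belonging to this batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        return self.__data_generation(list_IDs_temp)

    def on_epoch_end(self):
        # reshuffle the sample order after each epoch
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        # load one .npy file per sample and assemble the batch
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size,), dtype=int)
        for i, ID in enumerate(list_IDs_temp):
            X[i,] = np.load('data/' + ID + '.npy')  # placeholder path
            y[i] = self.labels[ID]
        return X, keras.utils.to_categorical(y, num_classes=self.n_classes)
```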
TensorFlow and Keras were installed with conda:
conda install -c conda-forge keras
I used the script https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh to collect the following information:
Here is the tf_env.txt
I looked around everywhere on the internet but didn't find anyone with this problem (or any solution). My hypothesis is that aborting a fit_generator job leaves phantom threads hanging around the machine. I tested this idea by rebooting the system, but sometimes that works and sometimes it doesn't.
The problem is that this is a sort of "random" bug, in the sense that I can't identify a deterministic cause for this behaviour. I tried to play with every argument of the fit_generator() function, without any success; see for example the combination sketched below.
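One of the more conservative combinations I tried was along these lines (illustrative values only; `model` and `training_generator` are the ones from the training script that follows):

```python
# single worker, no multiprocessing, minimal prefetch queue,
# to rule out threading issues in the generator pipeline
model.fit_generator(generator=training_generator,
                    epochs=10,
                    workers=1,
                    use_multiprocessing=False,
                    max_queue_size=1)
```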
This is an example of the training script:
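Here is a stripped-down sketch of it (the architecture, hyper-parameters and the partition/labels setup are placeholders standing in for my real network and dataset):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

from my_classes import DataGenerator  # the Sequence subclass shown above

# placeholder split: in the real script these come from the dataset index
partition = {'train': ['id-%d' % i for i in range(256)],
             'validation': ['id-%d' % i for i in range(256, 320)]}
labels = {ID: 0 for ID in partition['train'] + partition['validation']}

params = {'dim': (128, 128), 'batch_size': 32,
          'n_classes': 2, 'n_channels': 1, 'shuffle': True}

training_generator = DataGenerator(partition['train'], labels, **params)
validation_generator = DataGenerator(partition['validation'], labels, **params)

# placeholder CNN, much smaller than the real model
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(params['n_classes'], activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# training hangs or crashes with CUDA_ERROR_LAUNCH_FAILED somewhere in here
model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    epochs=20,
                    use_multiprocessing=True,
                    workers=4)
```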
Which produced the following error:
Any ideas of what might be causing this issue?