keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

strange problems with fit_generator: CUDA_ERROR_LAUNCH_FAILED #12003

Closed mrektor closed 3 years ago

mrektor commented 5 years ago

I'm using Keras to train a convolutional neural network with the fit_generator function, since the whole dataset of images is stored in .npy files and doesn't fit in memory. While I had no problems with fit() (using a small subset of the entire dataset), after some experiments with fit_generator my scripts started showing strange behaviour: usually I'm not able to train the model because it gets stuck in the middle of the first epoch, or it crashes with 'GPU sync failed', but most of the time with 'CUDA_ERROR_LAUNCH_FAILED' (see the logs below).

Training on the CPU works well, but of course it is slower.
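For the record, a CPU-only run can be forced by hiding the GPU before TensorFlow is imported (a minimal sketch; not necessarily exactly how I set up the CPU test):

# Sketch of a CPU-only run: hide the GPU before TensorFlow is imported.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # must be set before the first TensorFlow import

import tensorflow as tf  # now sees no GPU devices; the same training code runs on CPU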

To implement the custom generator I followed the best practice described in https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly, as shown below:

import numpy as np
from keras.utils import Sequence

class MAGIC_Generator(Sequence):
    def __init__(self, list_IDs, labels, batch_size=32, dim=(67, 68, 4), position=False, shuffle=True,
                 folder='path/npy_dump/all_npy'):
        'Initialization'
        self.dim = dim
        self.batch_size = batch_size
        self.labels = labels
        self.list_IDs = list_IDs
        # self.n_channels = n_channels
        # self.n_classes = n_classes
        self.shuffle = shuffle
        self.folder = folder
        self.position = position
        self.on_epoch_end()

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim))
        if self.position:
            y = np.empty((self.batch_size, 2), dtype=int)
        else:
            y = np.empty((self.batch_size), dtype=int)

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load(self.folder + '/' + ID + '.npy')

            # Store class
            y[i] = self.labels[ID]

        return X, y

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y
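For context, the generator is used roughly like this (a minimal sketch with made-up IDs and labels; the real ones come from the MAGIC dataset and are built by load_data_generators in the training script further down):

# Usage sketch with placeholder IDs and labels (not my real dataset).
list_IDs = ['event_{}'.format(i) for i in range(1000)]   # one ID per .npy file on disk
labels = {ID: 0 for ID in list_IDs}                      # dummy targets

train_gen = MAGIC_Generator(list_IDs=list_IDs,
                            labels=labels,
                            batch_size=32,
                            dim=(67, 68, 4),
                            folder='path/npy_dump/all_npy')

# model.fit_generator(generator=train_gen, epochs=10)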

TensorFlow and Keras were installed with conda: conda install -c conda-forge keras

I used this script https://github.com/tensorflow/tensorflow/blob/master/tools/tf_env_collect.sh to collect the following information:

Here is the tf_env.txt output:


Keras 2.2.4.

== cat /etc/issue ===============================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
VERSION="7 (Core)"
VERSION_ID="7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

== are we in docker =============================================
No

== compiler =====================================================
c++ (GCC) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux liph01.novalocal 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
msgpack-numpy                      0.4.3.2    
numpy                              1.15.3     
numpydoc                           0.8.0      
protobuf                           3.6.0      
tensorflow                         1.11.0     

== check for virtualenv =========================================
False

== tensorflow import ============================================
tf.VERSION = 1.11.0
tf.GIT_VERSION = b'unknown'
tf.COMPILER_VERSION = b'unknown'

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
Thu Jan  3 17:38:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:07.0 Off |                  N/A |
| 40%   65C    P2    94W / 250W |  11747MiB / 12196MiB |     90%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     16991      C   python                                     11737MiB |
+-----------------------------------------------------------------------------+

== cuda libs  ===================================================
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart_static.a
/usr/local/cuda-9.2/targets/x86_64-linux/lib/libcudart.so.9.2.148
/usr/local/cuda-9.2/doc/man/man7/libcudart.7
/usr/local/cuda-9.2/doc/man/man7/libcudart.so.7

I looked everywhere on the internet but didn't find anyone with these problems (or any solution). My hypothesis is that when a fit_generator job is aborted, phantom threads are left around on the machine. I tested this idea by rebooting the system, but sometimes that works and sometimes it doesn't.

The problem is that this is sort of a "random" bug, in the sense that I can't identify a deterministic cause for this behaviour. I tried playing with every argument of the fit_generator() function, without any success.
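To check the phantom-process idea more directly than by rebooting, something like the sketch below could list Python processes still alive after an aborted run (it assumes the third-party psutil package, which is not part of the environment listed above):

# Sketch: look for leftover Python worker processes after an aborted run.
# Assumes psutil is installed (pip install psutil); it is not in the env above.
import psutil

for proc in psutil.process_iter(attrs=['pid', 'name', 'cmdline']):
    name = proc.info['name'] or ''
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'python' in name.lower():
        print(proc.info['pid'], cmdline)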

This is an example of the training script:

from keras.callbacks import ModelCheckpoint, EarlyStopping

from CNN4MAGIC.CNN_Models.BigData.clr import OneCycleLR
from CNN4MAGIC.Generator.gen_util import load_data_generators
from CNN4MAGIC.Generator.models import MobileNetV2_energy_doubleDense
import numpy as np

print('Loading the Neural Network...')
model = MobileNetV2_energy_doubleDense()
model.compile(optimizer='sgd', loss='mse')
model.summary()
print('Model Loaded.')

#%%
BATCH_SIZE = 128
train_gn, val_gn, test_gn, energy_te = load_data_generators(batch_size=BATCH_SIZE, want_energy=True)

# %% Train
EPOCHS = 30

net_name = 'MobileNetV2_energy_doubleDense-900kTrain'
path = '/path/' + net_name + '.hdf5'
check = ModelCheckpoint(filepath=path, save_best_only=True)
clr = OneCycleLR(max_lr=5e-3,
                 num_epochs=EPOCHS,
                 num_samples=len(train_gn),
                 batch_size=BATCH_SIZE)
stop = EarlyStopping(patience=2)

result = model.fit_generator(generator=train_gn,
                             validation_data=val_gn,
                             epochs=EPOCHS,
                             verbose=1,
                             callbacks=[check, clr, stop],
                             use_multiprocessing=True,
                             workers=8
                             )

Which produced the following error:

[...]
419/7102 [>.............................] - ETA: 50:49 - loss: 0.6377
 420/7102 [>.............................] - ETA: 50:49 - loss: 0.6368
 425/7102 [>.............................] - ETA: 50:53 - loss: 0.6324
 426/7102 [>.............................] - ETA: 50:46 - loss: 0.6314
 443/7102 [>.............................] - ETA: 50:28 - loss: 0.6174
 466/7102 [>.............................] - ETA: 49:54 - loss: 0.5995
2019-01-08 16:54:36.914175: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-01-08 16:54:36.914524: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1

Any ideas of what might be causing this issue?

nicolamarinello commented 5 years ago

I'm working with @mrektor on a very similar project, using a different machine, and I'm experiencing the same kind of problems.

ParikhKadam commented 5 years ago

Can you specify your Keras, TensorFlow, CUDA and cuDNN versions?

mrektor commented 5 years ago

@ParikhKadam As already shown in the post (see the tf_env.txt output above):

Keras 2.2.4, TF 1.11.0, CUDA 9.2, cuDNN 7.2.1
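For completeness, the Python-side versions can be double-checked like this (just a sketch; the CUDA and cuDNN versions come from the system install and, as far as I know, are not exposed cleanly by TF 1.11):

# Sketch: print the versions visible from Python.
import keras
import tensorflow as tf

print('Keras:', keras.__version__)                      # 2.2.4
print('TensorFlow:', tf.VERSION)                        # 1.11.0
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU available:', tf.test.is_gpu_available())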

nicolamarinello commented 5 years ago

We fixed the problem with a fresh installation of CUDA, cuDNN & TensorFlow.

EDIT: the fresh installation was done on Ubuntu 18.04.