keras-team / keras

Deep Learning for humans
http://keras.io/

fit_generator Segmentation fault #8225

Closed benoistlaurent closed 3 years ago

benoistlaurent commented 7 years ago

Hi,

I use model.fit_generator to handle a large dataset.

I want to read the data in batches from a source file, which I did successfully using a CSV file.

When I try to use the pandas.read_hdf function instead, Keras's fit_generator ends up with a segmentation fault:

$ python 3_wine_net_fit_generator_hdf.py                                                                                                           [16:40:46]
Using TensorFlow backend.
2017-10-23 16:43:11.345428: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345450: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345455: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-23 16:43:11.345476: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Epoch 1/10
518/519 [============================>.] - ETA: 0s - loss: 0.2543 - acc: 0.9116[1]    79974 segmentation fault  python 3_wine_net_fit_generator_hdf.py

I've already noticed that if I do not pass validation_data, I don't get the segmentation fault, but I don't understand why.

Here is a link to the small example I'm running: wine-example
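Schematically, the generator looks like this (the file name, HDF key, and 'quality' label column below are placeholders rather than the exact names from the example, and n_train, n_val, and model are assumed to be defined elsewhere):

import pandas as pd

def hdf_batch_generator(path, key, n_rows, batch_size, label_col='quality'):
    # Loop over the file forever, yielding one (features, labels) batch per step.
    while True:
        for start in range(0, n_rows, batch_size):
            # read_hdf can slice a table-format store with start/stop,
            # so only batch_size rows are held in memory at a time.
            chunk = pd.read_hdf(path, key, start=start, stop=start + batch_size)
            yield chunk.drop(label_col, axis=1).values, chunk[label_col].values

train_gen = hdf_batch_generator('wine.h5', 'train', n_train, 32)
val_gen = hdf_batch_generator('wine.h5', 'val', n_val, 32)

model.fit_generator(train_gen,
                    steps_per_epoch=n_train // 32,
                    epochs=10,
                    validation_data=val_gen,
                    validation_steps=n_val // 32)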

Any help would be very much appreciated.

Cheers, Ben

dgorissen commented 6 years ago

Did you solve this?

I have the same problem (also using fit_generator), but it happens mid-epoch, consistently within the first one to five epochs. It turns out that versions before 2.0.9 are fine; only 2.0.9 shows this behaviour. Running on TensorFlow 1.0.1.

Training model
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
Epoch 1/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.1092 - categorical_accuracy: 0.9744Epoch 00001: val_categorical_accuracy improved from -inf to 0.99166, saving model to /home/xxx/code7884/7884 [==============================] - 410s 52ms/step - loss: 0.1092 - categorical_accuracy: 0.9744 - val_loss: 0.0319 - val_categorical_accuracy: 0.9917
Epoch 2/20
7882/7884 [============================>.] - ETA: 0s - loss: 0.0438 - categorical_accuracy: 0.9893Epoch 00002: val_categorical_accuracy improved from 0.99166 to 0.99559, saving model to /home/xxx/c7884/7884 [==============================] - 410s 52ms/step - loss: 0.0438 - categorical_accuracy: 0.9893 - val_loss: 0.0151 - val_categorical_accuracy: 0.9956
Epoch 3/20
5925/7884 [=====================>........] - ETA: 1:38 - loss: 0.0342 - categorical_accuracy: 0.9917Segmentation fault (core dumped)

benoistlaurent commented 6 years ago

I still get the problem with this version of the script (using tensorflow==1.3.0 and Keras==2.0.8).

The solution I ended up with is to stop using model.fit_generator and replace it with model.train_on_batch (see this example).
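In rough form the workaround looks like the following (the file name, label column, and epoch count are placeholders, and it assumes the model is compiled with metrics=['accuracy']):

import pandas as pd

batch_size = 32

for epoch in range(10):
    # Iterate over the training set in batches, entirely on the main thread.
    for start in range(0, n_train, batch_size):
        chunk = pd.read_hdf('wine.h5', 'train', start=start, stop=start + batch_size)
        loss, acc = model.train_on_batch(chunk.drop('quality', axis=1).values,
                                         chunk['quality'].values)
    # Evaluate on the held-out set once per epoch.
    val = pd.read_hdf('wine.h5', 'val')
    val_loss, val_acc = model.evaluate(val.drop('quality', axis=1).values,
                                       val['quality'].values, verbose=0)
    print('epoch %d: loss %.4f, acc %.4f, val_loss %.4f, val_acc %.4f'
          % (epoch + 1, loss, acc, val_loss, val_acc))

Because everything runs on the main thread, no two calls to pd.read_hdf ever overlap.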

rdelassus commented 6 years ago

I just had the exact same problem. I use fit_generator and the generator is reading data from a file.

The first epoch ends with a segmentation fault:

Epoch 1/35
2694/2695 [============================>.] - ETA: 0s - loss: 0.2397 - acc: 0.9052 - jaccard_coef: 0.4410 - jaccard_coef_int: 0.5470Erreur de segmentation (core dumped)

It doesn't happen when I remove the validation_data.

With keras 1.2.2 and tensorflow 1.4.0

oxydron commented 6 years ago

Same problem.

keras: 2.0.8
tensorflow: 1.3.0
Training data: 1048
Test data: 259
Image type: 120x120x1

mdgoldberg commented 6 years ago

Same problem. @fchollet is there any way this could be fixed? I'm happy to help provide debug info and potentially contribute as needed.

Using fit_generator, with generator arguments for both training and validation data. Both train and validation generators read from the same HDF5 file (using pandas).

I've tried it with my full dataset (237K rows) and a sample subset of the full dataset (1000 rows), both with ~1K columns, and in both cases the segmentation fault happens right after the first epoch finishes. Like others, if I remove the validation data it doesn't occur. I'm using a train/test split of 85/15 and a batch size of 64 for both the full and sample datasets (so I'm only reading 64 rows from the HDF5 file at any given time, in the generator). Output from top confirms that I'm not running out of memory.

Versions: Keras 2.1.6 tensorflow 1.8.0

Unlike @dgorissen, I'm experiencing this issue on 2.0.8 as well as 2.1.6.

mdgoldberg commented 6 years ago

I believe I actually just figured out what was causing my particular issue. I'm not sure whether this applies to others, but in my generator I was using pd.read_hdf to read subsets of an HDF5 file into memory, and the problem is that read_hdf is not thread-safe, even for reading (the documentation is currently not clear about this).

I solved this problem by passing workers=0 to fit_generator, so that the generator is executed on the main thread.
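Concretely, the only change was the workers argument (the generator and step-count names here are placeholders):

# With workers=0 the generator runs on the main (calling) thread, so the
# non-thread-safe pd.read_hdf calls are never executed concurrently.
model.fit_generator(train_generator,
                    steps_per_epoch=train_steps,
                    epochs=10,
                    validation_data=val_generator,
                    validation_steps=val_steps,
                    workers=0)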

amarion35 commented 6 years ago

I had the same problem with the following code

import numpy as np
from keras.models import Sequential
from keras.layers import (Conv3D, Activation, MaxPooling3D, Dropout,
                          GlobalAveragePooling3D, Dense)
from keras.optimizers import Adam

def init_model():
    model = Sequential()
    model.add(Conv3D(4, kernel_size=(3, 3, 3), input_shape=(None, None, None, 1), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding='same'))
    model.add(Dropout(0.25))

    model.add(GlobalAveragePooling3D())
    model.add(Dense(32, activation='sigmoid'))
    model.add(Dropout(0.5))
    # note: softmax over a single unit always outputs 1.0
    model.add(Dense(1, activation='softmax'))
    model.compile(loss='mse', optimizer=Adam())

    print(model.summary())

    return model

def input_generator(metas):
    while True:
        meta_sample = metas.sample(frac=1)
        yield np.expand_dims(np.expand_dims(load_from_meta(meta_sample), axis=4), axis=0), [meta_sample.DMOS]

video_metadatas = get_datas().iloc[0]
model = init_model()
hist = model.fit_generator(generator=input_generator(video_metadatas), epochs=1,
                           steps_per_epoch=1, use_multiprocessing=False, workers=0)

load_from_meta() loads the videos using an ffmpeg wrapper.

I fixed the issue with workers=0

Edit:

Actually, it does not work every time:

$ for k in 1 2 3 4 5 6 7 8 9 10; do python3 3D_CNN.py; done
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Using TensorFlow backend.
Segmentation fault (core dumped)
Using TensorFlow backend.
Segmentation fault (core dumped)

gschramm commented 6 years ago

I experience the same problem (segmentation fault during the first epochs when using fit_generator).

The segmentation fault occurs when I run fit_generator on CPUs with a batch size of 40. It does not occur when I run the same example (see below) on a GPU (GTX 1080 Ti), or when running on CPU with a batch size of 10. I was able to reproduce the segmentation faults on two Linux machines.

4/10 [===========>..................] - ETA: 10:59 - loss: 0.4438Segmentation fault (core dumped)

Here is a small standalone script that produces the segmentation fault (when using batch_size = 40 and run on CPUs): https://gist.github.com/gschramm/e6db1f7333b50bca10c38243efec0925

Any idea what is going wrong?

I am running:

luncliff commented 5 years ago

Hi, I just ran into this symptom in my Docker environment with Keras 2.2.4 and TensorFlow 1.12 (GPU).

For me, the issue disappeared when I changed TensorFlow to 1.13-gpu-py3. I'm not sure whether it is solved completely, but I'm writing down my environment for future visitors...

Environment

Mon Jun 17 17:23:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   39C    P8    21W / 250W |    403MiB / 10989MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
| 37%   34C    P8     2W / 250W |      1MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1114      G   /usr/lib/xorg/Xorg                           224MiB |
|    0      2511      G   compiz                                        79MiB |
|    0     10650      G   ...-token=9E760AB97E59CC5C02D0AFC5D37FE54E    98MiB |
+-----------------------------------------------------------------------------+

Installation

FROM    tensorflow/tensorflow:1.13.1-gpu-py3 as ship
LABEL   maintainer="luncliff@gmail.com"

RUN     pip install -qqq --upgrade pip && pip install -qqq keras
RUN     pip install -qqq pillow
# ...
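For completeness, assuming nvidia-docker2 is installed, I build and run the image roughly like this (the tag, mount, and script name are arbitrary):

$ docker build -t keras-tf-gpu .
$ docker run --runtime=nvidia -it -v $PWD:/work keras-tf-gpu python /work/train.py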
mikechen66 commented 3 years ago

For larger datasets with Keras multithreading, users need to adopt a thread-safe generator to deal with the issue. There is a brief introduction by Anand Chitipothu, as well as an explanation of composed functions by Mathieu Larose. The thread-safe method has been adopted in the Faster R-CNN library by Ross Girshick (RBG) and Kaiming He. A sketch of the pattern is included after the links below.

threadsafe code: http://anandology.com/blog/using-iterators-and-generators/
composition of functions: https://mathieularose.com/function-composition-in-python/
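A minimal sketch of the lock-based wrapper those posts describe (the decorator name is only a convention, not part of the Keras API; the HDF file layout and 'label' column are placeholders):

import threading

import pandas as pd

class ThreadSafeIterator:
    """Wrap an iterator so that calls to next() are serialized by a lock."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)

def threadsafe_generator(f):
    """Decorator that makes a generator function safe to share between worker threads."""
    def wrapped(*args, **kwargs):
        return ThreadSafeIterator(f(*args, **kwargs))
    return wrapped

@threadsafe_generator
def hdf_batches(path, key, n_rows, batch_size):
    # The lock wraps next(), so the pd.read_hdf call in the generator body
    # is never executed by two threads at the same time.
    while True:
        for start in range(0, n_rows, batch_size):
            chunk = pd.read_hdf(path, key, start=start, stop=start + batch_size)
            yield chunk.drop('label', axis=1).values, chunk['label'].values

The wrapped generator can then be passed to fit_generator even with workers > 1.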