keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Keras freezing on last batch of first epoch (can't move to second epoch) #8595

Closed. Moondra closed this issue 3 years ago.

Moondra commented 6 years ago

I'm using Keras 2.1.1 and Tensorflow 1.4, Python 3.6, Windows 7.

I'm attempting transfer learning using the Inception model. The code is taken straight from the Keras Applications documentation, with just a few tweaks (using my own data).

Here is the code:


```python
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.inception_v3 import InceptionV3  # needed for the base model below
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
from keras import optimizers

img_width, img_height = 299, 299
train_data_dir = r'C:\Users\Moondra\Desktop\Keras Applications\data\train'
total_samples = 13581
batch_size = 3
epochs = 5

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    horizontal_flip=True,
    zoom_range=0.1,
    rotation_range=15)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')

# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- this dataset has 12 classes
predictions = Dense(12, activation='softmax')(x)

# this is the model we will train
model = Model(input=base_model.input, output=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer=optimizers.SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# train the model on the new data for a few epochs
model.fit_generator(
    train_generator,
    steps_per_epoch=20,
    epochs=epochs)

# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 249 layers and unfreeze the rest:
for layer in model.layers[:249]:
    layer.trainable = False
for layer in model.layers[249:]:
    layer.trainable = True

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(
    train_generator,
    steps_per_epoch=25,
    epochs=epochs)
```

The output is:

Found 13581 images belonging to 12 classes.

Warning (from warnings module):
  File "C:\Users\Moondra\Desktop\Keras Applications\keras_transfer_learning_inception_problem_one_epoch.py", line 44
    model = Model(input=base_model.input, output=predictions)
UserWarning: Update your `Model` call to the Keras 2 API: `Model(inputs=Tensor("in..., outputs=Tensor("de...)`
Epoch 1/5

 1/20 [>.............................] - ETA: 38s - loss: 2.8652 - acc: 0.0000e+00
 3/20 [===>..........................] - ETA: 12s - loss: 2.6107 - acc: 0.1111    
 4/20 [=====>........................] - ETA: 8s - loss: 2.6454 - acc: 0.0833 
 5/20 [======>.......................] - ETA: 6s - loss: 2.6483 - acc: 0.0667
 6/20 [========>.....................] - ETA: 5s - loss: 2.6863 - acc: 0.0556
 7/20 [=========>....................] - ETA: 4s - loss: 2.6230 - acc: 0.0952
 8/20 [===========>..................] - ETA: 3s - loss: 2.6212 - acc: 0.0833
 9/20 [============>.................] - ETA: 3s - loss: 2.6192 - acc: 0.1111
10/20 [==============>...............] - ETA: 2s - loss: 2.6223 - acc: 0.1000
11/20 [===============>..............] - ETA: 2s - loss: 2.6626 - acc: 0.0909
12/20 [=================>............] - ETA: 2s - loss: 2.6562 - acc: 0.1111
13/20 [==================>...........] - ETA: 1s - loss: 2.6436 - acc: 0.1282
14/20 [====================>.........] - ETA: 1s - loss: 2.6319 - acc: 0.1190
15/20 [=====================>........] - ETA: 1s - loss: 2.6343 - acc: 0.1111
Warning (from warnings module):
  File "C:\Users\Moondra\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\callbacks.py", line 116
    % delta_t_median)
UserWarning: Method on_batch_end() is slow compared to the batch update (0.102000). Check your callbacks.

16/20 [=======================>......] - ETA: 0s - loss: 2.6310 - acc: 0.1042
17/20 [========================>.....] - ETA: 0s - loss: 2.6207 - acc: 0.1176
18/20 [==========================>...] - ETA: 0s - loss: 2.6063 - acc: 0.1296
19/20 [===========================>..] - ETA: 0s - loss: 2.6056 - acc: 0.1228

It just hangs at 19/20.

I already asked on Stack Overflow but got no help:

https://stackoverflow.com/questions/47382952/cant-get-past-first-epoch-just-hangs-keras-transfer-learning-inception
whatisAI commented 6 years ago

I have the same issue. I've been trying to change batch sizes, but that doesn't seem to change anything.

moondra2017 commented 6 years ago

I think there is a bug with ImageDataGenerator. If I load my images from h5py and use model.train_on_batch, I have no problems.
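
For reference, a minimal sketch of that kind of loop, assuming an HDF5 file with images and labels datasets and the compiled model from the script above (the file name, dataset names, and batch size are illustrative, not from the original comment):

```python
import h5py
import numpy as np

batch_size = 32  # assumed value
with h5py.File('train_data.h5', 'r') as f:       # hypothetical file name
    images, labels = f['images'], f['labels']    # hypothetical dataset names; labels assumed one-hot
    n = images.shape[0]
    for epoch in range(epochs):
        for start in range(0, n, batch_size):
            x = np.asarray(images[start:start + batch_size], dtype='float32') / 255.0
            y = np.asarray(labels[start:start + batch_size])
            # returns [loss, acc] because the model was compiled with metrics=['accuracy']
            loss, acc = model.train_on_batch(x, y)
```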

moustaki commented 6 years ago

Same issue here. fit_generator works fine in 2.0.9, but hangs indefinitely at the end of the first epoch from 2.1.0 onwards.

fchollet commented 6 years ago

This is likely due to changes in keras/utils/data_utils.py between 2.0.9 and 2.1.0. Specifically this: https://github.com/fchollet/keras/commit/612f5307b962fb140106efcc50932c292630fda3#diff-ba9d38600a2df565e5ae8757eb2b1b35

@Dref360 please take a look, this seems like a serious issue.

Dref360 commented 6 years ago

@moustaki Are you also using flow_from_directory?

Dref360 commented 6 years ago

Could you all update to master / 2.1.2 please? Pretty sure this has been fixed by: https://github.com/fchollet/keras/commit/2f3edf96078d78450b985bdf3bfffe7e0c627169#diff-299cfd5886683a4b012f286403769fc1

moustaki commented 6 years ago

@Dref360 Thanks - just tried both master and 2.1.2 and it indeed fixes the issue. Should have tried that before -- sorry about that! For your earlier question, I am using a custom Sequence sub-class.

NikeNano commented 6 years ago

I still have this problem with Keras 2.1.2 and tensorflow-gpu 1.4.1. Any advice on how to solve it?

oliran commented 6 years ago

@NikeNano - make sure that your validation_steps is reasonable. I had a similar problem, but it turned out I had forgotten to divide by batch_size.

LivingProgram commented 6 years ago

Same as @NikeNano: using Keras 2.1.2 and tensorflow-gpu 1.4.1, and Keras freezes on epoch 11.

minaMagedNaeem commented 6 years ago

I have the same problem; it is stuck on the last batch of the first epoch. Keras version 2.1.3, TensorFlow version 1.4.0.

Epoch 1/30
C:\Users\Minal\AppData\Local\Programs\Python\Python36\lib\site-packages\skimage\transform_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15. warn("The default mode, 'constant', will be changed to 'reflect' in "

   1/6428 [..............................] - ETA: 9:25:55 - loss: 0.0580
   2/6428 [..............................] - ETA: 7:46:11 - loss: 0.0560
   3/6428 [..............................] - ETA: 7:14:06 - loss: 0.0569
   4/6428 [..............................] - ETA: 6:52:54 - loss: 0.0536
   5/6428 [..............................] - ETA: 6:49:36 - loss: 0.0541
   6/6428 [..............................] - ETA: 6:51:51 - loss: 0.0556
   7/6428 [..............................] - ETA: 6:45:15 - loss: 0.0580
   8/6428 [..............................] - ETA: 6:33:50 - loss: 0.0595
   9/6428 [..............................] - ETA: 6:20:48 - loss: 0.0594
  10/6428 [..............................] - ETA: 6:12:55 - loss: 0.0604
  11/6428 [..............................] - ETA: 6:07:12 - loss: 0.0596
  12/6428 [..............................] - ETA: 6:00:31 - loss: 0.0588
  13/6428 [..............................] - ETA: 6:00:06 - loss: 0.0589
  14/6428 [..............................] - ETA: 5:59:53 - loss: 0.0591
  15/6428 [..............................] - ETA: 5:57:44 - loss: 0.0590
  16/6428 [..............................] - ETA: 5:55:21 - loss: 0.0601
  ...
6420/6428 [============================>.] - ETA: 14s - loss: 0.0213
6421/6428 [============================>.] - ETA: 12s - loss: 0.0213
6422/6428 [============================>.] - ETA: 10s - loss: 0.0213
6423/6428 [============================>.] - ETA: 8s - loss: 0.0213
6424/6428 [============================>.] - ETA: 7s - loss: 0.0213
6425/6428 [============================>.] - ETA: 5s - loss: 0.0213
6426/6428 [============================>.] - ETA: 3s - loss: 0.0213
6427/6428 [============================>.] - ETA: 1s - loss: 0.0212

minaMagedNaeem commented 6 years ago

It's solved. It just took a very long time on the last batch, but then it got to epoch 2.

KenHollandWHY commented 6 years ago

I also have the same issue, where the first epoch hangs on the last step. Using the latest Keras, a GPU, Python 3.5, Windows 10.

LivingProgram commented 6 years ago

If you are still having this problem, try rebooting. I don't know why, but that fixed my issue; I was running Keras on the cloud.

JackCurrie commented 6 years ago

Hello! I am still running into this issue on Ubuntu, with Python 3.5.2 and Keras 2.1.4. I've been waiting a few hours at the end of the first epoch on a very similar setup (training a transfer binary classifier on VGG19).

At first I thought that it must have been just running through my validation data which was taking an exorbitant amount of time until I found this thread. Is it still a possibility that it is just a very slow iteration over my validation set (it's about 12,000 images, running on a GTX 950)? Or is my mental model of how fit_generator works mistaken?

Also, thanks to all who are maintaining this project! It's been great to work with as I'm beginning to dive deeper into ML. 😄

Update: I found I was using the Keras 1 API for the fit_generator method; I switched to the Keras 2 API and it's working now.

kaka7 commented 6 years ago

@minaMagedNaeem: same as @oliran, I had the same issue and resolved it by setting validation_steps=validation_size//batch_size:

history_ft = model.fit_generator(
    generator_train,                 # customizable
    samples_per_epoch=4170,          # nb_train_samples
    steps_per_epoch=10,              # samples traversed per epoch
    validation_data=generator_test,  # customizable
    nb_epoch=10,
    # verbose=0,
    validation_steps=530 // 64,
    # epochs=100
    # nb_val_samples=530
)

ptah23 commented 6 years ago

Same here. I have this problem with the code from Deep Learning with Python, Listing 6.37. I am on Ubuntu 18.04 with Keras 2.1.6 and tensorflow-gpu 1.8.0.

Tensorfengsheng1926 commented 6 years ago

I had the same issue when running Inception V3 for transfer learning. Windows 10, Python 3.5, Keras 2.1.6, TensorFlow 1.4 (GPU).

hashJoe commented 6 years ago

Same here with Python 3, Keras 2.1.6, TensorFlow 1.8, Ubuntu 18.04. After multiple reinstallations and attempts, the solution was to wait several minutes for it to jump to epoch 2/25 after it was stuck at the end of epoch 1 (7999/8000) xD

ldelphinpoulat commented 6 years ago

I had a similar issue with Python 3, Keras 2.1.6, TensorFlow 1.8.0, Ubuntu 16.04. I interrupted the process and saw that it was busy running self.sess.run([self.merged], feed_dict=feed_dict) in keras/callbacks.py. I guessed that this was related to the histogram computation for TensorBoard, so I set histogram_freq=0 when creating the TensorBoard callback. For me this solved the issue, at the cost of losing the TensorBoard histograms. With previous versions of Keras and TensorFlow the histogram computation for TensorBoard did not take such a huge amount of time (unfortunately I do not recall which versions those were).
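
A minimal sketch of that workaround (the log directory and the rest of the fit call are illustrative, not from the original comment):

```python
from keras.callbacks import TensorBoard

# histogram_freq=0 disables histogram computation, which otherwise runs over the
# validation data at the end of each epoch and can look like a hang.
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0)  # './logs' is a placeholder path

model.fit_generator(
    train_generator,
    steps_per_epoch=100,            # assumed value
    epochs=5,                       # assumed value
    validation_data=val_generator,  # assumed generator
    validation_steps=10,            # assumed value
    callbacks=[tensorboard])
```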

shaktisd commented 6 years ago

Changing validation_steps=validation_size//batch_size worked for me

whatdhack commented 6 years ago

Experiencing the same with Keras 2.2.0 Tensorflow 1.8 on Ubuntu 16.04 .

bmitrauncc commented 6 years ago
[screenshot of training output]

Getting stuck here

yangjh39 commented 6 years ago

Experiencing the same with Keras 2.2.0 Tensorflow 1.10 on Ubuntu 16.04 .

kjaisingh commented 6 years ago

Experiencing the same - stuck on the final batch for my CNN!

ejcer commented 6 years ago

Same. For what it's worth, I think this is a CPU thing, because when I run my code on a 1080 it works fine.

dantheman3333 commented 6 years ago

Have the same issue. Stuck on the first epoch at step 1999/2000. Using Windows, tensorflow-gpu 1.10.0, Keras 2.2.2, CUDA V9.0.176. Using ImageDataGenerator's flow_from_directory for training and validation.

I have way too much data - 50 million images, split 70% train / 30% validation - so I thought it had way too much validation data to run through after every epoch. But if I set validation_steps in fit_generator to 1, shouldn't it only do one step of validation (one batch?) before moving on to the next epoch?

I'm new to this, so I'm having a hard time debugging, but this is the profile after a few hours: [profiler screenshot sorted by call_count and time]

When sorted by time taken, the top two methods are get and wait in pool.py, and the other get is from Keras' data_utils.py.

Edit: I downgraded Keras to 2.0.9 and now it works. Edit: I actually still sometimes have this issue on 2.0.9; I can't figure out why it happens occasionally.

MinnML commented 6 years ago

I had this issue with both CPU and GPU, keras 2.2.0. What solved it for me was to set workers=0.

ashuta03 commented 6 years ago

This worked for me:

  1. Set workers=1 and use_multiprocessing=False in self.keras_model.fit_generator in model.py.
  2. Make sure that steps_per_epoch = number of train samples // batch_size and validation_steps = number of validation samples // batch_size (a minimal sketch follows below).
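
A minimal sketch of that configuration (the generators, sample counts, and batch size are illustrative, not from the original comment):

```python
batch_size = 32               # assumed value
n_train, n_val = 10000, 2000  # assumed sample counts

model.fit_generator(
    train_generator,
    steps_per_epoch=n_train // batch_size,
    epochs=5,
    validation_data=val_generator,
    validation_steps=n_val // batch_size,
    workers=1,                    # single worker
    use_multiprocessing=False)    # avoid the multiprocessing hang
```
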
Stevod commented 6 years ago

Same problem for me, and setting multiprocessing to false isn't really a viable solution, as I do lots of thread-heavy pre-processing, and one of the main benefits of using keras.utils.Sequence() is to allow multiprocessing. Can anyone help please? I'm on Keras 2.2.0.

jdroenner commented 6 years ago

Same here... However, it works when I remove validation_data=validation_generator.

vadapalliravikumar commented 6 years ago

I am hitting the same issue with Keras 2.2.4, TF 1.11 and cuDNN 7.3. I previously saw the same issue with Keras 2.1.3, TF 1.4 and cuDNN 7.0.3 and upgraded to the latest versions, but the issue persists.

strace shows two of the worker processes waiting for a read to complete:

$sudo strace -p `pidof python | tr ' ' ','`
...
[pid 26167] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 25767] futex(0xcd14ad8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26168] read(49,  <unfinished ...>
[pid 26169] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26170] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26171] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26172] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26173] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26174] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26318] read(29,  <unfinished ...>
[pid 26319] futex(0x7fdbbcec6000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26320] futex(0x7fdbbcec6000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff

The same issue is not seen if I use a plain generator instead of inheriting from keras.utils.Sequence, and it works if multiprocessing is removed.

Removing validation or multiprocessing is not a viable option for me, and I want to use Sequence-based data generation instead of a generator-based one to avoid duplicate batches within an epoch.

@fchollet, @Dref360, it looks like many others are also hitting this issue. Is this a known issue? Are there any workarounds to quickly unblock myself?

Stevod commented 6 years ago

To confirm @vadapalliravikumar's experience: if I remove the generator's inheritance from keras.utils.Sequence or set use_multiprocessing=False, it works fine because it runs single-threaded. Therefore, it seems like a race condition when multiprocessing is enabled.

rgreenblatt commented 5 years ago

EDIT: I now believe my issue is due to hdf thread safety problems.

I believe I have the same issue. CPU and GPU utilization both go to zero and nothing happens:

[screenshot: CPU and GPU utilization at zero]

Stevod commented 5 years ago

@rgreenblatt it sounds as though you have done some investigation. Any suggestions on a viable workaround, e.g. is there any way of avoiding HDF?

rgreenblatt commented 5 years ago

In case this isn't common knowledge (I certainly didn't know), pd.read_hdf isn't thread-safe. See https://github.com/pandas-dev/pandas/issues/12236. I wasn't able to find a viable workaround.

I was using pandas to load data from HDF files. Originally I was experiencing consistent failures at the beginning of the second epoch. I wasn't able to resolve this using thread locks or by opening the HDF file in a different way. Interestingly, opening it like this:

with pd.HDFStore(path_to_hdf_and_start_index[0], 'r') as store:
    df = pd.read_hdf(store,
                     "train_data",
                     mode='r',
                     start=path_to_hdf_and_start_index[1],
                     stop=path_to_hdf_and_start_index[1] + self.batch_size)

results in an error message, while directly reading the file with pd.read_hdf produces no error message (but still fails). Specifically, I get undefined symbol: _Unwind_Resume, version GCC_3.0.

I then switched to CSVs. This caused the failure to occur much later in training and produce no errors, but it didn't solve the problem. I think that saving the model with a Callback might be part of the problem, but I have no evidence for this other than when the failures typically occur.

Stevod commented 5 years ago

@rgreenblatt thanks for the update - I don't have enough knowledge of underlying formats to help in this case. Looks like we will need to wait for more in-depth assistance.

setsometso commented 5 years ago

This happens because you are giving validation data to Keras, through a parameter in model.fit or model.fit_generator.

After each epoch, Keras takes the validation data and evaluates the model on it, which implies one forward pass for each validation data point. This can take a lot of time and make it seem that Keras is stuck, but it is necessary when training a model.

Barthold-Albrecht commented 5 years ago

Setting validation_steps to the number of validation samples divided by batch_size solved it for me.

Stevod commented 5 years ago

Upgrading to the latest version has fixed this for me

vikitripathi commented 5 years ago

Updating the Keras calls to the Keras 2 API resolved the issue for me.

jntorres commented 5 years ago

I have updated Keras and I am still running into the same issue.

macmatt22 commented 5 years ago

I agree that this is still very much an issue. However, depending on your setup there may be a workaround. I think the problem is that the validation_steps parameter is being ignored by Keras; Keras instead uses the length returned by the generator to determine how many batches should be run per epoch for the validation set. Since I am using a custom generator, I simply changed the __len__ function to return the value I would have passed as validation_steps.

While this workaround works for me, Keras should definitely look into resolving this issue.
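
A minimal sketch of that workaround, assuming a keras.utils.Sequence-based validation generator (the class name, attributes, and in-memory arrays are illustrative, not from the original comment):

```python
import numpy as np
from keras.utils import Sequence

class ValidationSequence(Sequence):
    """Validation generator whose reported length caps the number of batches run."""

    def __init__(self, x, y, batch_size, max_batches):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.max_batches = max_batches  # the value you would have passed as validation_steps

    def __len__(self):
        # Keras uses this length to decide how many validation batches to run per epoch.
        full_len = int(np.ceil(len(self.x) / self.batch_size))
        return min(full_len, self.max_batches)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]
```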

aminzabardast commented 5 years ago

This worked for me:

  1. set workers=1, and use_multiprocessing=False in self.keras_model.fit_generator in model.py
  2. Make sure that: steps_per_epoch = number of train samples//batch_size and validation_steps = number of validation samples//batch_size

This response helped me solve the issue, especially the changes to workers and use_multiprocessing.

moodymq commented 5 years ago

Having the same issue, and it should be reproducible. I'm using Keras within the latest version of the TensorFlow image on nvidia-docker, with a GPU and Jupyter notebooks. I'm trying to reproduce the IMDB example in Section 6.3.4 of Chollet's "Deep Learning with Python." The model goes quickly to 499/500 in the first epoch, then hangs there for 6 minutes before it completes. It eventually finishes all epochs, but takes two hours (6 minutes per epoch, 20 epochs) to do so.

znorman-harris commented 5 years ago

Second Edit: I was able to solve this problem for my use case. My data was coming from numpy memory maps on disk, backed by a 90 GB file. I think that when the workers were started, it was somehow trying to pickle the data from disk, which took a very long time and failed after 10-15 minutes with an error about pickling data that was too big. I still don't understand why it would get through one epoch and fail at the start of the second one, however.

I had to update my Sequence class to store not the numpy arrays but just the file names for my data, and to create a memmap in the __getitem__ method just for that batch and then clean it up. Once I made this change, everything has been working smoothly and I have been able to train as expected with use_multiprocessing=True and 12 workers.
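
A minimal sketch of that approach, assuming one flat binary file of float32 data per sample (the class name, file layout, dtype, and shape are illustrative, not from the original comment):

```python
import numpy as np
from keras.utils import Sequence

class MemmapSequence(Sequence):
    """Stores only file paths; opens a memmap per batch so no large array is pickled to workers."""

    def __init__(self, file_paths, labels, batch_size, sample_shape=(299, 299, 3)):
        self.file_paths = file_paths  # list of paths: cheap to pickle to worker processes
        self.labels = labels
        self.batch_size = batch_size
        self.sample_shape = sample_shape

    def __len__(self):
        return int(np.ceil(len(self.file_paths) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        batch_x = []
        for path in self.file_paths[sl]:
            # Open the memmap only for this batch, copy the data out, then drop the handle.
            mm = np.memmap(path, dtype='float32', mode='r', shape=self.sample_shape)
            batch_x.append(np.array(mm))
            del mm
        return np.stack(batch_x), np.asarray(self.labels[sl])
```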


Original post:

My issue was similar, but a little different: I was using fit_generator with a Sequence class for data generation. My training would finish the entire first epoch and freeze at the start of the second epoch. Downgrading to Keras 2.2.0 solved the problem for me, which is really more of a workaround.

Training the same model on the same hardware (4x 1080 Ti) worked just fine with other training data in the past. The only major difference from when this worked is that I used to store all of my training data in memory; when I switched to reading the data from disk as batches are loaded, that is when I encountered the error.

Edit: Downgrading did not solve my problem. After a few hours of training I still get freezing on the start of epochs. I ended up switching back to keras 2.2.4 and I added these two lines of code to finally get an error message:

import multiprocessing as mp
mp.set_start_method('spawn', force=True)

and the error message which appeared after sitting for maybe 10 minutes:

  File "/home/zach/anaconda3/envs/keras/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/data_utils.py", line 565, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/data_utils.py", line 548, in <lambda>
    initargs=(seqs,))
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/context.py", line 119, in Pool
    context=self.get_context())
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
    self._repopulate_pool()
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
    w.start()
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB

It looks like what is happening is that my binary data on disk (90 GB), which was a numpy memmap, is being pickled to the child processes, which is not going to work.

I'm still curious why it works for the first epoch and fails on the second.

RazvanPasca commented 5 years ago

On Keras 2.2.4 I noticed that if I remove the validation_data generator argument from the fit_generator() call, it does get past the first epoch. I haven't investigated yet whether it is a bug on my side or not. Hope this helps.

FranzHahn commented 5 years ago

What I have done is adjust the fit_generator function so that it takes an additional parameter, use_validation_multiprocessing, and checks for it here and here, instead of using the global use_multiprocessing boolean. This has resolved the problem for me, but it's very hacky.

Additionally, I am not sure whether there should be a check for use_multiprocessing here and here? Only checking the worker count to instantiate the enqueuers seems incorrect if you have use_multiprocessing = False.

Quetzalcohuatl commented 5 years ago

I encountered this problem using the fit function. I believe I fixed it by setting batch_size=2 and using Adam instead of SGD as my optimizer. I think it may be a memory issue: the machine was coping by using swap memory, which is notoriously slow.

mnguyenmti commented 5 years ago

I confirm the valid_generator was the problem. The problem was gone after I turned it off. But if the validation set is big, I still need it. I would appreciate it if the Keras team could help with this!