I have the same issue. I've been trying to change batch sizes, but that doesn't seem to change anything.
I think there is a bug with ImageDataGenerator. If I load my images from h5py and use model.train_on_batch, I have no problems.
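In case it helps someone, here is a minimal sketch of that h5py + train_on_batch workaround; the file name, dataset names, and batch size are illustrative assumptions, and `model` is assumed to be an already-compiled Keras model:

import h5py
import numpy as np

batch_size = 32  # assumed batch size

# Iterate over the HDF5 file manually instead of using ImageDataGenerator / fit_generator.
with h5py.File('train.h5', 'r') as f:          # hypothetical file name
    images, labels = f['images'], f['labels']  # hypothetical dataset names
    n = images.shape[0]
    for epoch in range(10):
        for start in range(0, n, batch_size):
            x = np.asarray(images[start:start + batch_size], dtype=np.float32)
            y = np.asarray(labels[start:start + batch_size])
            loss = model.train_on_batch(x, y)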
Same issue here. fit_generator works fine in 2.0.9, but hangs indefinitely at the end of the first epoch from 2.1.0 onwards.
This is likely due to changes in keras/utils/data_utils.py between 2.0.9 and 2.1.0. Specifically this: https://github.com/fchollet/keras/commit/612f5307b962fb140106efcc50932c292630fda3#diff-ba9d38600a2df565e5ae8757eb2b1b35
@Dref360 please take a look, this seems like a serious issue.
@moustaki Are you also using flow_from_directory?
Could you all update to master / 2.1.2, please? Pretty sure this has been fixed with: https://github.com/fchollet/keras/commit/2f3edf96078d78450b985bdf3bfffe7e0c627169#diff-299cfd5886683a4b012f286403769fc1
@Dref360 Thanks - just tried both master and 2.1.2 and it indeed fixes the issue. Should have tried that before -- sorry about that! For your earlier question, I am using a custom Sequence sub-class.
I still have this problem with Keras 2.1.2 using tensorflow-gpu 1.4.1. Any advice on how to solve it?
@NikeNano - make sure that your validation_steps is reasonable. I had a similar problem, but it turns out I forgot to divide by batch_size.
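To make the fix concrete, a minimal sketch with illustrative variable names:

batch_size = 32
validation_steps = num_validation_samples // batch_size  # not num_validation_samples itself

model.fit_generator(train_generator,
                    steps_per_epoch=num_train_samples // batch_size,
                    epochs=10,
                    validation_data=validation_generator,
                    validation_steps=validation_steps)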
Same as @NikeNano, using Keras 2.1.2 and tensorflow-gpu 1.4.1; Keras freezes on epoch 11.
I have the same problem: it is stuck on the last batch of the first epoch. Keras version 2.1.3, TensorFlow version 1.4.0.
Epoch 1/30
C:\Users\Minal\AppData\Local\Programs\Python\Python36\lib\site-packages\skimage\transform_warps.py:84: UserWarning: The default mode, 'constant', will be changed to 'reflect' in skimage 0.15. warn("The default mode, 'constant', will be changed to 'reflect' in "
1/6428 [..............................] - ETA: 9:25:55 - loss: 0.0580
2/6428 [..............................] - ETA: 7:46:11 - loss: 0.0560
...
6426/6428 [============================>.] - ETA: 3s - loss: 0.0213
6427/6428 [============================>.] - ETA: 1s - loss: 0.0212
It's solved. It just took a very long time on the last batch, but then it moved on to epoch 2.
I also have the same issue, where the first epoch hangs on the last step. Using the latest Keras, GPU, Python 3.5, Windows 10.
If you are still having this problem, try rebooting. I don't know why, but that fixed my issue when I was running Keras on the cloud.
Hello! I am running into this issue still on Ubuntu running Python 3.5.2 and Keras 2.1.4. I've been waiting a few hours at the end of the first epoch on a very similar issue (Training a transfer binary classifier on VGG19).
At first I thought that it must have been just running through my validation data which was taking an exorbitant amount of time until I found this thread. Is it still a possibility that it is just a very slow iteration over my validation set (it's about 12,000 images, running on a GTX 950)? Or is my mental model of how fit_generator works mistaken?
Also, thanks to all who are maintaining this project! It's been great to work with as I'm beginning to dive deeper into ML. 😄
Update: I found I was using the Keras 1 API for the fit_generator method; after switching to the Keras 2 API it's working now.
@minaMagedNaeem: same as @oliran, I had the same issue and resolved it after setting validation_steps=validation_size//batch_size:
history_ft = model.fit_generator(
    generator_train,                 # training generator (customizable)
    steps_per_epoch=4170 // 64,      # nb_train_samples // batch_size (Keras 2 API; was samples_per_epoch=4170)
    epochs=10,                       # was nb_epoch=10
    # verbose=0,
    validation_data=generator_test,  # validation generator (customizable)
    validation_steps=530 // 64,      # nb_val_samples // batch_size
)
Same here. I have this problem with the code from Deep Learning with Python, Listing 6.37. I am on Ubuntu 18.04 with Keras 2.1.6 and tensorflow-gpu 1.8.0.
I had the same issue when running Inception V3 for transfer learning. Windows 10, Python 3.5, Keras 2.1.6, tensorflow-gpu 1.4.
Same here with Python 3, Keras v2.1.6, TensorFlow v1.8, Ubuntu 18.04. After multiple reinstallations and attempts, the solution was to wait several minutes for it to jump to epoch 2/25 after it was stuck on epoch 1 (7999/8000).
I had a similar issue with Python 3, Keras v2.1.6, TensorFlow v1.8.0, Ubuntu 16.04. I interrupted the process and saw that it was busy running self.sess.run([self.merged], feed_dict=feed_dict) in keras/callbacks.py.
I guessed that it was related to histogram computations in TensorBoard, so I set histogram_freq=0 on TensorBoard object creation. For me this solved the issue, at the cost of losing TensorBoard histograms.
I had previous versions of Keras and TensorFlow for which the histogram computation for TensorBoard did not take such a huge amount of time (unfortunately I do not recall which versions were OK).
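For reference, a minimal sketch of that workaround (the log directory and generator names are illustrative):

from keras.callbacks import TensorBoard

# histogram_freq=0 disables the per-epoch weight/activation histogram computation,
# which was the slow step observed above.
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0)

model.fit_generator(train_generator,
                    steps_per_epoch=steps_per_epoch,
                    epochs=epochs,
                    validation_data=validation_generator,
                    validation_steps=validation_steps,
                    callbacks=[tensorboard])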
Setting validation_steps=validation_size//batch_size worked for me.
Experiencing the same with Keras 2.2.0, TensorFlow 1.8 on Ubuntu 16.04. Getting stuck here.
Experiencing the same with Keras 2.2.0, TensorFlow 1.10 on Ubuntu 16.04.
Experiencing the same - stuck on the final batch for my CNN!
Same. For what it's worth, I think this is a CPU thing, because when I run my code on a 1080 it works fine.
Have the same issue. Stuck on first epoch, step 1999/2000. Using Windows, tensorflow-gpu 1.10.0, Keras 2.2.2, CUDA V9.0.176. Using ImageDataGenerator's flow_from_directory for training and validation.
I have way too much data - 50 million images, split 70% train and 30% val - so I thought it would have far too much validation data to run through every epoch. But if I set validation_steps in fit_generator to 1, shouldn't it only do one step of validation (one batch?) before moving on to the next epoch?
I'm new to this so I'm having a hard time debugging, but this is the profile after a few hours:
When sorted by time taken, the top two methods are get and wait in pool.py, and the other get is from Keras' data_utils.py.
Edit: I downgraded Keras to 2.0.9 and now it works. Edit: I actually still sometimes have this issue on 2.0.9. Can't seem to find out why it happens occasionally.
I had this issue with both CPU and GPU, Keras 2.2.0. What solved it for me was to set workers=0.
This worked for me: set workers=1 and use_multiprocessing=False in self.keras_model.fit_generator in model.py, and make sure that steps_per_epoch = number of train samples // batch_size and validation_steps = number of validation samples // batch_size.
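Outside of that particular model.py, the same workaround applied to a plain fit_generator call would look roughly like this sketch (variable names are illustrative):

model.fit_generator(train_generator,
                    steps_per_epoch=num_train_samples // batch_size,
                    epochs=epochs,
                    validation_data=validation_generator,
                    validation_steps=num_validation_samples // batch_size,
                    workers=1,                  # single worker
                    use_multiprocessing=False)  # avoid the multiprocessing hang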
Same problem for me, and setting multiprocessing to False isn't really a viable solution for me, as I do lots of thread-heavy pre-processing, and one of the main benefits of using keras.utils.Sequence() is to allow multiprocessing. Can anyone help, please? I'm on Keras 2.2.0.
Same here... However, it works when I remove validation_data=validation_generator.
I am hitting the same issue with Keras 2.2.4, TF 1.11 & cuDNN 7.3. I previously saw the same issue with Keras 2.1.3, TF 1.4 & cuDNN 7.0.3 and upgraded to the latest versions, but the issue persists.
strace shows 2 of the worker processes waiting for a read to complete:
$ sudo strace -p `pidof python | tr ' ' ','`
...
[pid 26167] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 25767] futex(0xcd14ad8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26168] read(49, <unfinished ...>
[pid 26169] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26170] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26171] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26172] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26173] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26174] futex(0x7fdbbcd7c000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26318] read(29, <unfinished ...>
[pid 26319] futex(0x7fdbbcec6000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid 26320] futex(0x7fdbbcec6000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff
The same issue is not seen if I use a pure generator instead of inheriting from keras.utils.Sequence. And it works if multiprocessing is removed.
Removing validation or multiprocessing is not a viable option for me, and I want to use Sequence-based data generation instead of a generator-based one to avoid duplicate batches in an epoch.
@fchollet, @Dref360, it looks like many others are also hitting this issue. Is this a known issue? Any workarounds to quickly unblock myself?
To confirm @vadapalliravikumar's experience: if I remove the generator's inheritance from keras.utils.Sequence or set use_multiprocessing=False, it works fine because it runs single-threaded. Therefore, it seems like a race condition when multiprocessing.
EDIT: I now believe my issue is due to HDF thread-safety problems.
I believe I have the same issue. CPU and GPU utilization both go to zero and nothing happens:
@rgreenblatt sounds as though you have done some investigation. Any suggestions on a viable workaround, e.g. is there any way of avoiding using HDF?
In case this isn't common knowledge (I certainly didn't know), pd.read_hdf isn't thread safe. See https://github.com/pandas-dev/pandas/issues/12236. I wasn't able to find a viable workaround.
I was using pandas to load data from HDF files. Originally I was experiencing consistent failures at the beginning of the second epoch. I wasn't able to resolve this using thread locks or by opening the HDF in a different way. Interestingly, opening the HDF like:
with pd.HDFStore(path_to_hdf_and_start_index[0], 'r') as store:
    df = pd.read_hdf(store,
                     "train_data",
                     mode='r',
                     start=path_to_hdf_and_start_index[1],
                     stop=path_to_hdf_and_start_index[1] + self.batch_size)
results in an error message, while directly reading the HDF using pd.read_hdf results in no error message (but still the failure). Specifically I get undefined symbol: _Unwind_Resume, version GCC_3.0.
I then switched to CSVs. This caused the failure to occur much later in training and produce no errors, but didn't solve the problem. I think that saving the model using a Callback might be part of the problem, but I have no evidence for this other than when the failures typically occur.
@rgreenblatt thanks for the update - I don't have enough knowledge of underlying formats to help in this case. Looks like we will need to wait for more in-depth assistance.
This happens because you are giving validation data to Keras, through a parameter in model.fit or model.fit_generator.
After each epoch, Keras takes the validation data and evaluates the model on it, which implies one forward pass for each validation data point. This can take a lot of time and make it seem that Keras is stuck, but it is necessary when training a model.
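Conceptually, what runs at the end of each epoch is roughly an extra evaluation pass over the validation generator, something like this sketch (not the actual Keras internals):

# Roughly what happens after the training steps of each epoch:
val_metrics = model.evaluate_generator(validation_generator, steps=validation_steps)
# A large validation set (or a too-large validation_steps) therefore makes the
# last batch of the epoch appear to hang while this pass runs.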
Setting validation_steps to the number of validation samples divided by batch_size solved it for me.
Upgrading to the latest version has fixed this for me
Updating the Keras function calls to the Keras 2 API resolved the issue for me.
I have updated Keras and I am still running into the same issues.
I agree that this is still very much an issue. However, depending on your setup, there may be a workaround. I think the problem is that the validation_steps parameter is being ignored by Keras. Keras is instead using the length returned by the generator to determine how many batches should be run per epoch for the validation set. Since I am using a custom generator, I simply changed the __len__ function to return the value I would have passed as validation_steps.
While this workaround works for me, Keras should definitely look into resolving this issue.
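A minimal sketch of that __len__ workaround for a custom Sequence (the attribute names and data layout are illustrative):

import numpy as np
from keras.utils import Sequence

class CappedValidationSequence(Sequence):
    def __init__(self, x, y, batch_size, max_steps):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.max_steps = max_steps  # the value you would have passed as validation_steps

    def __len__(self):
        # Report the capped step count instead of the full number of batches.
        full_len = int(np.ceil(len(self.x) / float(self.batch_size)))
        return min(full_len, self.max_steps)

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y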
This worked for me:
- set workers=1 and use_multiprocessing=False in self.keras_model.fit_generator in model.py
- Make sure that: steps_per_epoch = number of train samples // batch_size and validation_steps = number of validation samples // batch_size
This response helped me solve the issue, especially the changes to workers and use_multiprocessing.
Having the same issue, and it should be reproducible. I'm using Keras within the latest version of the TensorFlow image on nvidia-docker, with a GPU and Jupyter notebooks. I'm trying to reproduce the IMDB example in Section 6.3.4 of Chollet's "Deep Learning with Python". The model goes quickly to 499/500 in the first epoch, then hangs there for 6 minutes before it completes. It eventually finishes all epochs, but takes two hours (6 minutes per epoch, 20 epochs) to do so.
Second Edit: I was able to solve this problem for my use case. My data was coming from numpy memory maps on disk, from a 90 GB file. I think that, when the workers were started, it was somehow trying to pickle the data from disk, which took a very long time and failed after 10-15 minutes with an error about pickling data that was too big. I still don't understand why it would go through one epoch and fail at the start of the second one, however.
I had to update my Sequence class to not store the numpy arrays, but just the file names for my data, and I create a memmap in the __getitem__ method just for that batch and then clean it up. Once I made this change, everything has been working smoothly and I have been able to train as expected with use_multiprocessing=True and 12 workers.
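For anyone hitting the same pickling issue, a minimal sketch of that change: store only the file path in the Sequence and open the memmap per batch inside __getitem__ (the dtype, shapes, and file layout are assumptions):

import numpy as np
from keras.utils import Sequence

class MemmapSequence(Sequence):
    def __init__(self, data_path, labels, batch_size, sample_shape):
        self.data_path = data_path        # just a path string (cheap to pickle to workers)
        self.labels = labels
        self.batch_size = batch_size
        self.sample_shape = sample_shape  # e.g. (224, 224, 3)

    def __len__(self):
        return len(self.labels) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        # Open the memmap for just this batch and copy the slice, so the huge
        # on-disk array itself is never pickled to the worker processes.
        data = np.memmap(self.data_path, dtype=np.float32, mode='r',
                         shape=(len(self.labels),) + self.sample_shape)
        batch_x = np.array(data[start:start + self.batch_size])
        batch_y = self.labels[start:start + self.batch_size]
        del data
        return batch_x, batch_y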
Original post:
My issue was similar but a little different, using fit_generator with a Sequence class for data generation. My training would finish the entire first epoch and freeze at the start of the second epoch. Downgrading to Keras 2.2.0 solved the problem for me, which is really more of a workaround.
Training with the same model on the same hardware (4x 1080 Ti) worked just fine with other training data in the past. The only major difference from when this worked is that I was previously storing all of my training data in memory; when I switched to reading the data from disk as batches are loaded, I encountered the error.
Edit: Downgrading did not solve my problem. After a few hours of training I still get freezing at the start of epochs. I ended up switching back to Keras 2.2.4 and added these two lines of code to finally get an error message:
import multiprocessing as mp
mp.set_start_method('spawn', force=True)  # use 'spawn' instead of the default 'fork' so the hang surfaces as an exception
and the error message which appeared after sitting for maybe 10 minutes:
File "/home/zach/anaconda3/envs/keras/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/zach/anaconda3/envs/keras/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/data_utils.py", line 565, in _run
with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
File "/home/zach/anaconda3/envs/keras/lib/python3.6/site-packages/keras/utils/data_utils.py", line 548, in <lambda>
initargs=(seqs,))
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/context.py", line 119, in Pool
context=self.get_context())
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/pool.py", line 174, in __init__
self._repopulate_pool()
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/pool.py", line 239, in _repopulate_pool
w.start()
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/home/zach/anaconda3/envs/keras/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
It looks like what is happening is that my binary data on disk (90 GB), which was a numpy memmap, is being pickled to the child processes, which is not going to work.
I'm still curious why it works for the first epoch and fails on the second.
On Keras 2.2.4 I noticed that if I remove the validation_data generator argument from the fit_generator() call, it does get past this point. I haven't investigated yet whether it is a bug on my side or not. Hope this helps.
What I have done is adjust the fit_generator function so that it takes an additional parameter, use_validation_multiprocessing, and checks for it here and here, instead of using the global use_multiprocessing boolean. This has resolved the problem for me, but it's very hacky.
Additionally, I am not sure if there should be a check for use_multiprocessing here, and here? Only checking the worker count to instantiate Enqueuers seems incorrect if you have use_multiprocessing=False?
I encountered this problem using the fit function. I believe I fixed it by setting batch_size=2 and using Adam instead of SGD as my optimizer. I think it may be a memory issue, and the machine was coping by using swap memory, which is notoriously slow.
I confirm the validation generator was the problem. The problem was gone after I turned it off. But if the validation set is big, I still need that method. I would appreciate it if the Keras team could help with this!
I'm using Keras 2.1.1 and Tensorflow 1.4, Python 3.6, Windows 7.
I'm attempting transfer learning using the Inception model. The code is straight from the Keras Applications API documentation, with just a few tweaks (using my own data).
Here is the code