keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Multiprocessing using fit_generator(pickle_safe=True) fails #5071

Closed dofuuz closed 3 years ago

dofuuz commented 7 years ago

I'm trying to use fit_generator to separate the data loader from the trainer.

model.fit_generator(data_gen(), samples_per_epoch=10000, nb_epoch=1, pickle_safe=True, verbose=0)

Executing this code produces the following error:

Traceback (most recent call last):
  File "main_generator.py", line 138, in <module>
    model.fit_generator(data_gen(), samples_per_epoch=10000, nb_epoch=1, pickle_safe=True, verbose=0)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\keras\models.py", line 935, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\keras\engine\training.py", line 1470, in fit_generator
    pickle_safe=pickle_safe)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\site-packages\keras\engine\training.py", line 436, in generator_queue
    thread.start()
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\dev\WinPython-64bit-3.5.2.3Qt5\python-3.5.2.amd64\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'generator_queue.<locals>.data_generator_task'

I also tried running keras/tests/keras/test_multiprocessing.py, but it failed.

Here is the output for test_multiprocessing.py: test_multiprocessing.faillog.txt

Is this a bug in Keras itself? Are any fixes available?
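For context (editorial note, not from the original thread): the traceback ends in `Can't pickle local object` because on Windows, multiprocessing uses the spawn start method, which pickles the target function to send it to the child process, and nested (local) functions cannot be pickled. A minimal reproduction, mirroring the shape of the Keras 1.x code:

```python
import pickle

def generator_queue():
    # Nested worker function, analogous to data_generator_task inside
    # generator_queue() in keras/engine/training.py (Keras 1.x).
    def data_generator_task():
        pass
    return data_generator_task

try:
    pickle.dumps(generator_queue())
except (AttributeError, pickle.PicklingError) as err:
    # e.g. AttributeError: Can't pickle local object
    #   'generator_queue.<locals>.data_generator_task'
    print(err)
```

On Linux/macOS the default fork start method never pickles the function, which is why the same code runs fine there.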

eliafrigieri commented 7 years ago

Same problem. Any update?

isaacgerg commented 7 years ago

Is this the same as #5510? If so, I believe this is currently a Windows-specific error. What OS are you using?

isaacgerg commented 7 years ago

@eliafrigieri are you using windows?

eliafrigieri commented 7 years ago

I'm using: Windows 10, Python 2.7.12, Keras 1.0.8 with Theano.

isaacgerg commented 7 years ago

@eliafrigieri @DofuUZ I am starting to believe this is a Windows issue. I get the same issue with TensorFlow and Python 3.5. Good to know it also manifests with Python 2.7.

ciararogerson commented 7 years ago

I can confirm the same issue. Windows 10, Python 3.5, Keras with Theano.

isaacgerg commented 7 years ago

It's been 6 years since I've worked with Python multiprocessing, and that was Python 3.4. I'm happy to contribute my experience. Can anyone else offer assistance?

eliafrigieri commented 7 years ago

I've made my own solution that avoids this issue: basically, rewriting the function that creates, populates, and returns the queue and the stopping event (generator_queue in training.py).

ciararogerson commented 7 years ago

@eliafrigieri Any chance you could post it here? It would be much appreciated. I've been trying a load of things over the past couple of days, to no avail...

eliafrigieri commented 7 years ago

It depends, what is your task?

ciararogerson commented 7 years ago

I've got a large dataset and am trying to speed up training time on this task: https://www.kaggle.com/c/data-science-bowl-2017. The Windows issues with multiprocessing are proving quite painful. Edit: if what you've got is sensitive, then no worries; you don't need to post it. I was only asking on the off chance it was something inconsequential.

eliafrigieri commented 7 years ago

I have a large dataset too. I've added some functions that load "batch-size" images and put them into the queue. For example, if you have 1000 images, you can split them into 10 groups of 100 images each and launch 10 processes for parallel loading; then a single process gets from the queue and calls train_on_batch. I'm not posting the code, because it is too badly written and it is not the definitive version for my task (I will probably change the code every day for the next two weeks, just to increase the parallel loading speed).
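A rough sketch of the scheme described above (editorial note: all names here are hypothetical, not from the commenter's actual code). Each worker process loads one group of "images" and pushes the resulting batch onto a shared queue, while the main process consumes them. Defining the worker at module level keeps it picklable under Windows' spawn start method:

```python
import multiprocessing as mp

def load_group(paths, batch_queue):
    # Hypothetical loader: stands in for reading "batch-size" images
    # from disk; here each "image" is just a list of zeros.
    batch = [[0.0] * 8 for _ in paths]
    batch_queue.put(batch)

def parallel_load(all_paths, n_workers=10):
    batch_queue = mp.Queue()
    group_size = (len(all_paths) + n_workers - 1) // n_workers
    groups = [all_paths[i:i + group_size]
              for i in range(0, len(all_paths), group_size)]
    workers = [mp.Process(target=load_group, args=(g, batch_queue))
               for g in groups]
    for w in workers:
        w.start()
    # Drain the queue before joining, so no worker blocks on a full pipe.
    batches = [batch_queue.get() for _ in workers]
    for w in workers:
        w.join()
    return batches

if __name__ == "__main__":
    batches = parallel_load(["img_%d.png" % i for i in range(1000)],
                            n_workers=10)
    print(len(batches))  # prints 10: one batch per worker group
```

The training process would then pull batches off the queue and call train_on_batch itself, as described above.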

isaacgerg commented 7 years ago

@eliafrigieri Very cool! Are you able to share?

ciararogerson commented 7 years ago

I was trying all sorts of options, and I think I did something similar to you, @eliafrigieri. I was trying to create an external data_generator class that would use multiprocessing to populate a queue. The class was an iterator, so the queue would be accessed via next(). It's part of a larger code base, but I extracted an example here (attached: data_generator_sample_main.txt, data_generator_sample.txt).

The problem I was having was that each multiprocessing pool imported keras, so I was getting all sorts of CNMeM warnings and everything looked like it was overflowing.

Does anyone have any insights on this?

eliafrigieri commented 7 years ago

I'm in the same situation. Every process I create imports keras (I think because it's the keras process creating the child), but once all the processes are created and running, the loading speed increases a lot.

ciararogerson commented 7 years ago

In case it helps, I was able to get the sample running without importing keras in the child processes by moving the import keras statements inside the get_model() function in data_generator_sample, as that was the only place they were used. I'm not sure I'll be able to get it working like that in the full version of my project, but it may be an option for some people.
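The deferred-import pattern described above, sketched generically (editorial note: colorsys stands in for keras here so the sketch runs without keras installed; get_model() is the hypothetical name from the attached sample). The heavy module is only imported inside the function that needs it, so worker processes that import the enclosing module but never call this function never trigger the import (or, in keras' case, the GPU/CNMeM initialization):

```python
import sys

def get_model():
    # Deferred import: in the real project this line would be
    # `import keras` (plus layer imports); colorsys is a stand-in.
    import colorsys
    # ... build and return the model here ...
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)

# The stand-in module is loaded only once get_model() is called, so
# child processes that never call it never pay the import cost.
get_model()
assert "colorsys" in sys.modules
```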

eliafrigieri commented 7 years ago

I thought of the same solution, but it's not applicable in my case, so I didn't try it at all. The question is: does this issue appear only on Windows? Is it a bug in Keras?

isaacgerg commented 7 years ago

It appears to be a bug in Windows, but it could also be a poor assumption made in Keras with respect to multiprocessing that only manifests on Windows.

ciararogerson commented 7 years ago

My understanding is that when using multiprocessing on Windows, you can't reference local variables from the point at which you spawn the process; you need to pass all variables explicitly via the args input. I think it should be possible to adapt the Keras code by defining a multiprocessing version of data_generator_task() outside the scope of the generator and passing the generator / stop event / queue etc. into it. That way it could work as a standalone function and be spawned across multiple processes on any platform. This is probably the best option for Windows users to try.
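A sketch of that refactor (editorial note: hypothetical, not the actual Keras patch). The worker is defined at module level and takes the queue and stop event explicitly, so it can be pickled under Windows' spawn start method. One extra wrinkle: a real Python generator can't be pickled either, so on Windows the data source handed to the worker must itself be a picklable callable rather than a generator object:

```python
import multiprocessing as mp
import queue as queue_lib

def data_generator_task(batch_queue, stop_event, data_fn):
    # Module-level worker: picklable under spawn, unlike the nested
    # data_generator_task in keras/engine/training.py. All state is
    # passed in explicitly via args.
    while not stop_event.is_set():
        try:
            batch_queue.put(data_fn(), timeout=0.05)
        except queue_lib.Full:
            pass

def make_batch():
    # Hypothetical stand-in for "next batch from the data source".
    return [0.0] * 4

if __name__ == "__main__":
    batch_queue = mp.Queue(maxsize=10)
    stop_event = mp.Event()
    worker = mp.Process(target=data_generator_task,
                        args=(batch_queue, stop_event, make_batch))
    worker.start()
    batch = batch_queue.get(timeout=5)  # training loop consumes batches
    stop_event.set()
    worker.join(timeout=5)
    print(len(batch))  # prints 4
```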

isaacgerg commented 7 years ago

@ciararogerson This sounds reasonable and it looks like it is consistent with the example in #5510. Would you agree?

eliafrigieri commented 7 years ago

Today I tried the "multiprocessing.py" test on a Mac with the same configuration as mine: Python 2.7.13, Keras 1.0.8, and it works fine. So the problem is Windows-only; now we have the evidence.

isaacgerg commented 7 years ago

@eliafrigieri Great work!

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

isaacgerg commented 7 years ago

This is still an issue for python 3.5, windows 7.


zyavrik commented 7 years ago

Will anybody fix it eventually?

Harshini-Gadige commented 5 years ago

Looks like a PR has been created and merged for this. So can this be closed now?

prafulag commented 5 years ago

I have the same issue with predict_generator; however, fit_generator is working fine with multiprocessing.