keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.
Other
1.02k stars 444 forks source link

iterator does not lock access for index_array while using _get_index__ in multiprocess env #271

Open oak-tree opened 4 years ago

oak-tree commented 4 years ago

Background

Keras itrator implements __get_index__ as following : https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/iterator.py#L53

`

def __getitem__(self, idx):
    if idx >= len(self):
        raise ValueError('Asked to retrieve element {idx}, '
                         'but the Sequence '
                         'has length {length}'.format(idx=idx,
                                                      length=len(self)))
    if self.seed is not None:
        np.random.seed(self.seed + self.total_batches_seen)
    self.total_batches_seen += 1
    if self.index_array is None:
        self._set_index_array()
    index_array = self.index_array[self.batch_size * idx:
                                   self.batch_size * (idx + 1)]
    return self._get_batches_of_transformed_samples(index_array)`

This method is being used for async access while using fit_generator with multi process workers. With async access self.index_array and self.total_batches_seen will be seen differently for each worker.

Regular use as generator

Note that this issue does not happen as the next implementation does provide a lock mechanism for multi-thread access for calculating the next batch indices.

Dref360 commented 4 years ago

self.total_batches_seen is process-independent isn't? If process 1 changes the value, it wouldn't affect process 2.

When using threads, I agree that a race-condition will/may occurs, but it is rare that users use threads.

if self.index_array is None

this statement is should never true when used in fit_generator.

oak-tree commented 4 years ago

@Dref360 Hey, thanks for answering.

total_batches_seen

What would happen if seed is not none? then every worker will get different kind of seed. Moreover, the total_batches_seen is meaningless, isnt? as every async run of get_index will see the same version of total_batches_seen i.e this will never get above 1.

index_array

I agree, my mistake, the workers should never get the chance to update/set index_array because of the fact that index_array is never None.

EDIT

so I think that my question changed to :why is it necessary to update the np.seed from get_index?

Dref360 commented 4 years ago

What would happen if seed is not none? then every worker will get different kind of seed.

Yeah we probably should use a multiprocess.Value there instead.

version of total_batches_seen i.e this will never get above 1.

Not really. In keras.utils.OrderedEnqueuer we spawn workers processes using a pool. Each worker has a queue of task and the pool resets every epoch. So for each worker, total_batches_seen would be the number of tasks done by this worker for this epoch.

Hope this helps!

oak-tree commented 4 years ago

@Dref360

Not really. In keras.utils.OrderedEnqueuer we spawn workers processes using a pool. Each worker has a queue of task and the pool resets every epoch. So for each worker, total_batches_seen would be the number of tasks done by this worker for this epoch.

Are you sure that the attributes are being changed? Because of the keras.utils.OrderedEnqueuer uses async from pool I think that any change on the instance attributes is not being kept for the next iteration. i.e., every async call is transferred to the state of the iterator instance to the worker in async manners. So changes on this call do not affect the original instance.

Dref360 commented 4 years ago

Sorry I wasn't clear enough. The original instance doesn't change, but the instance per worker will. As I said, probably not the right behavior.

oak-tree commented 4 years ago

am... important to note that each worker get new copy of the instance every async call. so it wont save the changed attributes