keras-team / keras-preprocessing

Utilities for working with image data, text data, and sequence data.

Calculating mean for flow_from_directory #142

Open colt18 opened 5 years ago

colt18 commented 5 years ago

Just as explained in https://github.com/keras-team/keras/issues/5532, we need some standardization support for flow_from_directory. Currently we can use datagen.fit() to calculate the std, mean, and principal components of a dataset held as a NumPy array. What we need is a directory iterator that can calculate these parameters across the whole training directory structure.

rragundez commented 5 years ago

What you mean is that this is not enough (https://github.com/keras-team/keras-preprocessing/blob/1d4c601071b0bc83042140219bb23455753938dd/keras_preprocessing/image/image_data_generator.py#L657) because it is done per batch, and you are talking about calculating the statistics across the complete dataset? If so, I do not see the use case for it within what the Iterators are for. Could you perhaps tell me what the use case is?

colt18 commented 5 years ago

Take the MNIST dataset as an example. All 60k training samples and 10k validation samples are held in a NumPy array. When we apply feature-wise normalization, we set the input mean to 0 over the dataset by calculating the mean of the entire dataset and subtracting it from each pixel (correct me if I'm wrong). But doing this per batch may introduce inconsistency: since the dataset is not homogeneously distributed, some batches get shifted by a higher mean value (the actual mean of that batch) and some by a lower one. Take this array for instance:

A = [1, 5, 4, 2, 1, 4, 6, 9, 0, 3, 1, 6, 3, 5, 8, 9, 0, 9, 5, 3, 6, 7, 7, 2, 1, 7, 3, 8, 5, 3, 5, 8]
mean(A) = 4.56

Dividing A into 4 equal parts, denoted A(1) to A(4), the means are:

mean(A(1)) = 4
mean(A(2)) = 4.37
mean(A(3)) = 4.87
mean(A(4)) = 5

The values hover around 4.56. I'm not sure whether this hampers performance, since it hinders centering the data; it may result in slower learning curves. Maybe I really should try writing the MNIST data to disk and compare flow() vs flow_from_directory() to better support my case. I may even write a paper if the results turn out interpretable enough.
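The numeric example above can be reproduced with a few lines of NumPy; this is just an illustration of the per-batch vs whole-dataset mean discrepancy, not keras-preprocessing code:

```python
import numpy as np

# The example array from the comment above.
A = np.array([1, 5, 4, 2, 1, 4, 6, 9, 0, 3, 1, 6, 3, 5, 8, 9,
              0, 9, 5, 3, 6, 7, 7, 2, 1, 7, 3, 8, 5, 3, 5, 8])

global_mean = A.mean()  # 4.5625, i.e. ~4.56

# Split into 4 equal "batches" and compute each batch's mean.
batch_means = A.reshape(4, 8).mean(axis=1)  # 4.0, 4.375, 4.875, 5.0

# Per-batch centering subtracts a different value from each batch,
# so the batches end up centered inconsistently with each other.
print(global_mean, batch_means)
```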

rragundez commented 5 years ago

Thanks for the clarification. Indeed, this is whole-dataset vs per-batch standardization. In flow_from_directory you cannot use featurewise_center, for example:

https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/image/image_data_generator.py#L677

you will receive a warning, and this makes sense because the whole idea of flow_from_directory is to avoid loading the data into memory, which would be required for what you describe. With flow_from_directory you can only use samplewise_center or other transformations that relate only to the current batch.

Is this clear? Please let me know; if not, I will close this issue.
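The distinction can be illustrated with a plain-NumPy sketch (not the actual keras-preprocessing implementation): samplewise centering subtracts each sample's own mean, which needs only that sample, while featurewise centering subtracts a dataset-wide mean, which requires seeing the whole dataset first:

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.uniform(0, 255, size=(4, 8, 8, 3))  # 4 fake "images"

# samplewise_center: each image is centered by its own mean,
# so it works on any single batch with no global statistics.
samplewise = batch - batch.mean(axis=(1, 2, 3), keepdims=True)

# featurewise_center: every image is centered by one dataset-wide mean,
# which is only correct if that mean was computed over the full dataset.
dataset_mean = batch.mean(axis=0)  # stand-in for the true dataset mean
featurewise = batch - dataset_mean

print(samplewise.mean(axis=(1, 2, 3)))  # each value is ~0
```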

colt18 commented 5 years ago

Glad we are making progress. I got your point that the purpose of flow_from_directory is to avoid loading the data into memory. As a design principle, I fully understand that we shouldn't load all the data into memory. But the need may arise to standardize across the dataset instead of per batch. So IMHO, featurewise_center should take the values batchwise, datasetwise, and False instead of just True and False. batchwise would work as it does currently; datasetwise would iterate over the folders and calculate the mean without necessarily loading all the data at once. It is fine even if it loads a portion into memory, computes a partial mean, discards that portion, takes the next batch, and so on. In the end it would return a single mean value and use it to standardize, just like flow().

As I emphasized in my previous post, this may even turn out to be a futile attempt, but I'll really try it out and present the results. You are free to close this issue, sir. I think I can resurrect it once I come up with results. Thanks.

rragundez commented 5 years ago

@colt18 in theory what you are asking is possible, but it is quite a change to the current flow. Basically, in the __init__ of DataFrameIterator and DirectoryIterator we would have to calculate this mean and std, if requested by the user, by loading all the images (in batches or so), then assign the respective mean and std attributes to the class. If this were implemented it would considerably slow down class instantiation, but perhaps a warning to the user and a good docstring would suffice.

I also like your idea of having a single source of standardization with three levels: datasetwise, batchwise, and None, but I'm afraid it would break backwards compatibility, and it would have to be standardized across the Iterators. I must say I like the concept of your proposal, though.
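The batched computation described above can be sketched with running sums, so only one batch is in memory at a time. This is a sketch under the assumption that per-pixel featurewise statistics are wanted; none of the names below come from keras-preprocessing:

```python
import numpy as np

def streaming_mean_std(batches):
    """Compute featurewise mean/std over an iterable of image batches
    without holding the full dataset in memory."""
    n = 0
    s = None   # running sum of pixels
    s2 = None  # running sum of squared pixels
    for batch in batches:
        batch = batch.astype(np.float64)
        if s is None:
            s = batch.sum(axis=0)
            s2 = (batch ** 2).sum(axis=0)
        else:
            s += batch.sum(axis=0)
            s2 += (batch ** 2).sum(axis=0)
        n += batch.shape[0]
    mean = s / n
    std = np.sqrt(s2 / n - mean ** 2)  # Var[x] = E[x^2] - E[x]^2
    return mean, std

# Usage: feed it batches, e.g. yielded by a DirectoryIterator
# configured without augmentation. Here, synthetic data stands in:
data = np.random.default_rng(1).normal(size=(100, 8, 8, 3))
mean, std = streaming_mean_std(np.array_split(data, 10))
```

The sum-of-squares formula is the simplest one-pass scheme; for very large datasets in low precision, a running (Welford-style) update is numerically safer.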

pspeter commented 5 years ago

How about, instead of putting that in __init__, adding a fit_from_directory and a fit_from_dataframe, similar to the flows?

rragundez commented 5 years ago

I think that might actually be a good idea. What exactly do you propose should be calculated in these fit functions? I guess the same things as in .fit.
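A hypothetical fit_from_directory could mirror what .fit does, setting the generator's mean and std attributes, but from batches yielded by an iterator rather than an in-memory array. Everything below, including the function name, is a sketch of the proposal, not existing API:

```python
import numpy as np

def fit_from_iterator(datagen, iterator, steps):
    """Hypothetical .fit() analogue: pull `steps` batches from `iterator`
    and set datagen.mean / datagen.std, as ImageDataGenerator.fit does
    (roughly; .fit's exact axes and reshaping may differ)."""
    samples = [next(iterator) for _ in range(steps)]
    data = np.concatenate(samples, axis=0).astype(np.float64)
    datagen.mean = data.mean(axis=(0, 1, 2))  # per-channel mean
    datagen.std = data.std(axis=(0, 1, 2))    # per-channel std

# Toy usage with a stand-in "datagen" object and fake image batches:
class Dummy:  # stands in for an ImageDataGenerator
    pass

batches = iter(np.random.default_rng(2).uniform(0, 255, size=(5, 4, 8, 8, 3)))
gen = Dummy()
fit_from_iterator(gen, batches, steps=5)
print(gen.mean.shape, gen.std.shape)  # (3,) (3,)
```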

austinmw commented 5 years ago

I'd also really like a clean way to precompute the mean and std of my dataset, which does not fit into memory and necessitates flow_from_directory. Currently I'm just dividing by 255.0 to normalize to [0, 1] instead of properly subtracting the mean and dividing by the std, because doing so with flow_from_directory would be awkward.

rragundez commented 5 years ago

Seems like a very reasonable and useful feature. I'll work on it this weekend and hope you can review it later.

lennijusten commented 4 years ago

Is there any update on this feature? I have to say the information on this specific issue is scattered all over, but it seems to still be a problem.

I'd also like to add, in regard to this post:

Glad we are making progress. I got your point that the purpose of flow_from_directory is to avoid loading the data into memory. As a design principle, I fully understand that we shouldn't load all the data into memory. But the need may arise to standardize across the dataset instead of per batch. So IMHO, featurewise_center should take the values batchwise, datasetwise, and False instead of just True and False. batchwise would work as it does currently; datasetwise would iterate over the folders and calculate the mean without necessarily loading all the data at once. It is fine even if it loads a portion into memory, computes a partial mean, discards that portion, takes the next batch, and so on. In the end it would return a single mean value and use it to standardize, just like flow().

As I emphasized in my previous post, this may even turn out to be a futile attempt, but I'll really try it out and present the results. You are free to close this issue, sir. I think I can resurrect it once I come up with results. Thanks.

If one sets featurewise_center to datasetwise, there could be additional parameters that allow the user to specify a custom mean and std. This idea was also mentioned in a Stack Overflow post:

Yes, this is a really big downside of Keras.ImageDataGenerator: you cannot provide the standardization statistics on your own.

With these options specified, one wouldn't need to iterate through the whole dataset to calculate the statistics. If they are not specified, the dataset would be loaded iteratively and the statistics calculated, with the understanding that this takes some time.
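Until such an option exists, a common workaround is to precompute the statistics once (or reuse published ones) and assign them to the generator's mean and std attributes directly. The keras lines below are shown as comments and assume a channels_last setup; the statistics themselves are illustrative:

```python
import numpy as np

# Precomputed per-channel statistics scaled to the 0-255 pixel range
# (the widely published ImageNet values, used here only as an example).
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32) * 255.0
std = np.array([0.229, 0.224, 0.225], dtype=np.float32) * 255.0

# Shape them to broadcast against (height, width, channels) images.
mean = mean.reshape(1, 1, 3)
std = std.reshape(1, 1, 3)

# With keras-preprocessing this would look roughly like:
# datagen = ImageDataGenerator(featurewise_center=True,
#                              featurewise_std_normalization=True)
# datagen.mean = mean
# datagen.std = std
# ...after which flow_from_directory uses the supplied statistics
# instead of requiring datagen.fit() on an in-memory array.
```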

Dref360 commented 4 years ago

I don't have time to work on this feature in the near future, PRs are welcome :)