keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.98k stars 19.47k forks

Unbalanced (numbers of 1s and 0s are very different) data gives bad results #2005

Closed Sandy4321 closed 7 years ago

Sandy4321 commented 8 years ago

If train and test data are biased toward, for example, one class, then the training process will be biased: http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/ http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html

https://www.researchgate.net/post/What_are_the_possible_approaches_for_solving_imbalanced_class_problems

Please make sure that the boxes below are checked before you submit your issue. Thank you!

lukedeo commented 8 years ago

This isn't a Keras issue.

Sandy4321 commented 8 years ago

https://www.quora.com/Which-balance-strategy-to-learn-from-my-very-imbalanced-dataset

So when data is unbalanced, one possible strategy is: when you update weights during a minibatch during training, consider the proportions of the two classes in the minibatch and update the weights accordingly.

So if I have many more negatively labelled samples than positive ones, it may be good to create batches with the same number of negative and positive samples. Or, if I want to emphasize negatives, I need an option to put more negatives into batches. Of course, I could make the data balanced by adding the same negatives many times, to trick Keras?
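The balanced-minibatch idea above is not built into Keras, but it can be sketched in plain numpy: sample the minority class with replacement so each batch holds equal positives and negatives (the helper name `balanced_batch` is my own, not a Keras API):

```python
import numpy as np

def balanced_batch(X, y, batch_size=32, rng=None):
    """Draw a minibatch with equal numbers of positive and negative samples.

    The minority class is sampled with replacement, which amounts to
    oversampling it (the 'add the same samples many times' trick)."""
    if rng is None:
        rng = np.random.default_rng()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos_idx, half, replace=True),
        rng.choice(neg_idx, half, replace=True),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Batches drawn this way could be fed to `model.train_on_batch` in a loop, instead of a single `model.fit` call.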

Sandy4321 commented 8 years ago

some info for answer may be found in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/keras-users/LYo7sqE75N4/9K2TJHngCAAJ

I have tried to "balance" out the classes by setting class_weight={0: 1, 1: 100000}.
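For reference, `class_weight` in `fit` is a dict mapping class index to a loss weight. Rather than an extreme value like 100000, a common heuristic (my suggestion, not from this thread) is inverse class frequency:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Weight each class by n_samples / (n_classes * class_count), so the
    rarer class contributes proportionally more to the loss."""
    classes, counts = np.unique(y, return_counts=True)
    n_classes = len(classes)
    return {int(c): len(y) / (n_classes * cnt)
            for c, cnt in zip(classes, counts)}

# Usage with a compiled Keras model (assumed):
# model.fit(X_train, y_train, class_weight=inverse_frequency_weights(y_train))
```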

Sandy4321 commented 8 years ago

https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set http://stackoverflow.com/questions/30486033/tackling-class-imbalance-scaling-contribution-to-loss-and-sgd http://metaoptimize.com/qa/questions/11636/training-neural-networks-using-stochastic-gradient-descent-on-data-with-class-imbalance http://wiki.pentaho.com/display/DATAMINING/SMOTE https://www.quora.com/Whats-a-good-approach-to-binary-classification-when-the-target-rate-is-minimal http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#example-svm-plot-separating-hyperplane-unbalanced-py

Sandy4321 commented 8 years ago

https://github.com/fchollet/keras/issues/177 Loss scaling would happen inside objectives.py functions, using a class_weight parameter set in model.fit or model.train. The amount of changes needed to get it rolling would be minimal.
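What that loss scaling amounts to can be illustrated outside Keras: each sample's loss term is multiplied by its class's weight before averaging. A simplified numpy sketch of weighted binary cross-entropy (not the actual objectives.py code):

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weight):
    """Binary cross-entropy where each sample's loss is scaled by the
    weight of its true class before averaging."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))
```

With equal weights this reduces to plain binary cross-entropy; raising the minority class's weight scales its contribution to the gradient accordingly.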

Sandy4321 commented 8 years ago

The problem is not as simple as it seems; here are more links: http://ro.uow.edu.au/cgi/viewcontent.cgi?article=10491&context=infopapers

"A supervised learning approach for imbalanced data sets" by Giang H. Nguyen, Abdesselam Bouzerdoum, and Son Lam Phung (University of Wollongong)

Sandy4321 commented 8 years ago

and more links

http://arxiv.org/pdf/1508.03422.pdf

Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data

http://www.cs.utah.edu/~piyush/teaching/ImbalancedLearning.pdf

"Learning from Imbalanced Data", and even a PhD thesis:

http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd "A balanced approach to the multi-class imbalance problem", Lawrence Mosley, Iowa State University

Sandy4321 commented 8 years ago

This example shows how to deal with unbalanced data: http://pastebin.com/0QHtPGzJ, but it is still not working for my task.

danielgy commented 7 years ago

How are you doing with your task? I ran into the same imbalance problem classifying time-series data sets; the proportion of the minority class is about 0.2%. I tried oversampling methods like SMOTE, but they didn't work.
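For context, SMOTE synthesizes new minority samples by interpolating between a minority point and one of its minority-class nearest neighbours. A minimal numpy sketch of that idea (not the imbalanced-learn implementation, which also supports k-neighbour selection and categorical features):

```python
import numpy as np

def smote_sample(X_min, rng=None):
    """Create one synthetic minority sample: pick a random minority point,
    find its nearest minority-class neighbour, and interpolate between them."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(len(X_min))
    x = X_min[i]
    others = np.delete(X_min, i, axis=0)
    nn = others[np.argmin(np.linalg.norm(others - x, axis=1))]
    gap = rng.random()  # interpolation factor in [0, 1)
    return x + gap * (nn - x)
```

At 0.2% minority prevalence, synthetic interpolation between very few real points can easily produce unrepresentative samples, which may explain why SMOTE underperformed here.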

NilsRethmeier commented 7 years ago

@Sandy4321 @danielgy Training:

  1. WRONG Worth a shot in Keras (near-zero effort): model.fit(X_train, Y_train, nb_epoch=5, batch_size=32, class_weight='auto'). Undocumented, mentioned in a Google group: supposedly class-balances each batch. (Does not work, but does not throw an error.) In my mind this would reduce learning issues due to imbalanced batch updates. Also, if you google oversampling and NNs you can find papers claiming that simple training-set oversampling is valid (though simple).
  2. Consider using the Keras ModelCheckpoint callback watching the validation loss. Validation accuracy may be misleading under fake balance (val acc may overstate performance if you care more about Class=1). Neither is really good though.
  3. Consider validation_data instead of validation_split in fit(). That way you can provide an unbalanced validation set and val_loss becomes a better measure of real performance. (Not sure if this isn't implicitly taken care of with validation_split).
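Points 2 and 3 above can be combined in a small sketch (plain numpy, function name is my own): oversample only the training split and keep the held-out validation set at its natural class ratio, so val_loss reflects real performance:

```python
import numpy as np

def split_then_oversample(X, y, val_frac=0.2, rng=None):
    """Hold out a validation set at the original class ratio, then duplicate
    minority-class samples in the remaining training data until balanced."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(y))
    n_val = int(len(y) * val_frac)
    val, train = idx[:n_val], idx[n_val:]
    y_tr = y[train]
    pos = train[y_tr == 1]
    neg = train[y_tr == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, len(majority) - len(minority), replace=True)
    train_bal = np.concatenate([train, extra])
    rng.shuffle(train_bal)
    return (X[train_bal], y[train_bal]), (X[val], y[val])
```

The returned splits would then be passed as `model.fit(X_tr, y_tr, validation_data=(X_val, y_val))` instead of using `validation_split`.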

Evaluation:

Results are bad

Test on an unbalanced test set. Use Average Precision, a.k.a. the area under the Precision-Recall curve (AUC of PR). In contrast to ROC AUC, this measure reflects class imbalance. AUC_PR = average_precision_score(y_true=y_test, y_score=model.predict(X_test), average='weighted')
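For reference, scikit-learn's `average_precision_score` computes this directly; the toy labels and scores below stand in for `y_test` and `model.predict(X_test)`:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels and predicted scores (in practice, model.predict(X_test)):
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_pr = average_precision_score(y_true, y_score)
print(auc_pr)  # roughly 0.83 for these toy scores
```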

Let us know how this works for extreme class imbalance.

fabioperez commented 7 years ago

There is nothing related to class_weight = 'auto' in Keras code. Don't use it! Check https://github.com/fchollet/keras/issues/5116.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

NilsRethmeier commented 7 years ago

@fabioperez Thanks for pointing out the 'auto' mistake. Edited post. Nice catch. @Sandy4321

Happy to hear what you found.
