keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0
61.98k stars 19.47k forks

Unbalanced (numbers of 1s and 0s are very different) data gives bad results #2005

Closed Sandy4321 closed 7 years ago

Sandy4321 commented 8 years ago

If train and test data are biased toward, for example, one class, then the training process will be biased: http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/ http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html

https://www.researchgate.net/post/What_are_the_possible_approaches_for_solving_imbalanced_class_problems

Please make sure that the boxes below are checked before you submit your issue. Thank you!

lukedeo commented 8 years ago

This isn't a Keras issue.

Sandy4321 commented 8 years ago

https://www.quora.com/Which-balance-strategy-to-learn-from-my-very-imbalanced-dataset

So when data is unbalanced, one possible strategy is: when you update weights during a minibatch during training, consider the proportions of the two classes in the minibatch and update the weights accordingly.

So if I have many more negatively labelled samples than positive ones, it may be good to create batches with the same number of negative and positive samples. Or, if I want to emphasize negatives, I need an option to put more negatives into batches. Of course, I could make the data balanced by adding the same negatives many times, to trick Keras?
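The balanced-minibatch idea above is not built into Keras, but it can be sketched in plain numpy: sample the minority class with replacement so each batch holds equal positives and negatives (the helper name `balanced_batch` is my own, not a Keras API):

```python
import numpy as np

def balanced_batch(X, y, batch_size=32, rng=None):
    """Draw a minibatch with equal numbers of positive and negative samples.

    The minority class is sampled with replacement, which amounts to
    oversampling it (the 'add the same samples many times' trick)."""
    if rng is None:
        rng = np.random.default_rng()
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos_idx, half, replace=True),
        rng.choice(neg_idx, half, replace=True),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Batches drawn this way could be fed to `model.train_on_batch` in a loop, instead of a single `model.fit` call.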

Sandy4321 commented 8 years ago

some info for answer may be found in https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/keras-users/LYo7sqE75N4/9K2TJHngCAAJ

I have tried to "balance" out the classes by setting class_weight={0: 1, 1: 100000}.
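For reference, `class_weight` in `fit` is a dict mapping class index to a loss weight. Rather than an extreme value like 100000, a common heuristic (my suggestion, not from this thread) is inverse class frequency:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Weight each class by n_samples / (n_classes * class_count), so the
    rarer class contributes proportionally more to the loss."""
    classes, counts = np.unique(y, return_counts=True)
    n_classes = len(classes)
    return {int(c): len(y) / (n_classes * cnt)
            for c, cnt in zip(classes, counts)}

# Usage with a compiled Keras model (assumed):
# model.fit(X_train, y_train, class_weight=inverse_frequency_weights(y_train))
```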

Sandy4321 commented 8 years ago

https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set http://stackoverflow.com/questions/30486033/tackling-class-imbalance-scaling-contribution-to-loss-and-sgd http://metaoptimize.com/qa/questions/11636/training-neural-networks-using-stochastic-gradient-descent-on-data-with-class-imbalance http://wiki.pentaho.com/display/DATAMINING/SMOTE https://www.quora.com/Whats-a-good-approach-to-binary-classification-when-the-target-rate-is-minimal http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#example-svm-plot-separating-hyperplane-unbalanced-py

Sandy4321 commented 8 years ago

https://github.com/fchollet/keras/issues/177 Loss scaling would happen inside objectives.py functions, using a class_weight parameter set in model.fit or model.train. The amount of changes needed to get it rolling would be minimal.
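What that loss scaling amounts to can be illustrated outside Keras: each sample's loss term is multiplied by its class's weight before averaging. A simplified numpy sketch of weighted binary cross-entropy (not the actual objectives.py code):

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weight):
    """Binary cross-entropy where each sample's loss is scaled by the
    weight of its true class before averaging."""
    eps = 1e-7
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(y_pred)
                   + (1 - y_true) * np.log(1 - y_pred))
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return float(np.mean(weights * per_sample))
```

With equal weights this reduces to plain binary cross-entropy; raising the minority class's weight scales its contribution to the gradient accordingly.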

Sandy4321 commented 8 years ago

The problem is not as simple as it seems; here are more links: http://ro.uow.edu.au/cgi/viewcontent.cgi?article=10491&context=infopapers

"A supervised learning approach for imbalanced data sets" by Giang H. Nguyen, Abdesselam Bouzerdoum, and Son Lam Phung (University of Wollongong)

Sandy4321 commented 8 years ago

and more links

http://arxiv.org/pdf/1508.03422.pdf

Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data

http://www.cs.utah.edu/~piyush/teaching/ImbalancedLearning.pdf

"Learning from Imbalanced Data", and even a PhD thesis:

http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd "A balanced approach to the multi-class imbalance problem", Lawrence Mosley, Iowa State University

Sandy4321 commented 8 years ago

This example shows how to deal with unbalanced data: http://pastebin.com/0QHtPGzJ, but it is still not working for my task.

danielgy commented 7 years ago

How are you doing with your task? I ran into the same imbalance problem classifying time-series data sets; the proportion of the minority class is about 0.2%. I tried oversampling methods like SMOTE, but they didn't work.
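For context, SMOTE synthesizes new minority samples by interpolating between a minority point and one of its minority-class nearest neighbours. A minimal numpy sketch of that idea (not the imbalanced-learn implementation, which also supports k-neighbour selection and categorical features):

```python
import numpy as np

def smote_sample(X_min, rng=None):
    """Create one synthetic minority sample: pick a random minority point,
    find its nearest minority-class neighbour, and interpolate between them."""
    if rng is None:
        rng = np.random.default_rng()
    i = rng.integers(len(X_min))
    x = X_min[i]
    others = np.delete(X_min, i, axis=0)
    nn = others[np.argmin(np.linalg.norm(others - x, axis=1))]
    gap = rng.random()  # interpolation factor in [0, 1)
    return x + gap * (nn - x)
```

At 0.2% minority prevalence, synthetic interpolation between very few real points can easily produce unrepresentative samples, which may explain why SMOTE underperformed here.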

NilsRethmeier commented 7 years ago

@Sandy4321 @danielgy Training:

  1. WRONG Worth a shot in Keras (near-zero effort): model.fit(X_train, Y_train, nb_epoch=5, batch_size=32, class_weight='auto'). Undocumented, mentioned in a Google group: supposedly class-balances each batch. (Does not work, but does not throw an error.) In my mind this would reduce learning issues due to imbalanced batch updates. Also, if you google oversampling and NNs you can find papers claiming that simple training-set oversampling is valid (though simple).
  2. Consider using the Keras ModelCheckpoint callback watching the validation loss. Validation accuracy may be misleading under fake balance (val acc may overstate performance if you care more about Class=1). Neither is really good though.
  3. Consider validation_data instead of validation_split in fit(). That way you can provide an unbalanced validation set and val_loss becomes a better measure of real performance. (Not sure if this isn't implicitly taken care of with validation_split).
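Points 2 and 3 above can be combined in a small sketch (plain numpy, function name is my own): oversample only the training split and keep the held-out validation set at its natural class ratio, so val_loss reflects real performance:

```python
import numpy as np

def split_then_oversample(X, y, val_frac=0.2, rng=None):
    """Hold out a validation set at the original class ratio, then duplicate
    minority-class samples in the remaining training data until balanced."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.permutation(len(y))
    n_val = int(len(y) * val_frac)
    val, train = idx[:n_val], idx[n_val:]
    y_tr = y[train]
    pos = train[y_tr == 1]
    neg = train[y_tr == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = rng.choice(minority, len(majority) - len(minority), replace=True)
    train_bal = np.concatenate([train, extra])
    rng.shuffle(train_bal)
    return (X[train_bal], y[train_bal]), (X[val], y[val])
```

The returned splits would then be passed as `model.fit(X_tr, y_tr, validation_data=(X_val, y_val))` instead of using `validation_split`.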

Evaluation:

Results are bad

Test on an unbalanced test set. Use Average Precision, a.k.a. the area under the Precision-Recall curve (AUC of PR). In contrast to ROC AUC, this measure reflects class imbalance. AUC_PR = average_precision_score(y_true=y_test, y_score=model.predict(X_test), average='weighted')
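For reference, scikit-learn's `average_precision_score` computes this directly; the toy labels and scores below stand in for `y_test` and `model.predict(X_test)`:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy labels and predicted scores (in practice, model.predict(X_test)):
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

auc_pr = average_precision_score(y_true, y_score)
print(auc_pr)  # roughly 0.83 for these toy scores
```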

Let us know how this works for extreme class imbalance.

fabioperez commented 7 years ago

There is nothing related to class_weight = 'auto' in Keras code. Don't use it! Check https://github.com/fchollet/keras/issues/5116.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

NilsRethmeier commented 7 years ago

@fabioperez Thanks for pointing out the 'auto' mistake. Edited post. Nice catch. @Sandy4321

Happy to hear what you found.
