This isn't a Keras issue.
https://www.quora.com/Which-balance-strategy-to-learn-from-my-very-imbalanced-dataset
So when the data is unbalanced, one possible strategy is: when you update the weights on a minibatch during training, take the class proportions in that minibatch into account and scale the update accordingly.
So if I have many more negatively labelled samples than positive ones, it may be good to create batches that contain the same number of negative and positive samples. Or, if I want to emphasize negatives, I need an option to put more negatives into each batch. Of course I could balance the data myself by adding the same negatives many times; would that trick Keras?
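For what it's worth, a balanced-batch generator along these lines could be fed to fit_generator (just a sketch, not built-in Keras functionality; the function name and the old Keras 1 fit_generator arguments are assumptions):

```python
import numpy as np

def balanced_batch_generator(X, y, batch_size=32):
    """Yield minibatches containing roughly equal numbers of positive and
    negative samples, resampling each class with replacement as needed.
    Sketch only -- assumes binary labels 0/1 in NumPy arrays."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    half = batch_size // 2
    while True:
        batch_idx = np.concatenate([
            np.random.choice(pos_idx, half, replace=True),
            np.random.choice(neg_idx, batch_size - half, replace=True),
        ])
        np.random.shuffle(batch_idx)
        yield X[batch_idx], y[batch_idx]

# model.fit_generator(balanced_batch_generator(X_train, y_train, 32),
#                     samples_per_epoch=len(X_train), nb_epoch=5)
```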
Some info relevant to an answer may be found at https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/keras-users/LYo7sqE75N4/9K2TJHngCAAJ
I have tried to "balance" out the classes by setting class_weight={0: 1, 1: 100000}.
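For reference, passing an explicit class_weight dict looks roughly like this (the weights below are illustrative only, and the model/variable names are placeholders):

```python
# up-weight the rare positive class; the exact ratio is problem-dependent
class_weight = {0: 1., 1: 100.}
model.fit(X_train, y_train,
          nb_epoch=5, batch_size=32,
          class_weight=class_weight)
```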
https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set
http://stackoverflow.com/questions/30486033/tackling-class-imbalance-scaling-contribution-to-loss-and-sgd
http://metaoptimize.com/qa/questions/11636/training-neural-networks-using-stochastic-gradient-descent-on-data-with-class-imbalance
http://wiki.pentaho.com/display/DATAMINING/SMOTE
https://www.quora.com/Whats-a-good-approach-to-binary-classification-when-the-target-rate-is-minimal
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#example-svm-plot-separating-hyperplane-unbalanced-py
From https://github.com/fchollet/keras/issues/177: "Loss scaling would happen inside the objectives.py functions, using a class_weight parameter set in model.fit or model.train. The amount of changes needed to get it rolling would be minimal."
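If you would rather do the loss scaling yourself instead of relying on class_weight, a manually weighted binary crossentropy can be written against the Keras backend roughly like this (a sketch; the function and the pos_weight value are my own, not part of Keras):

```python
from keras import backend as K

def weighted_binary_crossentropy(pos_weight=100.):
    """Binary crossentropy that scales the loss on positive samples by
    pos_weight -- a sketch of manual loss scaling, not a Keras built-in."""
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1. - K.epsilon())
        bce = -(y_true * K.log(y_pred) + (1. - y_true) * K.log(1. - y_pred))
        weights = y_true * pos_weight + (1. - y_true)
        return K.mean(weights * bce, axis=-1)
    return loss

# model.compile(optimizer='adam', loss=weighted_binary_crossentropy(pos_weight=100.))
```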
The problem is not as simple as it seems; here are more links: http://ro.uow.edu.au/cgi/viewcontent.cgi?article=10491&context=infopapers
A supervised learning approach for imbalanced data sets -- Giang H. Nguyen, Abdesselam Bouzerdoum, Son Lam Phung (University of Wollongong)
And more links:
Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data
http://www.cs.utah.edu/~piyush/teaching/ImbalancedLearning.pdf -- Learning from Imbalanced Data
And even a PhD thesis: http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd -- A balanced approach to the multi-class imbalance problem, Lawrence Mosley, Iowa State University
This example shows one way to deal with unbalanced data: http://pastebin.com/0QHtPGzJ, but it still does not work for my task.
How is it going with your task? I ran into the same imbalance problem in classification on time-series data sets; the minority class makes up only about 0.2% of the data. I tried oversampling such as SMOTE, but it didn't work.
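For anyone else trying this, SMOTE is straightforward to run via the imbalanced-learn package (a sketch; whether it actually helps at a ~0.2% minority rate is exactly the open question here):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
# older imbalanced-learn releases call this method fit_sample instead
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
print(Counter(y_train), '->', Counter(y_resampled))
```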
@Sandy4321 @danielgy

Training:
model.fit(X_train, Y_train, nb_epoch=5, batch_size=32, class_weight='auto')
Undocumented, mentioned in the Google group: supposedly class-balances each batch. (Does not work: it does not throw an error, but results are bad.) In my mind this would reduce learning issues due to imbalanced batch updates.
Also, if you google oversampling and NNs you can find papers claiming that simple training-set oversampling is valid (though simple).
Use validation_data instead of validation_split in fit(). That way you can provide an unbalanced validation set and val_loss becomes a better measure of real performance. (Not sure if this isn't implicitly taken care of with validation_split.)

Evaluation:
Test on an unbalanced test set.
Average Precision, a.k.a. the area under the precision-recall curve (AUC of PR): in contrast to ROC AUC, this measure incorporates class imbalance.
AUC_PR = average_precision_score(y_true=y_test, y_score=model.predict(X_test), average='weighted')

Let us know how this works for extreme class imbalance.
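Put together, the evaluation side could look roughly like this (a sketch; X_val/y_val, X_test/y_test and the model are placeholders, and model.predict is assumed to return the positive-class probability per sample):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# keep validation and test sets at their natural (imbalanced) class ratio
model.fit(X_train, y_train,
          nb_epoch=5, batch_size=32,
          validation_data=(X_val, y_val))

y_score = model.predict(X_test).ravel()
auc_pr = average_precision_score(y_test, y_score)  # area under the precision-recall curve
auc_roc = roc_auc_score(y_test, y_score)           # for comparison; less sensitive to imbalance
print('AUC-PR: %.4f  ROC-AUC: %.4f' % (auc_pr, auc_roc))
```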
There is nothing related to class_weight = 'auto' in the Keras code. Don't use it! Check https://github.com/fchollet/keras/issues/5116.
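Since 'auto' is silently ignored, one alternative is to compute balanced weights yourself with scikit-learn and pass the resulting dict (a sketch; variable names are placeholders):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}  # e.g. {0: 0.5, 1: 250.0}

model.fit(X_train, y_train, nb_epoch=5, batch_size=32, class_weight=class_weight)
```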
@fabioperez Thanks for pointing out the 'auto' mistake. Edited the post. Nice catch.
@Sandy4321 What performance measure are you optimizing for? I struggled with this issue and found that it is often not a model issue but rather an issue of how we measure performance.
E.g. we turn sigmoid probabilities into binary predictions via a threshold of 0.5. Your model may learn the class imbalance and put the actual best decision threshold above or below 0.5. The ROC measure, for example, tries all thresholds and is quite robust against class imbalance; it can also give you things like the break-even point, etc. You can easily calculate the optimal threshold automatically, no need to learn it -- if your goal is decision automation anyway.
Balancing feels rather artificial -- IMHO, after dealing with this. Some papers suggest balancing the training set but not the test set, or balancing both; either is flawed IMHO. If the model learns the class imbalance, then why not let it. You can always compute this threshold on your tuning/validation set (not your test set, to be rigorous) -- a cheap way to handle it; see the sketch below. Also, use multiple measures to derive a final picture of performance. No single measure is perfect, and any model is usually optimized for a particular measure such as F1, accuracy, MAPE, ROC, etc.
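One way to pick that threshold on the validation set is to scan the precision-recall curve and take the point that maximizes F1 (a sketch; maximizing F1 is just one of several reasonable criteria):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

val_scores = model.predict(X_val).ravel()
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)

# precision/recall have one more entry than thresholds, so drop the last point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# apply the tuned threshold on the untouched test set
y_pred = (model.predict(X_test).ravel() >= best_threshold).astype(int)
```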
Happy to hear what you found.
If the training and test data are biased towards one class, for example, then the training process will be biased.
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
https://www.researchgate.net/post/What_are_the_possible_approaches_for_solving_imbalanced_class_problems