Closed: kevinkit closed this issue 6 years ago
Thank you for your inquiry. The objective of stratification is to make the percentage of each label similar between train and test splits. In your code, it looks as though you are comparing the distribution of labels from all data to the distribution of labels from the train data. While the label distribution from all of the data should be similar to the label distribution of the train data, I believe you really want to compare the distribution of labels from train data to the distribution of labels of test data. They should be similar. Here is some simplified code.
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
from matplotlib import pyplot as plt
AMOUNT_OF_CLASSES = 3
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])
# now go for stratification
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
for train_index, test_index in msss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# plot the fraction of samples carrying each label: all data, train, test
plt.figure()
plt.subplot(3,1,1)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y.sum(axis=0) / y.shape[0])
plt.title('All Data Distribution')
plt.subplot(3,1,2)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y_train.sum(axis=0) / y_train.shape[0])
plt.title('Train Distribution')
plt.subplot(3,1,3)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y_test.sum(axis=0) / y_test.shape[0])
plt.title('Test Distribution')
plt.tight_layout()
plt.show()
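As a numeric complement to the plots, here is a minimal sketch that compares per-label fractions between two splits using NumPy only. The helper names `label_fractions` and `max_fraction_gap` are hypothetical, and the train/test indices are hand-picked for illustration; they are not the output of `MultilabelStratifiedShuffleSplit`.

```python
import numpy as np

def label_fractions(y):
    """Fraction of samples that carry each label (y is a 0/1 indicator matrix)."""
    y = np.asarray(y)
    return y.sum(axis=0) / y.shape[0]

def max_fraction_gap(y_a, y_b):
    """Largest absolute difference in per-label fraction between two splits."""
    return float(np.max(np.abs(label_fractions(y_a) - label_fractions(y_b))))

# Same label matrix as in the example above.
y = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 0, 1],
              [1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]])

# ASSUMPTION: a hand-picked split, used only to show what a well-stratified
# split looks like; a real run would take these indices from msss.split(X, y).
train_index = np.array([0, 1, 3, 6])
test_index = np.array([2, 4, 5, 7])

print("all:  ", label_fractions(y))
print("train:", label_fractions(y[train_index]))
print("test: ", label_fractions(y[test_index]))
print("max gap:", max_fraction_gap(y[train_index], y[test_index]))
```

In this hand-picked split, train and test both reproduce the per-label fractions of the full data (0.75, 0.25, 0.5), so the gap is zero. The classes are still not evenly distributed: stratification keeps the existing imbalance consistent across splits rather than removing it.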
Thank you very much for this detailed and much appreciated answer. Now I get it!
Is there any idea or "go-to" method for solving the original problem, that is, obtaining an evenly distributed multilabel dataset?
That is an interesting problem that I have not looked into. For binary classification problems with imbalanced datasets, I've listed some materials below. I have not thought about how the methods could be used for multilabel datasets.
Review paper: https://sci2s.ugr.es/keel/dataset/includes/catImbFiles/2004-Batista-SIGKDD.pdf
SMOTE (one technique) in Python: https://github.com/scikit-learn-contrib/imbalanced-learn
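To make the binary case concrete, here is a rough sketch of plain random oversampling, which is a simpler technique than SMOTE and does not use the imbalanced-learn API. The function name `random_oversample` and the demo data are hypothetical. It assumes a single-label target vector; it does not address the multilabel case, where duplicating a row changes the counts of every label that row carries.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes match
    the size of the largest class. Single-label targets only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            members = np.flatnonzero(y == cls)
            # sample with replacement to fill the gap for this class
            keep.append(rng.choice(members, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]

# Hypothetical imbalanced binary data: 8 negatives, 2 positives.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print(np.bincount(y_bal))  # [8 8]
```

Oversampling of this kind is normally applied to the training split only, after the train/test split has been made, so that duplicated rows do not leak into the test data.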
Hello and thank you for this project.
I am new to machine learning and have a little bit of trouble getting started with this.
If I understand correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to get an evenly distributed one.
To test this I used one of the toy examples and changed it a little, so that I have an uneven distribution over 3 classes.
If I take a look at the distribution at the beginning it will look like the following:
If I now do the stratification like this:
and look at the distribution at the end:
I will get the following:
So it still looks like I do not have an even distribution among the classes.
Is this not what this is used for? How could I make every class evenly distributed over the data? Thank you very much.