trent-b / iterative-stratification

scikit-learn cross validators for iterative stratification of multilabel data
BSD 3-Clause "New" or "Revised" License

Getting started help #4

Closed kevinkit closed 6 years ago

kevinkit commented 6 years ago

Hello and thank you for this project.

I am new to machine learning and have a little bit of trouble getting started with this.

If I understood correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to get an evenly distributed one.

To test this I used one of the toy examples and changed it a little so that I have an uneven distribution over 3 classes.

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
from matplotlib import pyplot as plt

AMOUNT_OF_CLASSES = 3
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])

If I take a look at the distribution at the beginning (the per-label totals are 6, 2, and 4), it looks like the following:

dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
for i in range(0,AMOUNT_OF_CLASSES):
    dis[i] = y[:,i].sum()

# Show original distribution
plt.figure(0)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)

[image: bar chart of the original label distribution]

If I now do the stratification like this:

# now go for stratification
msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)

cnt = 1
# distribution over all iterations
all_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
for train_index, test_index in msss.split(X, y):
    iter_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    for i in range(0,AMOUNT_OF_CLASSES):
        iter_dis[i] = y_train[:,i].sum()

    all_dis += iter_dis
    # Show this split's train distribution in its own figure
    plt.figure(cnt)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],iter_dis)

    cnt += 1

and look at the distribution at the end:


plt.figure(cnt+1)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],all_dis)    
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
plt.title("Distribution after Stratification")
plt.legend(['Distribution after stratification','original distribution'])

I will get the following:

[image: bar chart comparing the accumulated train distributions with the original distribution]

So it still looks like I do not have an even distribution among the classes.

Is this not what this is used for? How could I achieve an even distribution of every class over the data? Thank you very much.

trent-b commented 6 years ago

Thank you for your inquiry. The objective of stratification is to make the percentage of each label similar between train and test splits. In your code, it looks as though you are comparing the distribution of labels from all data to the distribution of labels from the train data. While the label distribution from all of the data should be similar to the label distribution of the train data, I believe you really want to compare the distribution of labels from train data to the distribution of labels of test data. They should be similar. Here is some simplified code.

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
from matplotlib import pyplot as plt

AMOUNT_OF_CLASSES = 3
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])

# now go for stratification
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

for train_index, test_index in msss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

plt.figure()
plt.subplot(3,1,1)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y.sum(axis=0) / y.shape[0])
plt.title('All Data Distribution')
plt.subplot(3,1,2)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y_train.sum(axis=0) / y_train.shape[0])
plt.title('Train Distribution')
plt.subplot(3,1,3)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],y_test.sum(axis=0) / y_test.shape[0])
plt.title('Test Distribution')

[image: three stacked bar charts showing the All Data, Train, and Test label distributions]
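
To verify the split numerically rather than visually, one can also print the per-label proportions directly; here is a minimal sketch reusing the y, y_train, and y_test arrays from the code above:

# Print per-label proportions for the full data and each split.
# The three rows should be close to one another if stratification worked.
for name, labels in [("all", y), ("train", y_train), ("test", y_test)]:
    print(name, labels.sum(axis=0) / labels.shape[0])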

kevinkit commented 6 years ago

Thank you very much for this detailed and much appreciated answer. Now I get it!

Any idea or "go-to" method that could be used to solve the original problem of obtaining an evenly distributed multilabel dataset?

trent-b commented 6 years ago

That is an interesting problem that I have not looked into. For binary classification problems with imbalanced datasets, I've listed some materials below. I have not thought about how the methods could be used for multilabel datasets.

Review paper: https://sci2s.ugr.es/keel/dataset/includes/catImbFiles/2004-Batista-SIGKDD.pdf
SMOTE (one technique) in Python: https://github.com/scikit-learn-contrib/imbalanced-learn
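
For illustration, here is a minimal SMOTE sketch on a binary toy problem, assuming a recent imbalanced-learn release (where the resampling method is named fit_resample):

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Build an imbalanced binary toy dataset (roughly 90% / 10%).
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
print("minority count before:", (y == 1).sum())

# SMOTE synthesizes new minority-class samples to balance the classes.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("minority count after:", (y_res == 1).sum())

As noted above, extending this to multilabel targets is not straightforward, since SMOTE operates on a single class vector.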