Add default imbalanced dataset

scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

https://imbalanced-learn.org

MIT License

Add default imbalanced dataset #88

Closed glemaitre closed 8 years ago

glemaitre commented 8 years ago

It would be nice to have a module for loading imbalanced datasets from the web.

The list (p. 40) proposed by Z. Ding could be used for that. Some txt files were available at some point, but we could take the original data directly.

dvro commented 8 years ago

@glemaitre http://sci2s.ugr.es/keel/imbalanced.php

chkoar commented 8 years ago

We could add a data loader/downloader for these files. The KEEL data format is similar to ARFF. I have a draft implementation for generic reading of KEEL .dat files.

glemaitre commented 8 years ago

Pretty cool. And the link should always be up.

dvro commented 8 years ago

I have this from my MSc experiments:

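import numpy as np


def load_database(path, separator=","):
    """Load a KEEL-style .dat file and return (X, y)."""
    # Let open() raise its own informative error if the file is missing.
    with open(path, "r") as f:
        lines = [line.strip() for line in f]

    # Drop blank lines and the '@' header lines.
    rows = [line.split(separator) for line in lines
            if line and not line.startswith("@")]
    # Every column but the last is a feature; the last is the class label.
    X = np.asarray([row[:-1] for row in rows], dtype=float)
    # Map 'positive'/'negative' to 1/0; any other label is kept as a string.
    d = {'positive': 1, 'negative': 0}
    y = np.asarray([d.get(row[-1].strip(), row[-1].strip()) for row in rows])

    return X, y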

glemaitre commented 8 years ago

In fact, there are a lot of redundant datasets that differ only in the class considered during binarization.

I think it would be more efficient to download the original data and apply the imbalancing on top of it. Furthermore, all or almost all of the data come from UCI, which should also be available on mldata.org, so we could directly use the scikit-learn downloader.

What is less clear to me is how to provide several imbalance ratios for the same dataset, and in a way that is easy for the user.

What are your thoughts on that?
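
To make the "several ratios for the same dataset" question concrete, here is a minimal sketch that assumes imbalancing simply means randomly subsampling one class. The helper `subsample_class` and its parameters are hypothetical and not part of imbalanced-learn or scikit-learn; the toy lists stand in for an original dataset downloaded elsewhere.

```python
import random
from collections import Counter


def subsample_class(X, y, target, ratio, seed=0):
    """Keep a fraction `ratio` of the samples labelled `target`; keep all others.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    target_idx = [i for i, label in enumerate(y) if label == target]
    if not target_idx:
        raise ValueError("no sample carries the target label")
    rng = random.Random(seed)
    # Always keep at least one sample so the class does not vanish.
    n_keep = max(1, round(ratio * len(target_idx)))
    kept = set(rng.sample(target_idx, n_keep))
    keep = [i for i, label in enumerate(y) if label != target or i in kept]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy balanced data standing in for an original dataset.
X = [[float(i)] for i in range(200)]
y = [0] * 100 + [1] * 100

# Same dataset, several imbalance levels, one call per level.
for ratio in (0.5, 0.2, 0.05):
    X_imb, y_imb = subsample_class(X, y, target=1, ratio=ratio)
    print(ratio, Counter(y_imb))
```

With this shape the user picks the dataset once and only varies `ratio`, which is one possible answer to the ease-of-use concern.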

chkoar commented 8 years ago

@dvro @glemaitre Do we need this after #115?

glemaitre commented 8 years ago

We have 3 possibilities:

1. we close this issue, considering that the user can download the datasets through sklearn and make them imbalanced.
2. we provide an overload of the function from sklearn by making a pipeline that loads the data and then applies make_imbalance.
3. we add new functions to load specific datasets that are recurrent in the literature on the imbalance problem.

In my opinion, we should go with choice 1 and do something about 3.
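
For reference, the second possibility (a load step followed by an imbalancing step) could look roughly like the sketch below. Everything here is a stand-in: `with_imbalance`, `toy_loader` and `keep_n_of_class` are hypothetical names, the loader takes the place of a scikit-learn dataset function, and the imbalancer takes the place of make_imbalance, whose real signature is not reproduced here.

```python
import random


def with_imbalance(loader, imbalancer):
    """Wrap a loader so that its (X, y) output is passed through an imbalancer.

    loader() -> (X, y) and imbalancer(X, y) -> (X, y) are plain callables.
    """
    def load():
        X, y = loader()
        return imbalancer(X, y)
    return load


def toy_loader():
    """Stand-in for a dataset loader: 50 samples of each of two classes."""
    X = [[float(i)] for i in range(100)]
    y = [0] * 50 + [1] * 50
    return X, y


def keep_n_of_class(label, n, seed=0):
    """Stand-in imbalancer factory: keep at most n samples of `label`."""
    def imbalancer(X, y):
        rng = random.Random(seed)
        idx = [i for i, v in enumerate(y) if v == label]
        kept = set(rng.sample(idx, min(n, len(idx))))
        keep = [i for i, v in enumerate(y) if v != label or i in kept]
        return [X[i] for i in keep], [y[i] for i in keep]
    return imbalancer


# Compose once, then call it like any other loader.
load_imbalanced_toy = with_imbalance(toy_loader, keep_n_of_class(label=1, n=10))
X, y = load_imbalanced_toy()
print(len(y), y.count(1))
```

The wrapper adds very little over calling the two steps by hand, which is consistent with preferring choice 1.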

chkoar commented 8 years ago

+1 for 1.

dvro commented 8 years ago

@glemaitre @chkoar agreed!