Closed glemaitre closed 8 years ago
@glemaitre http://sci2s.ugr.es/keel/imbalanced.php
We could add a data loader/downloader for these files. The KEEL data format is similar to ARFF. I have a draft implementation for generic reading of KEEL .dat files.
Pretty cool. And the link should always be up.
I have this from my MSc experiments:
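import numpy as np

def load_database(path, separator=","):
    # Read all lines; the context manager closes the file even on error.
    with open(path, "r") as f:
        lines = [line.strip() for line in f]
    # Drop blank lines and header lines, which start with '@'.
    rows = [line.split(separator) for line in lines
            if line and not line.startswith("@")]
    # Every column except the last one is a numeric feature.
    X = np.asarray([row[:-1] for row in rows], dtype=float)
    # Map the class labels to integers; any other label is kept as-is.
    d = {'positive': 1, 'negative': 0}
    labels = [row[-1].strip() for row in rows]
    y = np.asarray([d.get(label, label) for label in labels])
    return X, y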
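The loader above simply skips the `@` header lines. Since the KEEL format resembles ARFF, those lines could also be read to recover the relation and attribute names. The following is only a minimal sketch: it assumes header lines of the form `@relation <name>`, `@attribute <name> <type...>` and `@data`, and `parse_keel_header` is a hypothetical helper, not part of any existing library.

```python
def parse_keel_header(lines):
    """Collect the relation name and attribute names from '@' header lines.

    Assumes an ARFF-like layout; anything after '@data' is ignored.
    """
    relation = None
    attributes = []
    for line in lines:
        line = line.strip()
        if not line.startswith("@"):
            continue
        # Split '@keyword rest of line' into its two parts.
        keyword, _, rest = line[1:].partition(" ")
        keyword = keyword.lower()
        if keyword == "relation":
            relation = rest.strip()
        elif keyword == "attribute":
            # The name is the first token; for nominal attributes it may be
            # glued to the opening '{' of the value list.
            name = rest.strip().split()[0].split("{")[0]
            attributes.append(name)
        elif keyword == "data":
            break
    return relation, attributes


# Made-up header, for illustration only.
sample = [
    "@relation toy",
    "@attribute width real [0.0, 10.0]",
    "@attribute height real [0.0, 5.0]",
    "@attribute class {positive, negative}",
    "@data",
    "1.0, 2.0, positive",
]
relation, attributes = parse_keel_header(sample)
```

With the attribute names available, the last one can be taken as the target column instead of relying on its position alone.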
In fact, many of the datasets are redundant: they differ only in which class is treated as the positive one during binarization.
I think it would be more efficient to download the original data and apply the imbalancing on top of it. Furthermore, all or almost all of the data come from UCI, which should also be available on mldata.org, so we could use the scikit-learn downloader directly.
What is less clear to me is how to provide several imbalance ratios for the same dataset in a way that is easy for the user.
What are your thoughts on that?
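To make the idea concrete, here is a rough sketch of how one original multiclass dataset could yield several binary variants at different imbalance ratios. It uses only the standard library; `binarize` and `make_ratio` are hypothetical names introduced for illustration and are not the `make_imbalance` function mentioned later in the thread. It also assumes that the positive class is the one to be subsampled.

```python
import random


def binarize(labels, positive_class):
    """Return 1 for the chosen class and 0 for every other class."""
    return [1 if label == positive_class else 0 for label in labels]


def make_ratio(X, y, ratio, seed=0):
    """Subsample the positive class so that n_positive / n_negative ~ ratio.

    All negative samples are kept; at least one positive is always kept.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_keep = min(len(pos), max(1, int(round(ratio * len(neg)))))
    # Keep the original sample order after subsampling.
    kept = sorted(rng.sample(pos, n_keep) + neg)
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy three-class data: 30 samples, 10 per class.
X = [[float(i)] for i in range(30)]
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10

# Choosing a different positive class gives a different binary problem.
y = binarize(labels, positive_class="a")  # 10 positives, 20 negatives

# One variant of the same dataset per requested ratio.
variants = {ratio: make_ratio(X, y, ratio) for ratio in (0.5, 0.25, 0.1)}
```

Exposing the ratio as a plain parameter, as here, would let a user request any number of variants from a single download instead of shipping one file per ratio.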
@dvro @glemaitre Do we need this after #115?
We have 3 possibilities:
1. sklearn
2. and make it imbalanced.sklearn
3. by making a pipeline to load followed by make_imbalance
In my opinion, I would go with choice 1 and do something about 3.
+1 for 1.
@glemaitre @chkoar agreed!
It would be nice to have a module for loading imbalanced datasets from the web.
A list (pp. 40) proposed by Z. Ding could be used for that purpose. Some txt files were available at some point, but we could take the original data directly.
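A loader of this kind would probably download each file once and reuse a local copy afterwards. The sketch below shows that caching step with the standard library only; `fetch_file`, its parameters, and the cache layout are assumptions made for illustration, not an existing API.

```python
import os
import shutil
import urllib.request


def fetch_file(url, cache_dir, filename):
    """Download `url` into `cache_dir` unless a cached copy already exists.

    Returns the path of the local file.
    """
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    if not os.path.exists(target):
        # Stream the response to disk rather than holding it in memory.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

The returned path could then be handed to a parser such as the `load_database` function posted earlier in this thread.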