openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Many duplicate datasets #825

Open amueller opened 5 years ago

amueller commented 5 years ago
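import openml
import pandas as pd

# Fetch metadata for all active datasets and put one dataset per row.
datasets_active = openml.datasets.list_datasets(status="active")
df = pd.DataFrame(datasets_active).T

# Keep one row per dataset name so that versions of the same dataset are not compared.
unique_versions = df.drop_duplicates(subset="name")

# Flag every dataset whose meta-features match those of a differently named dataset.
meta_features = ['MajorityClassSize', 'MaxNominalAttDistinctValues', 'MinorityClassSize',
       'NumberOfClasses', 'NumberOfFeatures', 'NumberOfInstances',
       'NumberOfInstancesWithMissingValues', 'NumberOfMissingValues',
       'NumberOfNumericFeatures', 'NumberOfSymbolicFeatures']
duplicated = unique_versions.duplicated(subset=meta_features, keep=False)
duplicated.sum()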

990

(excluding QSAR I get 263)
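The count with one dataset family excluded could be reproduced along these lines. This is only a sketch on made-up data: the helper `count_metafeature_duplicates`, its `exclude_prefix` parameter, and the toy names are hypothetical, and it assumes the family to exclude can be recognised by a name prefix such as "QSAR".

```python
import pandas as pd

# Meta-feature columns used as the duplicate "fingerprint" (same list as in the snippet above).
META_COLS = [
    "MajorityClassSize", "MaxNominalAttDistinctValues", "MinorityClassSize",
    "NumberOfClasses", "NumberOfFeatures", "NumberOfInstances",
    "NumberOfInstancesWithMissingValues", "NumberOfMissingValues",
    "NumberOfNumericFeatures", "NumberOfSymbolicFeatures",
]


def count_metafeature_duplicates(df, exclude_prefix=None):
    """Count datasets whose meta-features match those of a differently named dataset."""
    # Keep one row per dataset name so multiple versions are not compared with each other.
    unique_versions = df.drop_duplicates(subset="name")
    if exclude_prefix is not None:
        # Assumption: the excluded family is identifiable by a name prefix.
        keep = ~unique_versions["name"].str.startswith(exclude_prefix)
        unique_versions = unique_versions[keep]
    # keep=False flags every member of a matching group, not only the repeats.
    return int(unique_versions.duplicated(subset=META_COLS, keep=False).sum())


# Toy data (made up): two plain look-alikes, two QSAR look-alikes, one unique dataset.
toy = pd.DataFrame(
    [
        {"name": "a", **dict.fromkeys(META_COLS, 1)},
        {"name": "b", **dict.fromkeys(META_COLS, 1)},
        {"name": "QSAR-x", **dict.fromkeys(META_COLS, 2)},
        {"name": "QSAR-y", **dict.fromkeys(META_COLS, 2)},
        {"name": "c", **dict.fromkeys(META_COLS, 3)},
    ]
)
print(count_metafeature_duplicates(toy))          # 4
print(count_metafeature_duplicates(toy, "QSAR"))  # 2
```

Matching meta-features are only a hint that two datasets may be the same; the count says nothing about whether the underlying data is identical.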

amueller commented 5 years ago

Looking at the 263, these seem likely duplicates:

561 = 1420

996 = 1005

1442 = 1449

670 = 671 = 673 = 672

1506 = 4329

522 = 547

1571 = 1435

357 = 1242

1241 = 351
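Groups like the ones listed above can be read off programmatically instead of by eye. The following is a rough sketch, not the procedure actually used here: `duplicate_groups` is a hypothetical helper, the toy IDs are made up, and the `dropna` argument of `groupby` assumes pandas 1.1 or newer.

```python
import pandas as pd


def duplicate_groups(df, cols):
    """Return lists of index labels whose rows share identical values in `cols`."""
    groups = []
    # dropna=False keeps rows with missing meta-features in the grouping.
    for _, part in df.groupby(cols, dropna=False, sort=False):
        if len(part) > 1:
            groups.append(sorted(part.index.tolist()))
    return groups


# Toy frame indexed by made-up dataset IDs.
toy = pd.DataFrame(
    {"NumberOfInstances": [100, 100, 50, 100], "NumberOfFeatures": [5, 5, 3, 5]},
    index=[10, 20, 30, 40],
)
print(duplicate_groups(toy, ["NumberOfInstances", "NumberOfFeatures"]))  # [[10, 20, 40]]
```

Applied to `unique_versions` with the full meta-feature list, each returned group would be a set of candidates that still needs a manual look at the data itself.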

amueller commented 5 years ago

Can someone explain what the BNG datasets are? Most of the rest of the 263 are BNG or Friedman datasets.

janvanrijn commented 5 years ago

BNG datasets: https://datascience.stackexchange.com/questions/26757/what-does-bng-stand-for/26964#26964

amueller commented 5 years ago

So is there a way to populate the description that you didn't enter when you uploaded them? ;) Can you add a new version?

janvanrijn commented 5 years ago

> So is there a way to populate the description that you didn't enter when you uploaded them?

Now I feel obliged to do so, at least for the set of BNG datasets.

> Can you add a new version?

Description can be updated by means of the wiki, so I presume no new version is needed.