naszilla / tabzilla

Which datasets are used for main paper (98 datasets) and "small data" (57 datasets) #100

Open amueller opened 8 months ago

amueller commented 8 months ago

Hi. I'm trying to compare to some of the results in your work, but it's not clear to me which datasets were used for Table 1 and Table 2. The Datasets A file contains 108 datasets, and the Datasets B file contains 69 datasets, so I'm not sure which ones the 98 are. Really, I care more about the 57 small datasets, but cutting off at those with 1250 or fewer instances doesn't yield 57 for either A or B or the combination.

amueller commented 8 months ago

The "easy_import" list seems to contain 175 classification tasks, 69 of which have fewer than 1250 instances.
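
For anyone repeating this count, here is a minimal sketch of the size filter. Note that "1250 or fewer" and "fewer than 1250" differ at the boundary, so the sketch exposes both. The function name `small_tasks`, the `inclusive` flag, and the instance counts are all hypothetical and only illustrate the comparison; they are not taken from the TabZilla code or data.

```python
from typing import Dict, List


def small_tasks(
    n_instances: Dict[str, int], cutoff: int = 1250, inclusive: bool = True
) -> List[str]:
    """Return the names of tasks at or below the cutoff (or strictly below it)."""
    if inclusive:
        return sorted(name for name, n in n_instances.items() if n <= cutoff)
    return sorted(name for name, n in n_instances.items() if n < cutoff)


# Hypothetical instance counts, for illustration only.
example_counts = {"task_a": 150, "task_b": 1250, "task_c": 1251, "task_d": 48842}

print(small_tasks(example_counts, inclusive=True))   # "1250 or fewer"
print(small_tasks(example_counts, inclusive=False))  # "fewer than 1250"
```

A task with exactly 1250 instances is counted by the first call and not by the second, which is one way two counts of "small" datasets can disagree.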

crwhite14 commented 8 months ago

Hi Andreas, below are the 98 datasets from Table 1 and the 57 datasets from Table 2. Please let us know if you have more questions.

datasets_table_1 = ['openml__visualizing_environmental__3602', 'openml__labor__4', 'openml__monks-problems-2__146065', 'openml__tic-tac-toe__49', 'openml__dermatology__35', 'openml__cardiotocography__9979', 'openml__lung-cancer__146024', 'openml__sonar__39', 'openml__anneal__2867', 'openml__analcatdata_chlamydia__3739', 'openml__iris__59', 'openml__irish__3543', 'openml__heart-c__48', 'openml__ionosphere__145984', 'openml__hayes-roth__146063', 'openml__fri_c3_100_5__3779', 'openml__fri_c0_100_5__3620', 'openml__analcatdata_authorship__3549', 'openml__rabe_266__3647', 'openml__balance-scale__11', 'openml__acute-inflammations__10089', 'openml__MiceProtein__146800', 'openml__banknote-authentication__10093', 'openml__mushroom__24', 'openml__kr-vs-kp__3', 'openml__analcatdata_boxing1__3540', 'openml__musk__3950', 'openml__transplant__3748', 'openml__cjs__14967', 'openml__synthetic_control__3512', 'openml__car-evaluation__146192', 'openml__fertility__9984', 'openml__postoperative-patient-data__146210', 'openml__breast-w__15', 'openml__wdbc__9946', 'openml__car__146821', 'openml__visualizing_livestock__3731', 'openml__mfeat-factors__12', 'openml__Satellite__167211', 'openml__colic__25', 'openml__lymph__10', 'openml__wall-robot-navigation__9960', 'openml__wilt__146820', 'openml__scene__3485', 'openml__mfeat-karhunen__16', 'openml__sick__3021', 'openml__dna__167140', 'openml__socmob__3797', 'openml__page-blocks__30', 'openml__PhishingWebsites__14952', 'openml__spambase__43', 'openml__splice__45', 'openml__churn__167141', 'openml__colic__27', 'openml__ecoli__145977', 'openml__semeion__9964', 'openml__ozone-level-8hr__9978', 'openml__heart-h__50', 'openml__pc1__3918', 'openml__qsar-biodeg__9957', 'openml__autos__9', 'openml__pc4__3902', 'openml__hill-valley__145847', 'openml__satimage__2074', 'openml__pc3__3903', 'openml__mfeat-fourier__14', 'openml__Australian__146818', 'openml__credit-approval__29', 'openml__cylinder-bands__14954', 'openml__mfeat-zernike__22', 
'openml__kc2__3913', 'openml__bank-marketing__14965', 'openml__phoneme__9952', 'openml__elevators__3711', 'openml__breast-cancer__145799', 'openml__SpeedDating__146607', 'openml__kc1__3917', 'openml__adult-census__3953', 'openml__ilpd__9971', 'openml__vehicle__53', 'openml__ada_agnostic__3896', 'openml__tae__47', 'openml__blood-transfusion-service-center__10101', 'openml__jasmine__168911', 'openml__LED-display-domain-7digit__125921', 'openml__diabetes__37', 'openml__Click_prediction_small__190408', 'openml__profb__3561', 'openml__steel-plates-fault__146817', 'openml__jm1__3904', 'openml__glass__40', 'openml__dresses-sales__125920', 'openml__mfeat-morphological__18', 'openml__eucalyptus__2079', 'openml__libras__360948', 'openml__yeast__145793', 'openml__cmc__23', 'openml__analcatdata_dmft__3560']

datasets_table_2 = ["openml__Australian__146818", "openml__LED-display-domain-7digit__125921", "openml__MiceProtein__146800", "openml__acute-inflammations__10089", "openml__analcatdata_authorship__3549", "openml__analcatdata_boxing1__3540", "openml__analcatdata_chlamydia__3739", "openml__analcatdata_dmft__3560", "openml__anneal__2867", "openml__autos__9", "openml__balance-scale__11", "openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836", "openml__breast-cancer__145799", "openml__breast-w__15", "openml__colic__25", "openml__colic__27", "openml__credit-approval__29", "openml__cylinder-bands__14954", "openml__dermatology__35", "openml__diabetes__37", "openml__dresses-sales__125920", "openml__ecoli__145977", "openml__eucalyptus__2079", "openml__fertility__9984", "openml__fri_c0_100_5__3620", "openml__fri_c3_100_5__3779", "openml__glass__40", "openml__hayes-roth__146063", "openml__heart-c__48", "openml__heart-h__50", "openml__hill-valley__145847", "openml__ilpd__9971", "openml__ionosphere__145984", "openml__iris__59", "openml__irish__3543", "openml__kc2__3913", "openml__labor__4", "openml__lung-cancer__146024", "openml__lymph__10", "openml__monks-problems-2__146065", "openml__pc1__3918", "openml__postoperative-patient-data__146210", "openml__profb__3561", "openml__qsar-biodeg__9957", "openml__rabe_266__3647", "openml__socmob__3797", "openml__sonar__39", "openml__synthetic_control__3512", "openml__tae__47", "openml__tic-tac-toe__49", "openml__transplant__3748", "openml__vehicle__53", "openml__visualizing_environmental__3602", "openml__visualizing_livestock__3731", "openml__wdbc__9946", "openml__yeast__145793"]

LennartPurucker commented 5 months ago

I just saw this issue. Are you aware that the datasets for Table 2 include a duplicate: "openml__blood-transfusion-service-center__10101" and "openml__blood-transfusion-service-center__145836"?

duncanmcelfresh commented 2 months ago

@LennartPurucker thanks for pointing this out (cc @crwhite14). We could remove the duplicate dataset from the results that include it.

It looks like we accidentally pulled two different OpenML tasks (https://openml.org/search?type=task&id=145836 and https://openml.org/search?type=task&id=10101) which appear to be identical because they are based on the same dataset (https://openml.org/search?type=data&id=1464).
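
A rough sketch of how such cases could be flagged automatically, assuming the identifiers follow the pattern `openml__<dataset name>__<task id>` (inferred from the lists in this thread, not from documented behavior). The helper names `split_identifier` and `candidate_duplicates` are hypothetical. A shared name only marks a candidate: two tasks with the same name may still be built on different datasets, so each hit would need to be checked against OpenML by hand.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_identifier(identifier: str) -> Tuple[str, int]:
    """Split 'openml__<name>__<task id>' into (name, task id)."""
    _prefix, rest = identifier.split("__", 1)
    # The name itself may contain single underscores, so split from the right.
    name, task_id = rest.rsplit("__", 1)
    return name, int(task_id)


def candidate_duplicates(identifiers: Iterable[str]) -> Dict[str, List[int]]:
    """Group task ids by dataset name and keep names that occur more than once."""
    by_name = defaultdict(list)
    for identifier in identifiers:
        name, task_id = split_identifier(identifier)
        by_name[name].append(task_id)
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}


# A few identifiers copied from the Table 2 list above.
sample = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__breast-w__15",
    "openml__iris__59",
]

print(candidate_duplicates(sample))
# {'blood-transfusion-service-center': [10101, 145836]}
```

Running the same function on the full `datasets_table_2` list would also surface any other names that appear with more than one task id, which can then be reviewed individually.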