
PGijsbers commented 5 years ago

I started sifting through the mldata datasets about a year ago but never had the time to finish. This is a dump of my progress.

I ignored all datasets for which all descriptive features match, as well as those that were not valid ARFF.

The following datasets have matching names, but differ in either instances, features, or missing values:

[x] datasets-numeric-autompg : [196, 831]
[x] datasets-numeric-sleep : [205, 739]
[x] letter : [6, 74, 247, 977, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386]
[x] regression-datasets-autompg : [196, 831]
[x] satimage : [182, 1183]
[x] shuttle : [40685]
[x] splice-ida : [1579]
[x] splice_scale : [1579]
[x] statlib-20050214-cars : [40700]
[x] statlib-20050214-hip : [490, 898]
[x] svmguide3 : [1589]
[x] uci-20070111-arrhythmia : [5, 1017]
[x] uci-20070111-autompg : [196, 831]
[x] uci-20070111-dermatology : [35, 129, 263, 1010]
[x] uci-20070111-hayes-roth_test : [329, 974]
[x] uci-20070111-kdd_el_nino-small: [839]
[x] uci-20070111-sleep : [205, 739]
[x] uci-20070111-spectf_test : [1181]
[x] uci-20070111-spectf_train : [1181]
[x] uci-20070111-spect_test : [1180]
[x] uci-20070111-spect_train : [1180]
[x] vowel : [307, 1016]

The following datasets have matching names, but differ in more than one way:

[x] breast-cancer : [13, 77, 1434, 23499]
[x] breast-cancer_scale : [13]
[x] cadata : [41156]
[x] cpusmall : [561, 796]
[x] cpusmall_scale : [561, 796]
[x] global-earthquakes : [209, 550, 772]
[x] image-ida : [40592]
[x] mauna-loa-atmospheric-co2 : [41187]
[x] mg : [1433, 1589]
[x] natural-scenes-data : [312, 40595]
[x] statlib-20050214-disclosure_z: [40713]
[x] statlib-20050214-papir_1 : [486, 487]

The following datasets do not have matching names, but have the same number of instances, features, and missing values:

[ ] australian : [40981]
[ ] australian_scale : [40981]
[ ] cadata : [537, 823]
[ ] cpusmall : [227, 562, 735]
[ ] cpusmall_scale : [227, 562, 735]
[ ] datasets-arie_ben_david-era : [1029]
[ ] datasets-arie_ben_david-lev : [1030]
[ ] datasets-arie_ben_david-swd : [593, 595, 606, 608, 623, 740, 751, 845, 910, 913]
[ ] datasets-numeric-autoprice : [195, 745]
[ ] datasets-numeric-housing : [531, 853]
[ ] datasets-numeric-mbagrade : [380]
[ ] diabetes-ida : [37]
[ ] drug-datasets-chang : [418]
[ ] drug-datasets-garrat : [413, 436]
[ ] drug-datasets-mtp : [405]
[ ] drug-datasets-penning : [404, 417, 423, 425]
[ ] drug-datasets-phen : [412]
[ ] drug-datasets-phenetyl1 : [419]
[ ] drug-datasets-qsabr1 : [440]
[ ] drug-datasets-qsabr2 : [441]
[ ] drug-datasets-rosowky : [413, 437]
[ ] drug-datasets-siddiqi : [436, 437]
[ ] drug-datasets-strupcz : [439]
[ ] drug-datasets-svensson : [404, 417, 425]
[ ] drug-datasets-tsutumi : [404, 423, 425]
[ ] drug-datasets-yokohoma1 : [417, 423, 425]
[ ] friedman-datasets-fri_c0_1000_10: [593, 606, 608, 623, 740, 751, 910, 913, 1028]
[ ] friedman-datasets-fri_c0_1000_25: [586, 589, 592, 620, 715, 723, 903, 917]
[ ] friedman-datasets-fri_c0_1000_5 : [599, 612, 628, 743, 813, 912]
[ ] friedman-datasets-fri_c0_1000_50: [583, 607, 618, 622, 797, 806, 837, 866]
[ ] friedman-datasets-fri_c0_100_10 : [585, 591, 634, 640, 762, 783, 789, 878]
[ ] friedman-datasets-fri_c0_100_25 : [625, 629, 639, 655, 768, 775, 812, 868]
[ ] friedman-datasets-fri_c0_100_5 : [594, 611, 656, 726, 829, 916, 1463]
[ ] friedman-datasets-fri_c0_100_50 : [587, 630, 636, 642, 716, 876, 922, 932]
[ ] friedman-datasets-fri_c0_250_10 : [602, 615, 647, 657, 793, 830, 863, 935]
[ ] friedman-datasets-fri_c0_250_25 : [605, 614, 644, 658, 746, 794, 832, 933]
[ ] friedman-datasets-fri_c0_250_5 : [596, 601, 613, 730, 744, 911]
[ ] friedman-datasets-fri_c0_250_50 : [619, 632, 638, 648, 769, 873, 877, 918]
[ ] friedman-datasets-fri_c0_500_10 : [604, 627, 641, 646, 824, 855, 869, 936]
[ ] friedman-datasets-fri_c0_500_25 : [581, 582, 584, 643, 779, 838, 879, 896]
[ ] friedman-datasets-fri_c0_500_5 : [597, 617, 631, 749, 792, 870]
[ ] friedman-datasets-fri_c0_500_50 : [616, 626, 637, 645, 766, 805, 920, 937]
[ ] friedman-datasets-fri_c1_1000_10: [595, 606, 608, 623, 740, 751, 845, 913, 1028]
[ ] friedman-datasets-fri_c1_1000_25: [586, 589, 592, 598, 715, 723, 849, 903]
[ ] friedman-datasets-fri_c1_1000_5 : [599, 609, 628, 799, 813, 912]
[ ] friedman-datasets-fri_c1_1000_50: [590, 607, 618, 622, 797, 806, 866, 904]
[ ] friedman-datasets-fri_c1_100_10 : [585, 621, 634, 640, 762, 783, 808, 878]
[ ] friedman-datasets-fri_c1_100_25 : [625, 639, 651, 655, 768, 775, 868, 889]
[ ] friedman-datasets-fri_c1_100_5 : [594, 611, 624, 726, 754, 916, 1463]
[ ] friedman-datasets-fri_c1_100_50 : [587, 600, 630, 642, 716, 850, 922, 932]
[ ] friedman-datasets-fri_c1_250_10 : [602, 615, 635, 657, 763, 793, 830, 863]
[ ] friedman-datasets-fri_c1_250_25 : [605, 644, 653, 658, 773, 794, 832, 933]
[ ] friedman-datasets-fri_c1_250_5 : [579, 596, 613, 744, 776, 911]
[ ] friedman-datasets-fri_c1_250_50 : [603, 619, 632, 638, 732, 873, 877, 918]
[ ] friedman-datasets-fri_c1_500_10 : [604, 627, 646, 654, 855, 869, 936, 943]
[ ] friedman-datasets-fri_c1_500_25 : [581, 584, 633, 643, 838, 879, 896, 926]
[ ] friedman-datasets-fri_c1_500_5 : [597, 617, 649, 749, 792, 884]
[ ] friedman-datasets-fri_c1_500_50 : [616, 626, 645, 650, 805, 888, 920, 937]
[ ] friedman-datasets-fri_c2_1000_10: [593, 595, 608, 623, 740, 751, 845, 910, 1028]
[ ] friedman-datasets-fri_c2_1000_25: [586, 592, 598, 620, 715, 723, 849, 917]
[ ] friedman-datasets-fri_c2_1000_5 : [609, 612, 628, 743, 799, 813]
[ ] friedman-datasets-fri_c2_1000_50: [583, 590, 607, 618, 797, 806, 837, 904]
[ ] friedman-datasets-fri_c2_100_10 : [585, 591, 621, 640, 783, 789, 808, 878]
[ ] friedman-datasets-fri_c2_100_25 : [625, 629, 639, 651, 768, 812, 868, 889]
[ ] friedman-datasets-fri_c2_100_5 : [611, 624, 656, 754, 829, 916, 1463]
[ ] friedman-datasets-fri_c2_100_50 : [587, 600, 636, 642, 716, 850, 876, 932]
[ ] friedman-datasets-fri_c2_250_10 : [602, 615, 635, 647, 763, 793, 863, 935]
[ ] friedman-datasets-fri_c2_250_25 : [614, 644, 653, 658, 746, 773, 832, 933]
[ ] friedman-datasets-fri_c2_250_5 : [579, 601, 613, 730, 744, 776]
[ ] friedman-datasets-fri_c2_250_50 : [603, 619, 632, 648, 732, 769, 873, 918]
[ ] friedman-datasets-fri_c2_500_10 : [604, 641, 646, 654, 824, 855, 936, 943]
[ ] friedman-datasets-fri_c2_500_25 : [581, 582, 584, 633, 779, 838, 896, 926]
[ ] friedman-datasets-fri_c2_500_5 : [617, 631, 649, 749, 870, 884]
[ ] friedman-datasets-fri_c2_500_50 : [616, 637, 645, 650, 766, 805, 888, 937]
[ ] friedman-datasets-fri_c3_1000_10: [593, 595, 606, 623, 751, 845, 910, 913, 1028]
[ ] friedman-datasets-fri_c3_1000_25: [589, 592, 598, 620, 723, 849, 903, 917]
[ ] friedman-datasets-fri_c3_1000_5 : [599, 609, 612, 743, 799, 912]
[ ] friedman-datasets-fri_c3_1000_50: [583, 590, 607, 622, 797, 837, 866, 904]
[ ] friedman-datasets-fri_c3_100_10 : [591, 621, 634, 640, 762, 789, 808, 878]
[ ] friedman-datasets-fri_c3_100_25 : [625, 629, 651, 655, 775, 812, 868, 889]
[ ] friedman-datasets-fri_c3_100_5 : [594, 624, 656, 726, 754, 829, 1463]
[ ] friedman-datasets-fri_c3_100_50 : [600, 630, 636, 642, 850, 876, 922, 932]
[ ] friedman-datasets-fri_c3_250_10 : [615, 635, 647, 657, 763, 830, 863, 935]
[ ] friedman-datasets-fri_c3_250_25 : [605, 614, 644, 653, 746, 773, 794, 933]
[ ] friedman-datasets-fri_c3_250_5 : [579, 596, 601, 730, 776, 911]
[ ] friedman-datasets-fri_c3_250_50 : [603, 619, 638, 648, 732, 769, 877, 918]
[ ] friedman-datasets-fri_c3_500_10 : [604, 627, 641, 654, 824, 855, 869, 943]
[ ] friedman-datasets-fri_c3_500_25 : [582, 584, 633, 643, 779, 838, 879, 926]
[ ] friedman-datasets-fri_c3_500_5 : [597, 631, 649, 792, 870, 884]
[ ] friedman-datasets-fri_c3_500_50 : [616, 626, 637, 650, 766, 805, 888, 920]
[ ] friedman-datasets-fri_c4_1000_10: [593, 595, 606, 608, 740, 845, 910, 913, 1028]
[ ] friedman-datasets-fri_c4_1000_25: [586, 589, 598, 620, 715, 849, 903, 917]
[ ] friedman-datasets-fri_c4_1000_50: [583, 590, 618, 622, 806, 837, 866, 904]
[ ] friedman-datasets-fri_c4_100_10 : [585, 591, 621, 634, 762, 783, 789, 808]
[ ] friedman-datasets-fri_c4_100_25 : [629, 639, 651, 655, 768, 775, 812, 889]
[ ] friedman-datasets-fri_c4_100_50 : [587, 600, 630, 636, 716, 850, 876, 922]
[ ] friedman-datasets-fri_c4_250_10 : [602, 635, 647, 657, 763, 793, 830, 935]
[ ] friedman-datasets-fri_c4_250_25 : [605, 614, 653, 658, 746, 773, 794, 832]
[ ] friedman-datasets-fri_c4_250_50 : [603, 632, 638, 648, 732, 769, 873, 877]
[ ] friedman-datasets-fri_c4_500_10 : [627, 641, 646, 654, 824, 869, 936, 943]
[ ] friedman-datasets-fri_c4_500_25 : [581, 582, 633, 643, 779, 879, 896, 926]
[ ] friedman-datasets-fri_c4_500_50 : [626, 637, 645, 650, 766, 888, 920, 937]
[ ] german-ida : [31, 1547]
[ ] germannumer : [1436, 1572]
[ ] germannumer_scale : [1436, 1572]
[ ] heart_scale : [53]
[ ] housing : [531, 853]
[ ] housing_scale : [531, 853]
[ ] iris : [1099, 1413]
[ ] mpg : [40700]
[ ] mpg_scale : [40700]
[ ] regression-datasets-2dplanes : [344, 564, 881, 901]
[ ] regression-datasets-ailerons : [296]
[ ] regression-datasets-auto_price : [207, 756]
[ ] regression-datasets-bank32nh : [308, 752]
[ ] regression-datasets-bank8fm : [189, 225, 807, 816]
[ ] regression-datasets-cal_housing : [537, 823]
[ ] regression-datasets-fried : [215, 344, 727, 881]
[ ] regression-datasets-housing : [531, 853]
[ ] regression-datasets-kin8nm : [225, 572, 725, 816]
[ ] regression-datasets-puma32h : [558, 833]
[ ] regression-datasets-puma8nh : [189, 572, 725, 807]
[ ] ringnorm-ida : [1507]
[ ] statlib-20050214-chatfield_4 : [695, 820]
[ ] statlib-20050214-chscase_census3: [670, 671, 672, 673, 906, 907, 908, 909]
[ ] statlib-20050214-chscase_census5: [670, 671, 672, 673, 906, 907, 908, 909]
[ ] statlib-20050214-chscase_geyser1: [712, 895]
[ ] statlib-20050214-csb_ch2 : [668, 692, 787, 874, 1096]
[ ] statlib-20050214-diggle_table_a1: [485, 693, 817, 835]
[ ] statlib-20050214-diggle_table_a2: [694, 818]
[ ] statlib-20050214-disclosure_z : [676, 699, 704, 709, 774, 795, 827, 931]
[ ] statlib-20050214-hutsof99_logis : [681, 804]
[ ] statlib-20050214-no2 : [522, 750, 40496]
[ ] statlib-20050214-pm10 : [547, 886, 40496]
[ ] statlib-20050214-prnn_synth : [464]
[ ] statlib-20050214-rabe_131 : [668, 692, 787, 874, 1096]
[ ] statlib-20050214-rabe_148 : [710, 894]
[ ] statlib-20050214-rabe_166 : [684, 919]
[ ] statlib-20050214-rabe_176 : [698, 929]
[ ] statlib-20050214-rabe_265 : [660, 780]
[ ] statlib-20050214-rabe_266 : [663, 782]
[ ] statlib-20050214-rabe_97 : [697, 928]
[ ] statlib-20050214-sleuth_case1202: [706, 891]
[ ] statlib-20050214-sleuth_case1501: [711, 946]
[ ] statlib-20050214-sleuth_case2002: [665, 902]
[ ] statlib-20050214-sleuth_ex1605 : [687, 755]
[ ] statlib-20050214-sleuth_ex1714 : [659, 777]
[ ] statlib-20050214-sleuth_ex2012 : [663, 782]
[ ] statlib-20050214-sleuth_ex2015 : [683, 864]
[ ] statlib-20050214-sleuth_ex2016 : [682, 862]
[ ] thyroid-ida : [40682]
[ ] twonorm-ida : [1496]
[ ] uci-20070111-2dplanes : [344, 564, 881, 901]
[ ] uci-20070111-ailerons : [296]
[ ] uci-20070111-autoprice : [195, 745]
[ ] uci-20070111-auto_price : [207, 756]
[ ] uci-20070111-bank32nh : [308, 752]
[ ] uci-20070111-bank8fm : [189, 225, 807, 816]
[ ] uci-20070111-cal_housing : [537, 823]
[ ] uci-20070111-fried : [215, 344, 727, 881]
[ ] uci-20070111-housing : [531, 853]
[ ] uci-20070111-kin8nm : [225, 572, 725, 816]
[ ] uci-20070111-mbagrade : [380]
[ ] uci-20070111-puma32h : [558, 833]
[ ] uci-20070111-puma8nh : [189, 572, 725, 807]
[ ] usps : [41082]
[ ] waveform-ida : [4551]

The following datasets do not match any of the above criteria:

None
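
The groupings above can be read as a classification of each (mldata, OpenML) pair by name and by three descriptive counts. A minimal sketch of that logic follows. The thread does not say how names were compared or which tooling was used, so `Meta`, `names_match`, and `classify` are hypothetical, and the case-insensitive equality is only a placeholder for whatever matching was actually done.

```python
from typing import NamedTuple


class Meta(NamedTuple):
    """Descriptive features of one dataset (hypothetical container)."""
    name: str
    instances: int
    features: int
    missing: int


def names_match(a: str, b: str) -> bool:
    # Placeholder assumption: case-insensitive equality.
    return a.lower() == b.lower()


def differing_fields(a: Meta, b: Meta) -> list:
    """Names of the descriptive counts on which the two datasets disagree."""
    fields = ("instances", "features", "missing")
    return [f for f in fields if getattr(a, f) != getattr(b, f)]


def classify(mldata: Meta, openml: Meta) -> str:
    """Assign one (mldata, OpenML) pair to one of the groups in the lists."""
    same_name = names_match(mldata.name, openml.name)
    diffs = differing_fields(mldata, openml)
    if same_name and not diffs:
        return "full match (ignored)"
    if same_name and len(diffs) == 1:
        return "name matches, one difference"
    if same_name:
        return "name matches, several differences"
    if not diffs:
        return "name differs, counts match"
    return "no match"


# Toy values, not taken from any real dataset.
print(classify(Meta("toy", 100, 5, 0), Meta("Toy", 120, 5, 0)))
```

Because the classification is per pair, one mldata dataset can appear in more than one list, as cadata and cpusmall do above.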

PGijsbers commented 5 years ago

datasets-numeric-autompg -> 196. The file has missing values, but they were not recognized as such (-MAX_INT).
datasets-numeric-sleep -> 205. The file has missing values, but they were not recognized as such (-999).
letter -> Not sure. Somehow this dataset has 35k instances (the original, /d/6, has 20k). All OpenML versions have either 20k or 1M samples.
regression-datasets-autompg -> Same as datasets-numeric-autompg.
satimage -> Looks like /d/182 but contains 4440 more instances.
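
To illustrate the "missing values not recognized" cases: if a file encodes missing entries with a sentinel such as -999, a missing-value count only agrees with OpenML once the sentinel is mapped to a real missing marker. This is a rough sketch under that assumption. `mark_missing` is a hypothetical helper, and the numeric value behind -MAX_INT is not given in the thread, so it would have to be supplied by the reader.

```python
import math


def mark_missing(rows, sentinels):
    """Replace sentinel values with NaN; return (cleaned rows, missing count)."""
    cleaned = []
    missing = 0
    for row in rows:
        new_row = []
        for value in row:
            if value in sentinels:
                # The sentinel stands for "no value", so store NaN instead.
                new_row.append(math.nan)
                missing += 1
            else:
                new_row.append(value)
        cleaned.append(new_row)
    return cleaned, missing


# Toy rows, not taken from the actual sleep dataset.
rows = [[1.0, -999], [2.0, 3.0]]
cleaned, n_missing = mark_missing(rows, sentinels={-999})
print(n_missing)
```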

PGijsbers commented 5 years ago

shuttle -> 43.5k more instances; the OpenML version matches this reference to the original.
splice-ida -> Based on this, it seems to be the train+test set plus 184 instances that are unaccounted for.
splice_scale -> By the same source, this seems to be just the train set, rescaled. Note: the OpenML dataset has its attributes listed as numeric, though they seem to be categorical in nature (DNA sequences of ACTG).
statlib-20050214-cars -> See next comment.
statlib-20050214-hip -> Match; missing values misread.
svmguide3 -> The difference of 41 instances is explained by the inclusion of the test set (OpenML has only the train set).
uci-20070111-arrhythmia -> Match; missing values misread.
uci-20070111-autompg -> Match; missing values misread.
uci-20070111-dermatology -> Match; missing values misread.
uci-20070111-hayes-roth_test -> Just the test set of the dataset that can be found merged in /d/329.
uci-20070111-kdd_el_nino-small -> Match; missing values misread.
uci-20070111-sleep -> Match; missing values misread.
uci-20070111-spectf -> No match. /d/1181 is an artificial dataset based on this one, but the original is not on OpenML. The train and test sets are in separate files here.
uci-20070111-spect -> No match. /d/1180 is an artificial dataset based on this one, but the original is not on OpenML. The train and test sets are in separate files here.
vowel -> /d/307 with the speaker and sex attributes dropped.

PGijsbers commented 5 years ago

https://gist.github.com/PGijsbers/d6d262241587a59af2d9715f06185dee

amueller commented 5 years ago

Thanks for looking into this. I somehow thought there were wayyy more datasets on mldata. I guess not all were available as arff?

prabhant commented 4 years ago

Update on this issue: most of the ARFF datasets that are on mldata but not on OpenML are broken ARFF files. After fixing them manually, we can upload them to OpenML.
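
The comment does not say in what way the files are broken. As one illustration of the kind of structural check that can flag a file before manual repair, here is a deliberately simplistic sketch for dense ARFF text. `arff_problems` is a hypothetical helper: it ignores quoted commas and skips sparse `{...}` rows, so it is not a full ARFF parser.

```python
def arff_problems(text):
    """Return a list of structural problems found in dense ARFF text."""
    problems = []
    n_attributes = 0
    seen_relation = False
    in_data = False
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # blank lines and comments carry no structure
        lower = line.lower()
        if not in_data:
            if lower.startswith("@relation"):
                seen_relation = True
            elif lower.startswith("@attribute"):
                n_attributes += 1
            elif lower.startswith("@data"):
                in_data = True
            else:
                problems.append(f"line {lineno}: unexpected header line")
        elif line.startswith("{"):
            continue  # sparse rows are not checked in this sketch
        else:
            # Naive split: assumes no commas inside quoted values.
            n_fields = len(line.split(","))
            if n_fields != n_attributes:
                problems.append(
                    f"line {lineno}: {n_fields} fields, expected {n_attributes}"
                )
    if not seen_relation:
        problems.append("missing @relation")
    if n_attributes == 0:
        problems.append("no @attribute declarations")
    if not in_data:
        problems.append("missing @data section")
    return problems
```

An empty result only means the file passed these few checks, not that it is valid ARFF.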

amueller commented 4 years ago

What about the datasets that are not in ARFF?

prabhant commented 4 years ago

We don't have them right now.

openml / openml-data

mldata #22