Closed janvanrijn closed 6 years ago
Yes. I'll look at importing the mldata datasets asap. I'll leave them in_preparation and we can check how suitable they are.
On Fri, 22 Dec 2017 at 10:22 janvanrijn notifications@github.com wrote:
Try to get more datasets! (Would be good to keep the 'OML100' principle) For example, @joaquinvanschoren https://github.com/joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13 https://github.com/openml/OpenMLFirstBenchmarkSuite/issues/13
-- Thank you, Joaquin
@joaquinvanschoren What is the status of this?
I've tried the mldata API. It allows me to get datasets, but only if I already know the name of the dataset, and they have no listing call. They also don't return the dataset description through the API.
I have now contacted them to ask for an export of their database table. Still waiting on that.
The query below produces the list of datasets in the next post (note the slightly different minority class rule):
FROM dataset d, data_quality inst, data_quality att, data_quality attnom, data_quality attnum, data_quality cl, data_quality minc, data_quality maxc
WHERE inst.data = d.did
AND inst.quality = "NumberOfInstances"
AND att.data = d.did
AND att.quality = "NumberOfFeatures"
AND attnom.data = d.did
AND attnom.quality = "NumberOfSymbolicFeatures"
AND attnum.data = d.did
AND attnum.quality = "NumberOfNumericFeatures"
AND cl.data = d.did
AND cl.quality = "NumberOfClasses"
AND minc.data = d.did
AND minc.quality = "MinorityClassSize"
AND maxc.data = d.did
AND maxc.quality = "MajorityClassSize"
AND inst.value > 500
AND att.value < 5000
AND minc.value > 10
AND d.name NOT LIKE "BNG(%"
AND d.name NOT LIKE "SEA(%"
AND d.name NOT LIKE "fri_%"
AND d.name NOT LIKE "QSAR%"
GROUP BY d.name
LIMIT 5000
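For readers who want to apply the same thresholds outside the database, here is a minimal Python sketch of the filter. It assumes each dataset is a plain dict keyed by the quality names used in the query; the `is_candidate` helper and the toy records are hypothetical and only illustrate the rule. One caveat: in SQL `LIKE`, an underscore matches any single character, so `fri_%` is slightly broader than the literal `fri_` prefix used here.

```python
# Prefixes excluded by the NOT LIKE clauses in the query above.
# Assumption: literal prefix matching. SQL LIKE treats "_" as a
# single-character wildcard, so "fri_%" is slightly broader there.
EXCLUDED_PREFIXES = ("BNG(", "SEA(", "fri_", "QSAR")


def is_candidate(ds):
    """Return True if a dataset record passes the thresholds of the query."""
    return (
        ds["NumberOfInstances"] > 500
        and ds["NumberOfFeatures"] < 5000
        and ds["MinorityClassSize"] > 10
        and not ds["name"].startswith(EXCLUDED_PREFIXES)
    )


# Toy records with made-up quality values, for illustration only.
toy = [
    {"name": "toy_ok", "NumberOfInstances": 1000,
     "NumberOfFeatures": 20, "MinorityClassSize": 100},
    {"name": "BNG(toy)", "NumberOfInstances": 100000,
     "NumberOfFeatures": 20, "MinorityClassSize": 5000},
    {"name": "toy_small", "NumberOfInstances": 500,
     "NumberOfFeatures": 20, "MinorityClassSize": 100},
    {"name": "toy_rare", "NumberOfInstances": 1000,
     "NumberOfFeatures": 20, "MinorityClassSize": 10},
]

candidates = [ds["name"] for ds in toy if is_candidate(ds)]
print(candidates)
```

Only `toy_ok` survives: the others fail on the name prefix, the strict `> 500` instance bound, and the `> 10` minority class bound respectively.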
(list in next post)
Length (~400)
40517 20_newsgroups.drift
727 2dplanes
1557 abalone
40993 ada_agnostic
1037 ada_prior
1590 adult
1119 adult-census
40657 agaricus-lepiota
1235 Agrawal1
734 ailerons
1169 airlines
1240 AirlinesCodrnaAdult
40707 allbp
40708 allrep
40906 Aloi
4135 Amazon_employee_access
970 analcatdata_authorship
1014 analcatdata_dmft
966 analcatdata_halloffame
728 analcatdata_supreme
989 anneal
40886 Annthyroid
40889 AnomalyData_10percent
40891 AnomalyData_10percent_hd
40887 AnomalyData_5percent
40888 AnomalyData_5percent_hd
40795 ant-1.7
949 arsenic-female-bladder
950 arsenic-female-lung
947 arsenic-male-bladder
951 arsenic-male-lung
40892 Artificial_unsupervised
1459 artificial-characters
40981 Australian
1547 autoUniv-au1-1000
1548 autoUniv-au4-2500
1555 autoUniv-au6-1000
1549 autoUniv-au6-750
1552 autoUniv-au7-1100
1553 autoUniv-au7-700
997 balance-scale
914 balloon
40905 banana
1558 bank-marketing
833 bank32nh
725 bank8FM
40958 Bankdata
1462 banknote-authentication
185 baseball
40976 Bike
4134 Bioresponse
40588 birds
1464 blood-transfusion-service-center
872 boston
825 boston_corrected
40698 breast
15 breast-w
822 cal_housing
40802 camel-1.2
40803 camel-1.4
40804 camel-1.6
40975 car
40664 car-evaluation
40764 Car-test
40763 Car-train
1560 cardiotocography
4535 Census-Income
1118 chess
40701 churn
40927 CIFAR_10
40926 CIFAR_10_small
23380 cjs
40666 clean2
1227 Click_prediction_small
40920 Climate
40995 climate-model-simulation-chrashes
40994 climate-model-simulation-crashes
40805 CM1
983 cmc
1468 cnae-9
351 codrna
1241 codrnaNorm
897 colleges_aaup
930 colleges_usnews
40668 connect-4
40766 Convex-test
40765 Convex-train
1596 covertype
23386 CoverType5percent
761 cpu_act
735 cpu_small
29 credit-approval
31 credit-g
1597 creditcard
4154 CreditCardSubset
40712 crx
6332 cylinder-bands
803 delta_ailerons
819 delta_elevators
40921 Devnagari_Script_Dataset
did name
40923 Devnagari-Script
41008 diabetes
4541 Diabetes130US
40713 dis
774 disclosure_x_bias
827 disclosure_x_noise
795 disclosure_x_tampered
931 disclosure_z
489 dj30-1985-2003
40670 dna
1471 eeg-eye-state
151 electricity
846 elevators
40589 emotions
40590 enron
40865 epilobee_mortality
990 eucalyptus
1044 eye_movements
23518 F2Pmodel3smoted
40479 F2pmodel4scaledsmoted
23519 F2Pmodel4smoted
40996 Fashion-MNIST
389 fbis.wc
40967 feedback
40969 feedback_1
1475 first-order-theorem-proving
40702 flare
40904 Forest_Cover
40695 GAMETES_Epistasis_2-Way_1000atts_0.4H_EDM-1_EDM-1_...
40696 GAMETES_Epistasis_2-Way_20atts_0.1H_EDM-1_1
40697 GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1
40654 GAMETES_Epistasis_3-Way_20atts_0.2H_EDM-1_1
40655 GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_E...
40656 GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_E...
1476 gas-drift
1477 gas-drift-different-concentrations
40591 genbase
40771 GermanCredit-train
4538 GesturePhaseSegmentationProcessed
1038 gina_agnostic
1042 gina_prior
1041 gina_prior2
1478 har
23512 higgs
1566 hill-valley
1039 hiva_agnostic
41010 Homicide
40864 Honey_bee_Seasonal_mortality
821 house_16H
843 house_8L
823 houses
853 housing
40895 Http_lowdim
152 Hyperplane_10_1E-3
153 Hyperplane_10_1E-4
40676 hypothyroid
1480 ilpd
40592 image
273 IMDB.drama
40978 Internet-Advertisements
382 ipums_la_97-small
381 ipums_la_98-small
378 ipums_la_99-small
300 isolet
375 JapaneseVowels
40816 jm1
41001 jungle_chess_2pcs_endgame_complete
40999 jungle_chess_2pcs_endgame_elephant_elephant
41004 jungle_chess_2pcs_endgame_lion_elephant
41007 jungle_chess_2pcs_endgame_lion_lion
41000 jungle_chess_2pcs_endgame_panther_elephant
40998 jungle_chess_2pcs_endgame_panther_lion
41003 jungle_chess_2pcs_endgame_rat_elephant
41006 jungle_chess_2pcs_endgame_rat_lion
41002 jungle_chess_2pcs_endgame_rat_panther
41005 jungle_chess_2pcs_endgame_rat_rat
40817 kc1
1063 kc2
839 kdd_el_nino-small
981 kdd_internet_usage
993 kdd_ipums_la_97-small
1002 kdd_ipums_la_98-small
1018 kdd_ipums_la_99-small
976 kdd_JapaneseVowels
1004 kdd_synthetic_control
1111 KDDCup09_appetency
1112 KDDCup09_churn
1114 KDDCup09_upselling
807 kin8nm
1481 kr-vs-k
3 kr-vs-kp
40776 KR-vs-KP-test
40775 KR-vs-KP-train
184 kropt
40593 langLog
1483 ldpa
154 LED(50000)
40677 led24
40678 led7
977 letter
1223 letter-challenge-labeled.arff
1222 letter-challenge-unlabeled.arff
1485 madelon
40778 Madelon-test
40777 Madelon-train
40679 magic
1120 MagicTelescope
40907 mammography
40827 mc1
757 meta
978 mfeat-factors
971 mfeat-fourier
1020 mfeat-karhunen
962 mfeat-morphological
40979 mfeat-pixel
995 mfeat-zernike
40966 MiceProtein
1515 micro-mass
40754 ML2017-challenge-1
40755 ML2017-challenge-2
40756 ML2017-challenge-3
554 mnist_784
40493 Model4scaledsmoted
40680 mofn-3-7-10
333 monks-problems-1
334 monks-problems-2
335 monks-problems-3
40829 mozilla4
40897 Mulcross
809 mushroom
1116 musk
881 mv
4140 NELL
1486 nomao
41012 NPSdecay
23517 numerai28.6
1568 nursery
392 oh0.wc
401 oh10.wc
386 oh15.wc
394 oh5.wc
311 oil_spill
1491 one-hundred-plants-margin
1492 one-hundred-plants-shape
1493 one-hundred-plants-texture
980 optdigits
40735 ozone_level
1487 ozone-level-8hr
1021 page-blocks
40706 parity5_plus_5
40869 pathogen_survey_dataset
802 pbcseq
40831 pc1
40832 pc2
40833 pc3
40873 pc4
40834 PC5
40908 Pen_global
1019 pendigits
4534 PhishingWebsites
1489 phoneme
4709 Physical_Activity_Recognition_Dataset_Using_Five_S...
4675 Physical_Activity_Recognition_Dataset_Using_Smartp...
1451 PieChart1
1452 PieChart2
1453 PieChart3
1454 PieChart4
40715 pima
1443 PizzaCutter1
1444 PizzaCutter3
4546 Plants
354 poker
1569 poker-hand
722 pol
871 pollen
40725 PredictItTesting
470 profb
40839 prop-1
40840 prop-2
40841 prop-3
40842 prop-4
40843 prop-5
40844 prop-6
752 puma32H
816 puma8NH
948 quake
156 RandomRBF_0_0
157 RandomRBF_10_1E-3
158 RandomRBF_10_1E-4
159 RandomRBF_50_1E-3
160 RandomRBF_50_1E-4
391 re0.wc
40594 reuters
40748 Reuters-Corn
40747 Reuters-Grain
1496 ringnorm
717 rmftsa_ladata
741 rmftsa_sleepdata
40922 Run_or_walk_information
23508 S1S1_1s50_31features_7tasks
40900 Satellite
40734 satellite_image
182 satimage
40595 scene
40779 Secom-train
40984 segment
40878 seismic-bumps
1501 semeion
40781 Semeion-train
23383 SensorDataResource
826 sensory
40902 Shuttle_2percent
40901 Shuttle_7percent
38 sick
1502 skin-segmentation
40596 slashdot
40903 smtp
934 socmob
40687 solar-flare_2
1023 soybean
737 space_ga
44 spambase
954 spectrometer
40910 Speech
40536 SpeedDating
953 splice
1503 spoken-arabic-digit
1236 Stagger1
1237 Stagger2
1238 Stagger3
40982 steel-plates-fault
841 stock
770 strikes
40992 sylva_agnostic
1040 sylva_prior
377 synthetic_control
40985 tamilnadu-electricity
40912 Teste_1
40913 Teste_2
40499 texture
40690 threeOf9
314 thyroid_sick
40474 thyroid-allbp
40475 thyroid-allhyper
40476 thyroid-allhypo
40477 thyroid-allrep
40497 thyroid-ann
40478 thyroid-dis
50 tic-tac-toe
40704 titanic
40705 tokyo1
40850 tomcat
1507 twonorm
373 UNIX_user_data
40917 USvid
994 vehicle
357 vehicle_sensIT
1242 vehicleNorm
923 visualizing_soil
1527 volcanoes-a1
1528 volcanoes-a2
1529 volcanoes-a3
1530 volcanoes-a4
1531 volcanoes-b1
1532 volcanoes-b2
1533 volcanoes-b3
1534 volcanoes-b4
1535 volcanoes-b5
1536 volcanoes-b6
1537 volcanoes-c1
1538 volcanoes-d1
1539 volcanoes-d2
1540 volcanoes-d3
1541 volcanoes-d4
1016 vowel
1509 walking-activity
1526 wall-robot-navigation
940 water-treatment
979 waveform-5000
40784 Waveform-test
40783 Waveform-train
1510 wdbc
350 webdata_wXa
4139 Wikidata
40983 wilt
847 wind
40854 xalan-2.4
40855 xalan-2.5
40856 xalan-2.6
40858 xalan-2.8
40693 xd6
40861 xerces-1.4
40733 yeast
316 yeast_ml8
For future reference, a slightly better query that results in 300 potential datasets:
SELECT MAX(d.did) AS did, d.name
FROM dataset d, data_quality inst, data_quality att, data_quality attnom, data_quality attnum, data_quality cl, data_quality minc, data_quality maxc
WHERE inst.data = d.did
AND inst.quality = "NumberOfInstances"
AND att.data = d.did
AND att.quality = "NumberOfFeatures"
AND attnom.data = d.did
AND attnom.quality = "NumberOfSymbolicFeatures"
AND attnum.data = d.did
AND attnum.quality = "NumberOfNumericFeatures"
AND cl.data = d.did
AND cl.quality = "NumberOfClasses"
AND minc.data = d.did
AND minc.quality = "MinorityClassSize"
AND maxc.data = d.did
AND maxc.quality = "MajorityClassSize"
AND inst.value >= 500
AND att.value < 5000
AND minc.value > 10
AND d.name NOT LIKE "BNG(%"
AND d.name NOT LIKE "SEA(%"
AND d.name NOT LIKE "fri_%"
AND d.name NOT LIKE "QSAR%"
AND d.name NOT IN (SELECT name FROM dataset, dataset_tag WHERE dataset.did = dataset_tag.id AND dataset_tag.tag = "OpenML100")
GROUP BY d.name
LIMIT 5000
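The query relies on joining `data_quality` to `dataset` once per quality, because each quality is stored as its own row. The sketch below reproduces that pattern on an in-memory SQLite database so it can be tried without access to the real one. The table and column names are taken from the query; the column types, the `add_dataset` helper, and all rows are assumptions made for illustration, and only three of the seven quality joins are kept for brevity. SQLite stands in for the actual database here, and string literals use single quotes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the columns referenced by the query in the thread.
conn.executescript("""
CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE data_quality (data INTEGER, quality TEXT, value REAL);
CREATE TABLE dataset_tag (id INTEGER, tag TEXT);
""")


def add_dataset(did, name, instances, features, minority):
    """Hypothetical helper: insert one dataset and three of its qualities."""
    conn.execute("INSERT INTO dataset VALUES (?, ?)", (did, name))
    conn.executemany(
        "INSERT INTO data_quality VALUES (?, ?, ?)",
        [(did, "NumberOfInstances", instances),
         (did, "NumberOfFeatures", features),
         (did, "MinorityClassSize", minority)],
    )


# Toy rows with made-up values.
add_dataset(1, "toy_a", 600, 10, 50)         # older version of toy_a
add_dataset(2, "toy_a", 600, 10, 50)         # newer version, same name
add_dataset(3, "toy_small", 499, 10, 50)     # fails inst.value >= 500
add_dataset(4, "toy_tagged", 1000, 10, 50)   # already tagged OpenML100
add_dataset(5, "BNG(toy)", 100000, 10, 500)  # excluded by name pattern
conn.execute("INSERT INTO dataset_tag VALUES (4, 'OpenML100')")

# Reduced version of the query: one data_quality alias per quality.
query = """
SELECT MAX(d.did) AS did, d.name
FROM dataset d, data_quality inst, data_quality att, data_quality minc
WHERE inst.data = d.did AND inst.quality = 'NumberOfInstances'
  AND att.data = d.did AND att.quality = 'NumberOfFeatures'
  AND minc.data = d.did AND minc.quality = 'MinorityClassSize'
  AND inst.value >= 500
  AND att.value < 5000
  AND minc.value > 10
  AND d.name NOT LIKE 'BNG(%'
  AND d.name NOT IN (
      SELECT name FROM dataset, dataset_tag
      WHERE dataset.did = dataset_tag.id AND dataset_tag.tag = 'OpenML100')
GROUP BY d.name
ORDER BY d.name
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(2, 'toy_a')]
```

`GROUP BY d.name` together with `MAX(d.did)` keeps one row per dataset name and reports the highest dataset id among the versions that pass the filters, which is why `toy_a` appears once with id 2.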
Going through the list chronologically, I have found several datasets that could potentially be included:
This is as far as I got for now; please let me know if some of these are useful :)
Some more comments on those datasets:
I think you actually meant @PGijsbers ?
Discussed with @giuseppec and @joaquinvanschoren: we will only use datasets uploaded up to this moment, plus datasets uploaded afterwards that are fixed versions of earlier datasets.
I actually ran into most of these while working on the automated script from #17. You can check the results to explain why these are in/out:
clean2 -> derived from musk
codrna -> Too large
covertype -> Too large
crx -> deactivated (duplicate of credit-approval)
Devnagari-Script -> OK!
Diabetes130US -> Slightly too large
dis -> Extreme imbalance
dj30-1985-2003 -> Slightly too large (also still in_preparation)
dna -> OK, but no description. Needs digging.
eye_movements -> grouped data, needs special data splits
F2Pmodel3smoted, F2pmodel4scaledsmoted, F2Pmodel4smoted -> private datasets from a student of mine. Data from a company, likely not public anytime soon.
Fashion-MNIST -> OK! (student of mine uploaded it)
fbis.wc -> SparseARFF format (apparently we decided not to include these a while ago)
flare -> Duplicate of solar-flare
This is now governed by the notebook for study generation.
Try to get more datasets! (Would be good to keep the 'OML100' principle) For example, @joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13