More datasets - Githubissues

janvanrijn commented 6 years ago

Try to get more datasets! (It would be good to keep the 'OML100' principle.) For example, @joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13

joaquinvanschoren commented 6 years ago

Yes. I'll look at importing the mldata datasets asap. I'll leave them in_preparation and we can check how suitable they are.

On Fri, 22 Dec 2017 at 10:22 janvanrijn notifications@github.com wrote:

Try to get more datasets! (Would be good to keep the 'OML100' principle) For example, @joaquinvanschoren https://github.com/joaquinvanschoren mentioned that he has several datasets that comply with the requirements. Related to #13 https://github.com/openml/OpenMLFirstBenchmarkSuite/issues/13

-- Thank you, Joaquin

janvanrijn commented 6 years ago

@joaquinvanschoren What is the status of this?

joaquinvanschoren commented 6 years ago

I've tried the mldata API. It allows me to get datasets, but only if I already know the name of the dataset, and they have no listing call. They also don't return the dataset description through the API.

I have now contacted them to see if I can get an export of their database table. Still waiting on that.

janvanrijn commented 6 years ago

The following query yields the list of datasets in the next post (note the slightly different minority class rule):

SELECT d.did, d.name
FROM dataset d, data_quality inst, data_quality att, data_quality attnom, data_quality attnum, data_quality cl, data_quality minc, data_quality maxc
WHERE inst.data = d.did
AND inst.quality = "NumberOfInstances"
AND att.data = d.did
AND att.quality = "NumberOfFeatures"
AND attnom.data = d.did
AND attnom.quality = "NumberOfSymbolicFeatures"
AND attnum.data = d.did
AND attnum.quality = "NumberOfNumericFeatures"
AND cl.data = d.did
AND cl.quality = "NumberOfClasses"
AND minc.data = d.did
AND minc.quality = "MinorityClassSize"
AND maxc.data = d.did
AND maxc.quality = "MajorityClassSize"
AND inst.value > 500
AND att.value < 5000
AND minc.value > 10
AND d.name NOT LIKE "BNG(%"
AND d.name NOT LIKE "SEA(%"
AND d.name NOT LIKE "fri_%"
AND d.name NOT LIKE "QSAR%"
GROUP BY d.name
LIMIT 5000

(list in next post)
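
To make the filter above easier to try out, here is a minimal sketch that runs a comparable query against a toy in-memory SQLite database. The two-table layout (`dataset`, `data_quality`) is inferred from the query itself; the column types, the toy rows, and the helper names `build_toy_db` and `candidate_datasets` are assumptions made for illustration, and SQLite stands in for whatever database the real server uses. The sketch pivots the qualities with conditional aggregation instead of one self-join per quality. Unlike the original, it groups per dataset id (so versions with the same name are not collapsed) and it does not require the four unfiltered qualities to be present.

```python
import sqlite3

# Hypothetical schema inferred from the query in this thread; the real
# OpenML database may use different column types.
SCHEMA = """
CREATE TABLE dataset (did INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE data_quality (data INTEGER, quality TEXT, value REAL);
"""

# Same thresholds and name exclusions as the query above. Note that "_"
# in a LIKE pattern matches any single character, as in the original.
FILTER_QUERY = """
SELECT d.did, d.name
FROM dataset d
JOIN data_quality q ON q.data = d.did
WHERE d.name NOT LIKE 'BNG(%'
  AND d.name NOT LIKE 'SEA(%'
  AND d.name NOT LIKE 'fri_%'
  AND d.name NOT LIKE 'QSAR%'
GROUP BY d.did, d.name
HAVING MAX(CASE WHEN q.quality = 'NumberOfInstances' THEN q.value END) > 500
   AND MAX(CASE WHEN q.quality = 'NumberOfFeatures' THEN q.value END) < 5000
   AND MAX(CASE WHEN q.quality = 'MinorityClassSize' THEN q.value END) > 10
ORDER BY d.name
"""


def candidate_datasets(conn):
    """Return (did, name) pairs that pass the size and class filters."""
    return conn.execute(FILTER_QUERY).fetchall()


def build_toy_db():
    """Create an in-memory database with four made-up datasets."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # did -> (name, instances, features, minority class size); all invented.
    datasets = {
        1: ("toy_ok", 1000, 20, 100),        # passes every filter
        2: ("toy_small", 200, 20, 50),       # too few instances
        3: ("toy_rare_class", 1000, 20, 5),  # minority class too small
        4: ("BNG(toy)", 100000, 20, 1000),   # excluded by name pattern
    }
    for did, (name, n_inst, n_feat, minority) in datasets.items():
        conn.execute("INSERT INTO dataset VALUES (?, ?)", (did, name))
        conn.executemany(
            "INSERT INTO data_quality VALUES (?, ?, ?)",
            [
                (did, "NumberOfInstances", n_inst),
                (did, "NumberOfFeatures", n_feat),
                (did, "MinorityClassSize", minority),
            ],
        )
    return conn
```

Calling `candidate_datasets(build_toy_db())` returns only the `toy_ok` row, since the other three rows each fail exactly one condition.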

janvanrijn commented 6 years ago

Length (~400)

40517   20_newsgroups.drift
727     2dplanes
1557    abalone
40993   ada_agnostic
1037    ada_prior
1590    adult
1119    adult-census
40657   agaricus-lepiota
1235    Agrawal1
734     ailerons
1169    airlines
1240    AirlinesCodrnaAdult
40707   allbp
40708   allrep
40906   Aloi
4135    Amazon_employee_access
970     analcatdata_authorship
1014    analcatdata_dmft
966     analcatdata_halloffame
728     analcatdata_supreme
989     anneal
40886   Annthyroid
40889   AnomalyData_10percent
40891   AnomalyData_10percent_hd
40887   AnomalyData_5percent
40888   AnomalyData_5percent_hd
40795   ant-1.7
949     arsenic-female-bladder
950     arsenic-female-lung
947     arsenic-male-bladder
951     arsenic-male-lung
40892   Artificial_unsupervised
1459    artificial-characters
40981   Australian
1547    autoUniv-au1-1000
1548    autoUniv-au4-2500
1555    autoUniv-au6-1000
1549    autoUniv-au6-750
1552    autoUniv-au7-1100
1553    autoUniv-au7-700
997     balance-scale
914     balloon
40905   banana
1558    bank-marketing
833     bank32nh
725     bank8FM
40958   Bankdata
1462    banknote-authentication
185     baseball
40976   Bike
4134    Bioresponse
40588   birds
1464    blood-transfusion-service-center
872     boston
825     boston_corrected
40698   breast
15      breast-w
822     cal_housing
40802   camel-1.2
40803   camel-1.4
40804   camel-1.6
40975   car
40664   car-evaluation
40764   Car-test
40763   Car-train
1560    cardiotocography
4535    Census-Income
1118    chess
40701   churn
40927   CIFAR_10
40926   CIFAR_10_small
23380   cjs
40666   clean2
1227    Click_prediction_small
40920   Climate
40995   climate-model-simulation-chrashes
40994   climate-model-simulation-crashes
40805   CM1
983     cmc
1468    cnae-9
351     codrna
1241    codrnaNorm
897     colleges_aaup
930     colleges_usnews
40668   connect-4
40766   Convex-test
40765   Convex-train
1596    covertype
23386   CoverType5percent
761     cpu_act
735     cpu_small
29      credit-approval
31      credit-g
1597    creditcard
4154    CreditCardSubset
40712   crx
6332    cylinder-bands
803     delta_ailerons
819     delta_elevators
40921   Devnagari_Script_Dataset
    did     name
40923   Devnagari-Script
41008   diabetes
4541    Diabetes130US
40713   dis
774     disclosure_x_bias
827     disclosure_x_noise
795     disclosure_x_tampered
931     disclosure_z
489     dj30-1985-2003
40670   dna
1471    eeg-eye-state
151     electricity
846     elevators
40589   emotions
40590   enron
40865   epilobee_mortality
990     eucalyptus
1044    eye_movements
23518   F2Pmodel3smoted
40479   F2pmodel4scaledsmoted
23519   F2Pmodel4smoted
40996   Fashion-MNIST
389     fbis.wc
40967   feedback
40969   feedback_1
1475    first-order-theorem-proving
40702   flare
40904   Forest_Cover
40695   GAMETES_Epistasis_2-Way_1000atts_0.4H_EDM-1_EDM-1_...
40696   GAMETES_Epistasis_2-Way_20atts_0.1H_EDM-1_1
40697   GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1
40654   GAMETES_Epistasis_3-Way_20atts_0.2H_EDM-1_1
40655   GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_50_E...
40656   GAMETES_Heterogeneity_20atts_1600_Het_0.4_0.2_75_E...
1476    gas-drift
1477    gas-drift-different-concentrations
40591   genbase
40771   GermanCredit-train
4538    GesturePhaseSegmentationProcessed
1038    gina_agnostic
1042    gina_prior
1041    gina_prior2
1478    har
23512   higgs
1566    hill-valley
1039    hiva_agnostic
41010   Homicide
40864   Honey_bee_Seasonal_mortality
821     house_16H
843     house_8L
823     houses
853     housing
40895   Http_lowdim
152     Hyperplane_10_1E-3
153     Hyperplane_10_1E-4
40676   hypothyroid
1480    ilpd
40592   image
273     IMDB.drama
40978   Internet-Advertisements
382     ipums_la_97-small
381     ipums_la_98-small
378     ipums_la_99-small
300     isolet
375     JapaneseVowels
40816   jm1
41001   jungle_chess_2pcs_endgame_complete
40999   jungle_chess_2pcs_endgame_elephant_elephant
41004   jungle_chess_2pcs_endgame_lion_elephant
41007   jungle_chess_2pcs_endgame_lion_lion
41000   jungle_chess_2pcs_endgame_panther_elephant
40998   jungle_chess_2pcs_endgame_panther_lion
41003   jungle_chess_2pcs_endgame_rat_elephant
41006   jungle_chess_2pcs_endgame_rat_lion
41002   jungle_chess_2pcs_endgame_rat_panther
41005   jungle_chess_2pcs_endgame_rat_rat
40817   kc1
1063    kc2
839     kdd_el_nino-small
981     kdd_internet_usage
993     kdd_ipums_la_97-small
1002    kdd_ipums_la_98-small
1018    kdd_ipums_la_99-small
976     kdd_JapaneseVowels
1004    kdd_synthetic_control
1111    KDDCup09_appetency
1112    KDDCup09_churn
1114    KDDCup09_upselling
807     kin8nm
1481    kr-vs-k
3       kr-vs-kp
40776   KR-vs-KP-test
40775   KR-vs-KP-train
184     kropt
40593   langLog
1483    ldpa
154     LED(50000)
40677   led24
40678   led7
977     letter
1223    letter-challenge-labeled.arff
1222    letter-challenge-unlabeled.arff
1485    madelon
40778   Madelon-test
40777   Madelon-train
40679   magic
1120    MagicTelescope
40907   mammography
40827   mc1
757     meta
978     mfeat-factors
971     mfeat-fourier
1020    mfeat-karhunen
962     mfeat-morphological
40979   mfeat-pixel
995     mfeat-zernike
40966   MiceProtein
1515    micro-mass
40754   ML2017-challenge-1
40755   ML2017-challenge-2
40756   ML2017-challenge-3
554     mnist_784
40493   Model4scaledsmoted
40680   mofn-3-7-10
333     monks-problems-1
334     monks-problems-2
335     monks-problems-3
40829   mozilla4
40897   Mulcross
809     mushroom
1116    musk
881     mv
4140    NELL
1486    nomao
41012   NPSdecay
23517   numerai28.6
1568    nursery
392     oh0.wc
401     oh10.wc
386     oh15.wc
394     oh5.wc
311     oil_spill
1491    one-hundred-plants-margin
1492    one-hundred-plants-shape
1493    one-hundred-plants-texture
980     optdigits
40735   ozone_level
1487    ozone-level-8hr
1021    page-blocks
40706   parity5_plus_5
40869   pathogen_survey_dataset
802     pbcseq
40831   pc1
40832   pc2
40833   pc3
40873   pc4
40834   PC5
40908   Pen_global
1019    pendigits
4534    PhishingWebsites
1489    phoneme
4709    Physical_Activity_Recognition_Dataset_Using_Five_S...
4675    Physical_Activity_Recognition_Dataset_Using_Smartp...
1451    PieChart1
1452    PieChart2
1453    PieChart3
1454    PieChart4
40715   pima
1443    PizzaCutter1
1444    PizzaCutter3
4546    Plants
354     poker
1569    poker-hand
722     pol
871     pollen
40725   PredictItTesting
470     profb
40839   prop-1
40840   prop-2
40841   prop-3
40842   prop-4
40843   prop-5
40844   prop-6
752     puma32H
816     puma8NH
948     quake
156     RandomRBF_0_0
157     RandomRBF_10_1E-3
158     RandomRBF_10_1E-4
159     RandomRBF_50_1E-3
160     RandomRBF_50_1E-4
391     re0.wc
40594   reuters
40748   Reuters-Corn
40747   Reuters-Grain
1496    ringnorm
717     rmftsa_ladata
741     rmftsa_sleepdata
40922   Run_or_walk_information
23508   S1S1_1s50_31features_7tasks
40900   Satellite
40734   satellite_image
182     satimage
40595   scene
40779   Secom-train
40984   segment
40878   seismic-bumps
1501    semeion
40781   Semeion-train
23383   SensorDataResource
826     sensory
40902   Shuttle_2percent
40901   Shuttle_7percent
38      sick
1502    skin-segmentation
40596   slashdot
40903   smtp
934     socmob
40687   solar-flare_2
1023    soybean
737     space_ga
44      spambase
954     spectrometer
40910   Speech
40536   SpeedDating
953     splice
1503    spoken-arabic-digit
1236    Stagger1
1237    Stagger2
1238    Stagger3
40982   steel-plates-fault
841     stock
770     strikes
40992   sylva_agnostic
1040    sylva_prior
377     synthetic_control
40985   tamilnadu-electricity
40912   Teste_1
40913   Teste_2
40499   texture
40690   threeOf9
314     thyroid_sick
40474   thyroid-allbp
40475   thyroid-allhyper
40476   thyroid-allhypo
40477   thyroid-allrep
40497   thyroid-ann
40478   thyroid-dis
50      tic-tac-toe
40704   titanic
40705   tokyo1
40850   tomcat
1507    twonorm
373     UNIX_user_data
40917   USvid
994     vehicle
357     vehicle_sensIT
1242    vehicleNorm
923     visualizing_soil
1527    volcanoes-a1
1528    volcanoes-a2
1529    volcanoes-a3
1530    volcanoes-a4
1531    volcanoes-b1
1532    volcanoes-b2
1533    volcanoes-b3
1534    volcanoes-b4
1535    volcanoes-b5
1536    volcanoes-b6
1537    volcanoes-c1
1538    volcanoes-d1
1539    volcanoes-d2
1540    volcanoes-d3
1541    volcanoes-d4
1016    vowel
1509    walking-activity
1526    wall-robot-navigation
940     water-treatment
979     waveform-5000
40784   Waveform-test
40783   Waveform-train
1510    wdbc
350     webdata_wXa
4139    Wikidata
40983   wilt
847     wind
40854   xalan-2.4
40855   xalan-2.5
40856   xalan-2.6
40858   xalan-2.8
40693   xd6
40861   xerces-1.4
40733   yeast
316     yeast_ml8

janvanrijn commented 6 years ago

For future reference (FFR), a slightly better query that results in 300 potential datasets:

SELECT MAX(d.did) AS did, d.name
FROM dataset d, data_quality inst, data_quality att, data_quality attnom, data_quality attnum, data_quality cl, data_quality minc, data_quality maxc
WHERE inst.data = d.did
AND inst.quality = "NumberOfInstances"
AND att.data = d.did
AND att.quality = "NumberOfFeatures"
AND attnom.data = d.did
AND attnom.quality = "NumberOfSymbolicFeatures"
AND attnum.data = d.did
AND attnum.quality = "NumberOfNumericFeatures"
AND cl.data = d.did
AND cl.quality = "NumberOfClasses"
AND minc.data = d.did
AND minc.quality = "MinorityClassSize"
AND maxc.data = d.did
AND maxc.quality = "MajorityClassSize"
AND inst.value >= 500
AND att.value < 5000
AND minc.value > 10
AND d.name NOT LIKE "BNG(%"
AND d.name NOT LIKE "SEA(%"
AND d.name NOT LIKE "fri_%"
AND d.name NOT LIKE "QSAR%"
AND d.name NOT IN (SELECT name FROM dataset, dataset_tag WHERE dataset.did = dataset_tag.id AND dataset_tag.tag = "OpenML100")
GROUP BY d.name
LIMIT 5000
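
Compared with the earlier query, this version adds two things: `MAX(d.did)` keeps a single id per dataset name, and the `NOT IN` subquery drops names that are already tagged "OpenML100". The sketch below restates just those two steps in plain Python. The function name `latest_untagged` and the sample rows are hypothetical, and Python string comparison is case-sensitive, which may differ from the database collation.

```python
def latest_untagged(rows, tagged_names):
    """Keep the highest did per name, skipping names already in the suite.

    rows: iterable of (did, name) pairs, e.g. the output of the filter query
    tagged_names: names of datasets that already carry the suite tag
    """
    tagged = set(tagged_names)
    latest = {}
    for did, name in rows:
        if name in tagged:
            continue  # mirrors the NOT IN (...) subquery
        if name not in latest or did > latest[name]:
            latest[name] = did  # mirrors MAX(d.did) ... GROUP BY d.name
    # Sort by name, ignoring case, to resemble the listing in this thread.
    return sorted(
        ((did, name) for name, did in latest.items()),
        key=lambda row: row[1].lower(),
    )


# Made-up example rows: two versions of "alpha", one tagged "beta".
example_rows = [(10, "alpha"), (42, "alpha"), (7, "beta"), (99, "gamma")]
example_result = latest_untagged(example_rows, {"beta"})
```

Here `example_result` is `[(42, "alpha"), (99, "gamma")]`: the older version of `alpha` and the already tagged `beta` are gone.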

janvanrijn commented 6 years ago

Going through the list in order, I have found several datasets that could potentially be included:

clean2 (PMLR dataset, need to find citation)
codrna (big)
covertype (did not fit original minority / majority class requirement)
crx (Added by @Yatoom, does not have a description. Don't know what dataset it is)
Devnagari-Script (40921/40923 - contact the author to ask for a bibref)
Diabetes130US (plenty of information, how did we miss this one?)
dis (Added by @Yatoom, does not have a description. Don't know what dataset it is)
dj30-1985-2003 (stock market, looks really neat. Is big though)
dna (Added by @Yatoom, does not have a description. Don't know what dataset it is)
eye_movements (looks good, on what criteria did we dismiss it?)
F2Pmodel3smoted (private dataset, no bibtex, contact author?) - or similarly F2pmodel4scaledsmoted, F2Pmodel4smoted
Fashion-MNIST (relatively new, looks cool)
fbis.wc (does not pass the min/max class requirement, I guess)
flare (Added by @Yatoom, does not have a description. Don't know what dataset it is)

This is how far I have got for now; please let me know if some of these are useful :)

mfeurer commented 6 years ago

Some more comments on those datasets:

covertype is 500k instances (too big)
Diabetes130US has more than 100k samples (too big)
eye_movements didn't show up in the original list of potential datasets, so it seems that the original SQL query did not select it
Fashion-MNIST looks really cool
fbis.wc has more than 500 features, that's why we excluded it

I think you actually meant @PGijsbers ?
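
The size remarks above ("too big", too many features) amount to a handful of threshold checks. The following is a minimal sketch of such a check that reports why a dataset would be rejected. All thresholds are assumptions: the lower bounds and the feature limit mirror the SQL queries in this thread, `max_instances` is inferred from the "more than 100k samples (too big)" remark, and the names `Limits` and `rejection_reasons` are hypothetical. The criteria actually used for the suite may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Assumed thresholds, see the paragraph above; not official criteria.
    min_instances: int = 500        # inclusive, as in "inst.value >= 500"
    max_instances: int = 100_000    # inferred from the "100k" remark
    max_features: int = 5000        # exclusive, as in "att.value < 5000"
    min_minority_class: int = 10    # exclusive, as in "minc.value > 10"


def rejection_reasons(n_instances, n_features, minority_class_size, limits=None):
    """Return a list of reasons for rejecting a dataset (empty if it passes)."""
    limits = limits or Limits()
    reasons = []
    if n_instances < limits.min_instances:
        reasons.append("too small")
    if n_instances > limits.max_instances:
        reasons.append("too big")
    if n_features >= limits.max_features:
        reasons.append("too many features")
    if minority_class_size <= limits.min_minority_class:
        reasons.append("minority class too small")
    return reasons
```

With made-up numbers, `rejection_reasons(500_000, 54, 2000)` gives `["too big"]`, while `rejection_reasons(1000, 20, 100)` gives an empty list.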

mfeurer commented 6 years ago

Discussed with @giuseppec and @joaquinvanschoren: we will only use datasets uploaded up to this very moment, plus datasets uploaded afterwards only if they are fixed versions of previous datasets.

joaquinvanschoren commented 6 years ago

I actually ran into most of these while working on the automated script from #17. You can check the results to explain why these are in/out:

clean2 -> derived from musk
codrna -> Too large
covertype -> Too large
crx -> deactivated (duplicate of credit-approval)
Devnagari-Script -> OK!
Diabetes130US -> Slightly too large
dis -> Extreme imbalance
dj30-1985-2003 -> Slightly too large (also still in_preparation)
dna -> OK, but no description. Needs digging.
eye_movements -> grouped data, needs special data splits
F2Pmodel3smoted, F2pmodel4scaledsmoted, F2Pmodel4smoted -> private datasets from a student of mine. Data from a company, likely not public anytime soon.
Fashion-MNIST -> OK! (student of mine uploaded it)
fbis.wc -> SparseARFF format (apparently we decided not to include these a while ago)
flare -> Duplicate of solar-flare

mfeurer commented 6 years ago

This is now governed by the notebook for study generation.

openml / benchmark-suites

More datasets #15