openml / openml-r

R package to interface with OpenML
http://openml.github.io/openml-r/

listOMLDataSets() returns less than 5000 #446

Closed annakrystalli closed 4 years ago

annakrystalli commented 4 years ago

The help for listOMLDataSets() states that, by default, the first 5000 datasets are returned, but I'm only getting 2955. 🤷‍♀️

data_list <- OpenML::listOMLDataSets()
#> Downloading from 'http://www.openml.org/api/v1/json/data/list/limit/5000/status/active' to '<mem>'.
tibble::as_tibble(data_list)
#> # A tibble: 2,955 x 16
#>    data.id name  version status format tags  majority.class.…
#>      <int> <chr>   <int> <chr>  <chr>  <chr>            <int>
#>  1       2 anne…       1 active ARFF   ""                 684
#>  2       3 kr-v…       1 active ARFF   ""                1669
#>  3       4 labor       1 active ARFF   ""                  37
#>  4       5 arrh…       1 active ARFF   ""                 245
#>  5       6 lett…       1 active ARFF   ""                 813
#>  6       7 audi…       1 active ARFF   ""                  57
#>  7       8 live…       1 active ARFF   ""                  NA
#>  8       9 autos       1 active ARFF   ""                  67
#>  9      10 lymph       1 active ARFF   ""                  81
#> 10      11 bala…       1 active ARFF   ""                 288
#> # … with 2,945 more rows, and 9 more variables:
#> #   max.nominal.att.distinct.values <int>, minority.class.size <int>,
#> #   number.of.classes <int>, number.of.features <int>,
#> #   number.of.instances <int>,
#> #   number.of.instances.with.missing.values <int>,
#> #   number.of.missing.values <int>, number.of.numeric.features <int>,
#> #   number.of.symbolic.features <int>

Created on 2019-10-14 by the reprex package (v0.3.0)

giuseppec commented 4 years ago

@joaquinvanschoren, @janvanrijn I guess the number of active datasets changed on the server (if I remember correctly, it used to be more than the current 2955)?

Anyway, the OpenML server apparently now has only 2955 active datasets, and the R function still does what it should: it returns at most the first 5000 datasets that match the status filter. Since the status argument defaults to "active", you get only the 2955 active ones. I clarified this in the documentation to avoid confusion.
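The interaction between the limit and the status filter can be illustrated without contacting the server. This is only a minimal sketch, not the package's implementation: fake_server and list_datasets() are hypothetical stand-ins, and the 100 deactivated rows are an arbitrary number chosen for illustration (only the 2955 active count comes from this thread).

```r
# Hypothetical stand-in for the server's dataset table (NOT real OpenML data):
# 2955 active entries, as reported above, plus an arbitrary 100 deactivated ones.
fake_server <- data.frame(
  data.id = seq_len(3055),
  status  = c(rep("active", 2955), rep("deactivated", 100)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: filter by status first, then return at most `limit` rows.
list_datasets <- function(server, limit = 5000, status = "active") {
  if (status != "all") {
    server <- server[server$status == status, , drop = FALSE]
  }
  head(server, n = limit)
}

nrow(list_datasets(fake_server))                  # 2955: limit is an upper bound
nrow(list_datasets(fake_server, limit = 10))      # 10: limit caps the result
nrow(list_datasets(fake_server, status = "all"))  # 3055 in this made-up table
```

The point is that the limit only caps the result; it never pads it, so a default of 5000 returns fewer rows whenever fewer datasets match the filter.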

By the way: if you also want the non-active datasets, you can always use data_list <- OpenML::listOMLDataSets(status = "all"). I am not sure whether people should use datasets that are not active, though; @joaquinvanschoren and @janvanrijn can answer that.
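One way to see what the "all" setting adds is to tabulate the status column of the returned listing. The following is a sketch under assumptions: count_by_status() is a hypothetical helper (not part of the package), and the guarded call assumes the OpenML package is installed, the server is reachable, and the status argument accepts "all" as described above.

```r
# Hypothetical helper: count datasets per status in a listing data frame.
count_by_status <- function(listing) {
  table(listing$status, useNA = "ifany")
}

# The real call needs the OpenML package and network access, so it is guarded
# and only runs in an interactive session.
if (interactive() && requireNamespace("OpenML", quietly = TRUE)) {
  all_sets <- OpenML::listOMLDataSets(status = "all")
  print(count_by_status(all_sets))

  # Keep only the active datasets if the others should not be used.
  active_only <- all_sets[all_sets$status == "active", , drop = FALSE]
}
```

Filtering on the status column afterwards keeps the choice of which datasets to trust explicit in your own code.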