openml / openml-r

R package to interface with OpenML
http://openml.github.io/openml-r/

How to specify more than one feature in "ignore.attribute" when uploading a dataset? #423

Closed FlorianPargent closed 5 years ago

FlorianPargent commented 5 years ago

What is the correct way to specify more than one feature to be ignored by OpenML in modeling?

The documentation of makeOMLDataSetDescription() says:

ignore.attribute [character(1)] Attributes that should be excluded in modelling, such as identifiers and indexes. Optional.

Using a character vector with length > 1 does not work although the dataset will be uploaded. This can be seen with the following dataset I uploaded, where I used ignore.attribute = c("FL_DATE", "CRS_DEP_TIME"):

dat = getOMLDataSet(41209)
dat$desc$ignore.attribute

[1] "FL_DATECRS_DEP_TIME"
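
The stored value looks like the two names pasted together without a separator. A minimal base R sketch of that symptom follows; it only reproduces the observed string and is not taken from the package's upload code, whose actual mechanism is an assumption here.

```r
# Base R only; mimics the observed symptom, not the package internals.
ignore <- c("FL_DATE", "CRS_DEP_TIME")

# Collapsing the vector without a separator reproduces the stored value.
collapsed <- paste(ignore, collapse = "")
print(collapsed)  # [1] "FL_DATECRS_DEP_TIME"

# After collapsing, the value no longer matches either original column name.
print(collapsed %in% ignore)  # [1] FALSE
```

Because the names are fused, any later lookup of the ignored columns by name cannot succeed.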

giuseppec commented 5 years ago

@janvanrijn, @joaquinvanschoren apparently it is not possible to upload a data set with multiple features in the ignore_attribute field? I tried adding the features that should be ignored as a comma-separated list in the XML, e.g.,
<oml:ignore_attribute>index,col_1,col_2,col_3</oml:ignore_attribute>. However, I get the error "Problem validating uploaded description file XML does not correspond to XSD schema".
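
One alternative to a comma-separated value is to emit one element per attribute. The sketch below assumes the schema allows `oml:ignore_attribute` to repeat, which should be checked against the XSD before relying on it. `make_ignore_nodes` is a hypothetical helper, not part of the OpenML package.

```r
# Hypothetical helper (not part of the OpenML package): build one
# <oml:ignore_attribute> element per attribute name.
# Assumes names need no XML escaping (no '&', '<' or '>').
make_ignore_nodes <- function(attrs) {
  attrs <- attrs[!is.na(attrs)]  # NA means "nothing to ignore"
  sprintf("<oml:ignore_attribute>%s</oml:ignore_attribute>", attrs)
}

nodes <- make_ignore_nodes(c("index", "col_1", "col_2", "col_3"))
cat(nodes, sep = "\n")
```

With no attributes to ignore, the helper returns an empty character vector, so no element is written at all.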

Btw., I cannot find the XSD schema anymore. It used to be at https://github.com/openml/website/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd but now it's gone.

joaquinvanschoren commented 5 years ago

Hi Giuseppe, the correct XSD is in the API docs: https://www.openml.org/api_docs#!/data/post_data

Direct link: https://www.openml.org/api/v1/xsd/openml.data.upload

janvanrijn commented 5 years ago

The repository was renamed. It's now: https://github.com/openml/OpenML/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd

joaquinvanschoren commented 5 years ago

I hope that is the exact same XSD file :)

giuseppec commented 5 years ago

@FlorianPargent I think I fixed it, could you check if everything you need works now without errors?

FlorianPargent commented 5 years ago

I reuploaded my dataset but still got the following warning:

In if (!is.na(val)) newXMLNode(name, as.character(val), parent = parent,  :
  the condition has length > 1 and only the first element will be used

(Warning message translated from German.)
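
The warning indicates that `if` received a condition with more than one element, which is what `is.na()` returns for a vector of two names. Below is a minimal base R sketch of a length-safe check. `has_value` is a hypothetical helper for illustration and is not the fix that was applied in the package.

```r
val <- c("FL_DATE", "CRS_DEP_TIME")

# is.na() is vectorised: two names give two logical values.
print(length(is.na(val)))  # [1] 2

# Hypothetical helper: reduce the check to a single TRUE/FALSE.
# (Since R 4.2.0, a condition of length > 1 in if() is an error, not a warning.)
has_value <- function(val) length(val) > 0 && !all(is.na(val))

if (has_value(val)) {
  message("writing ", length(val), " ignore attribute(s)")
}
```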

Unfortunately it still does not work:

library(OpenML)
dat = getOMLDataSet(41242)
dat$desc$ignore.attribute

still returns: [1] "FL_DATECRS_DEP_TIME"

also,

task = convertOMLDataSetToMlr(dat)

returns

Error in makeSupervisedTask("regr", data, target, weights, blocking, coordinates,  : 
  Column names of data doesn't contain target var: ARR_DELAY

although

"ARR_DELAY" %in% names(dat$data)
[1] TRUE

> devtools::session_info()

Session info ----------------------------------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.1.456)           
 language (EN)                        
 collate  de_DE.UTF-8                 
 tz       Europe/Berlin               
 date     2018-11-07                  

Packages --------------------------------------------------------------------------------------------------------------------------------------------
 package      * version   date       source                                   
 assertthat     0.2.0     2017-04-11 CRAN (R 3.5.0)                           
 backports      1.1.2     2017-12-13 CRAN (R 3.5.0)                           
 base         * 3.5.1     2018-07-05 local                                    
 BBmisc         1.11      2018-11-07 Github (berndbischl/BBmisc@a5a4e45)      
 bindr          0.1.1     2018-03-13 CRAN (R 3.5.0)                           
 bindrcpp       0.2.2     2018-03-29 CRAN (R 3.5.0)                           
 checkmate      1.8.5     2017-10-24 CRAN (R 3.5.0)                           
 colorspace     1.3-2     2016-12-14 CRAN (R 3.5.0)                           
 compiler       3.5.1     2018-07-05 local                                    
 crayon         1.3.4     2017-09-16 CRAN (R 3.5.0)                           
 curl           3.2       2018-03-28 CRAN (R 3.5.0)                           
 data.table     1.11.8    2018-09-30 CRAN (R 3.5.0)                           
 datasets     * 3.5.1     2018-07-05 local                                    
 devtools       1.13.6    2018-06-27 CRAN (R 3.5.0)                           
 digest         0.6.18    2018-10-10 CRAN (R 3.5.0)                           
 dplyr          0.7.7     2018-10-16 CRAN (R 3.5.0)                           
 farff          1.0       2018-10-30 Github (mlr-org/farff@2e911b7)           
 fastmatch      1.1-0     2017-01-28 CRAN (R 3.5.0)                           
 ggplot2        3.1.0     2018-10-25 cran (@3.1.0)                            
 git2r          0.23.0    2018-07-17 CRAN (R 3.5.0)                           
 glue           1.3.0     2018-07-17 CRAN (R 3.5.0)                           
 graphics     * 3.5.1     2018-07-05 local                                    
 grDevices    * 3.5.1     2018-07-05 local                                    
 grid           3.5.1     2018-07-05 local                                    
 gtable         0.2.0     2016-02-26 CRAN (R 3.5.0)                           
 hms            0.4.2     2018-03-10 CRAN (R 3.5.0)                           
 httr           1.3.1     2017-08-20 CRAN (R 3.5.0)                           
 jsonlite       1.5       2017-06-01 CRAN (R 3.5.0)                           
 knitr          1.20      2018-02-20 CRAN (R 3.5.0)                           
 lattice        0.20-35   2017-03-25 CRAN (R 3.5.1)                           
 lazyeval       0.2.1     2017-10-29 CRAN (R 3.5.0)                           
 magrittr       1.5       2014-11-22 CRAN (R 3.5.0)                           
 Matrix         1.2-14    2018-04-13 CRAN (R 3.5.1)                           
 memoise        1.1.0     2017-04-21 CRAN (R 3.5.0)                           
 methods      * 3.5.1     2018-07-05 local                                    
 mlr          * 2.13.9000 2018-11-07 Github (mlr-org/mlr@f28c937)             
 munsell        0.5.0     2018-06-12 CRAN (R 3.5.0)                           
 OpenML       * 1.9       2018-11-07 Github (openml/openml-r@316cb8a)         
 parallel       3.5.1     2018-07-05 local                                    
 parallelMap    1.4       2018-11-07 Github (berndbischl/parallelMap@101b91d) 
 ParamHelpers * 1.11      2018-11-07 Github (berndbischl/ParamHelpers@0516926)
 pillar         1.3.0     2018-07-14 CRAN (R 3.5.0)                           
 pkgconfig      2.0.2     2018-08-16 CRAN (R 3.5.0)                           
 plyr           1.8.4     2016-06-08 CRAN (R 3.5.0)                           
 purrr          0.2.5     2018-05-29 CRAN (R 3.5.0)                           
 R6             2.3.0     2018-10-04 CRAN (R 3.5.0)                           
 Rcpp           0.12.19   2018-10-01 CRAN (R 3.5.0)                           
 readr        * 1.1.1     2017-05-16 CRAN (R 3.5.0)                           
 rlang          0.3.0.1   2018-10-25 cran (@0.3.0.1)                          
 rstudioapi     0.8       2018-10-02 CRAN (R 3.5.0)                           
 scales         1.0.0     2018-08-09 CRAN (R 3.5.0)                           
 splines        3.5.1     2018-07-05 local                                    
 stats        * 3.5.1     2018-07-05 local                                    
 stringi        1.2.4     2018-07-20 CRAN (R 3.5.0)                           
 survival       2.42-3    2018-04-16 CRAN (R 3.5.1)                           
 tibble         1.4.2     2018-01-22 CRAN (R 3.5.0)                           
 tidyselect     0.2.5     2018-10-11 CRAN (R 3.5.0)                           
 tools          3.5.1     2018-07-05 local                                    
 utils        * 3.5.1     2018-07-05 local                                    
 withr          2.1.2     2018-03-15 CRAN (R 3.5.0)                           
 XML            3.98-1.16 2018-08-19 CRAN (R 3.5.0)                           
 yaml           2.2.0     2018-07-25 CRAN (R 3.5.0)

FlorianPargent commented 5 years ago

I also have a related question: where in the XSD files do I find which data types are actually required for the column names and data values in my datasets? As I understand it, the XSD file linked above only refers to the meta features in the dataset description but not to the dataset itself, right?

giuseppec commented 5 years ago

Argh, I forgot to fix another thing... It was like https://g.redditmedia.com/N5tZYbhFstt1m6z6zLSg14yiRT3RikeLvp48Z4lp1lo.gif?fm=mp4&mp4-fragmented=false&s=c80db07159d4a388a03844b4cd541e41

Hope it works now. Added a unit test that checks the whole process and it looked ok, i.e.:

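    library(OpenML)
    # iris.task ships with mlr; extract its underlying data.frame
    iris = mlr::getTaskData(mlr::iris.task)
    # describe the data set, marking two features to be ignored in modelling
    desc = makeOMLDataSetDescription(
      name = "iris",
      description = "iris with ignored features Sepal.Width and Petal.Length",
      ignore.attribute = c("Sepal.Width", "Petal.Length"),
      default.target.attribute = "Species"
    )
    d = makeOMLDataSet(desc, data = iris)
    # upload, download again and convert to an mlr task (requires an API key)
    did = uploadOMLDataSet(d)
    d2 = getOMLDataSet(did)
    convertOMLDataSetToMlr(d2)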

Regarding your other question: did you look at the .arff file that is uploaded? It also contains some further meta-info about the data set itself.
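
In an ARFF file, the column names and types are declared in `@attribute` lines of the header. The base R sketch below reads them from a small hand-written header. The header text is made up for illustration, and the parsing is deliberately simple: it assumes unquoted attribute names without spaces.

```r
# Illustrative ARFF header (made up for this example).
arff_header <- c(
  "@relation iris",
  "@attribute Sepal.Length numeric",
  "@attribute Species {setosa,versicolor,virginica}",
  "@data"
)

# Keep only the attribute declarations.
attr_lines <- grep("^@attribute", arff_header, ignore.case = TRUE, value = TRUE)

# Split each declaration into a name and a type.
pattern <- "^@attribute[[:space:]]+([^[:space:]]+)[[:space:]]+(.+)$"
parts <- regmatches(attr_lines, regexec(pattern, attr_lines, ignore.case = TRUE))

types <- data.frame(
  name = vapply(parts, `[`, character(1), 2),
  type = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(types)
```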

FlorianPargent commented 5 years ago

Thanks a lot for fixing, it works now!

FlorianPargent commented 5 years ago

Unfortunately, there is now a problem with downloading datasets when no feature is specified in ignore.attribute:

library(OpenML)
library(data.table)

loadOMLConfig(path = "~/.openml/config", assign = TRUE)

dat = data.table(a = rnorm(100), b = rnorm(100), c = rnorm(100))

desc = makeOMLDataSetDescription(
  name = "test-ignore-bug",
  description = "this is just to test a bug",
  default.target.attribute = "a"
)

# create OML dataset
oml_dat = makeOMLDataSet(
  desc = desc,
  data = dat,
  colnames.old = colnames(dat),
  colnames.new = colnames(dat), 
  target.features = "a")

id = uploadOMLDataSet(oml_dat)

dat2 = getOMLDataSet(id)

deleteOMLObject(id, object = "data")

gives:

Error in getOMLDataSet(id) : 
  Assertion on 'desc$ignore.attribute' failed: Must be a subset of {'a','b','c','NA'}.

giuseppec commented 5 years ago

OK, I introduced a bug because I thought it would be a good idea to do stricter argument checks. But this causes many other issues, especially with already uploaded data sets (and wrong/missing field names). It should work again now.
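
A lenient check of this kind can be sketched in base R as follows. `validate_ignore_attribute` is a hypothetical helper, not the package's code: it treats a missing value as "nothing to ignore" and warns about unknown names instead of failing, so that already uploaded data sets with wrong or missing field names still load.

```r
# Hypothetical helper, not the package's implementation.
validate_ignore_attribute <- function(ignore, columns) {
  # Drop missing and empty entries: NA means no attribute is ignored.
  ignore <- ignore[!is.na(ignore) & nzchar(ignore)]
  if (length(ignore) == 0) return(character(0))

  # Warn (rather than stop) about names that are not columns of the data.
  unknown <- setdiff(ignore, columns)
  if (length(unknown) > 0) {
    warning("ignore.attribute not found in data: ",
            paste(unknown, collapse = ", "))
  }
  intersect(ignore, columns)
}

cols <- c("a", "b", "c")
print(validate_ignore_attribute(NA, cols))           # character(0)
print(validate_ignore_attribute(c("a", "b"), cols))  # [1] "a" "b"
```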