Closed FlorianPargent closed 5 years ago
@janvanrijn, @joaquinvanschoren apparently it is not possible to upload a data set with multiple features in the ignore_attribute field?
I tried to add the features that should be ignored as a comma-separated list in the XML, e.g.,
<oml:ignore_attribute>index,col_1,col_2,col_3</oml:ignore_attribute>
However, I get
Problem validating uploaded description file XML does not correspond to XSD schema
By the way, I can no longer find the XSD schema. It used to be here: https://github.com/openml/website/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd but now it's gone.
Hi Giuseppe, the correct XSD is in the API docs: https://www.openml.org/api_docs#!/data/post_data
Direct link: https://www.openml.org/api/v1/xsd/openml.data.upload
The repository was renamed. It's now: https://github.com/openml/OpenML/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd
I hope that is the exact same XSD file :)
@FlorianPargent I think I fixed it, could you check if everything you need works now without errors?
I reuploaded my dataset but still got the following warning:
In if (!is.na(val)) newXMLNode(name, as.character(val), parent = parent, :
the condition has length > 1 and only the first element will be used
(Translated from the original German message.)
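The warning points at a scalar `if` being given a vector: `!is.na(val)` has one element per ignored feature, so only the first is checked (and from R 4.2.0 on this is an error rather than a warning). A minimal sketch of an element-wise alternative follows. `make_nodes` is a hypothetical helper, not part of the OpenML package, it builds plain strings instead of calling `newXMLNode`, and whether the server accepts one element per feature is an assumption to check against the XSD.

```r
# Hypothetical helper (not from the OpenML package): build one XML
# element per non-missing value instead of testing a whole vector
# inside a scalar `if`.
make_nodes = function(name, val) {
  val = val[!is.na(val)]                 # element-wise NA filter
  if (length(val) == 0L) return(character(0))
  # sprintf() is vectorised, so this yields one string per value.
  # Note: no XML escaping is done here; this is only a sketch.
  sprintf("<oml:%s>%s</oml:%s>", name, val, name)
}

nodes = make_nodes("ignore_attribute", c("FL_DATE", "CRS_DEP_TIME"))
print(nodes)
```

With a single `NA` as input the helper returns an empty character vector, so no element is written at all.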
Unfortunately it still does not work:
library(OpenML)
dat = getOMLDataSet(41242)
dat$desc$ignore.attribute
still returns:
[1] "FL_DATECRS_DEP_TIME"
also,
task = convertOMLDataSetToMlr(dat)
returns
Error in makeSupervisedTask("regr", data, target, weights, blocking, coordinates, :
Column names of data doesn't contain target var: ARR_DELAY
although
"ARR_DELAY" %in% names(dat$data)
[1] TRUE
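A quick way to spot this kind of mismatch is to compare the stored ignore attributes against the actual column names. The sketch below uses stand-in values copied from this thread so it runs without server access; on a downloaded dataset the same comparison would presumably be `setdiff(dat$desc$ignore.attribute, names(dat$data))`.

```r
# Stand-in values taken from the thread; no OpenML connection needed.
cols = c("FL_DATE", "CRS_DEP_TIME", "ARR_DELAY")
ignore = "FL_DATECRS_DEP_TIME"   # what the description currently returns

# Names listed as ignored that do not exist as columns in the data
unknown = setdiff(ignore, cols)
print(unknown)
```

Here the single stored value matches no column, which is consistent with the two feature names having been pasted together on upload.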
> devtools::session_info()
Session info ----------------------------------------------------------------------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
system x86_64, darwin15.6.0
ui RStudio (1.1.456)
language (EN)
collate de_DE.UTF-8
tz Europe/Berlin
date 2018-11-07
Packages --------------------------------------------------------------------------------------------------------------------------------------------
package * version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
backports 1.1.2 2017-12-13 CRAN (R 3.5.0)
base * 3.5.1 2018-07-05 local
BBmisc 1.11 2018-11-07 Github (berndbischl/BBmisc@a5a4e45)
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
checkmate 1.8.5 2017-10-24 CRAN (R 3.5.0)
colorspace 1.3-2 2016-12-14 CRAN (R 3.5.0)
compiler 3.5.1 2018-07-05 local
crayon 1.3.4 2017-09-16 CRAN (R 3.5.0)
curl 3.2 2018-03-28 CRAN (R 3.5.0)
data.table 1.11.8 2018-09-30 CRAN (R 3.5.0)
datasets * 3.5.1 2018-07-05 local
devtools 1.13.6 2018-06-27 CRAN (R 3.5.0)
digest 0.6.18 2018-10-10 CRAN (R 3.5.0)
dplyr 0.7.7 2018-10-16 CRAN (R 3.5.0)
farff 1.0 2018-10-30 Github (mlr-org/farff@2e911b7)
fastmatch 1.1-0 2017-01-28 CRAN (R 3.5.0)
ggplot2 3.1.0 2018-10-25 cran (@3.1.0)
git2r 0.23.0 2018-07-17 CRAN (R 3.5.0)
glue 1.3.0 2018-07-17 CRAN (R 3.5.0)
graphics * 3.5.1 2018-07-05 local
grDevices * 3.5.1 2018-07-05 local
grid 3.5.1 2018-07-05 local
gtable 0.2.0 2016-02-26 CRAN (R 3.5.0)
hms 0.4.2 2018-03-10 CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
jsonlite 1.5 2017-06-01 CRAN (R 3.5.0)
knitr 1.20 2018-02-20 CRAN (R 3.5.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.5.1)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
Matrix 1.2-14 2018-04-13 CRAN (R 3.5.1)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods * 3.5.1 2018-07-05 local
mlr * 2.13.9000 2018-11-07 Github (mlr-org/mlr@f28c937)
munsell 0.5.0 2018-06-12 CRAN (R 3.5.0)
OpenML * 1.9 2018-11-07 Github (openml/openml-r@316cb8a)
parallel 3.5.1 2018-07-05 local
parallelMap 1.4 2018-11-07 Github (berndbischl/parallelMap@101b91d)
ParamHelpers * 1.11 2018-11-07 Github (berndbischl/ParamHelpers@0516926)
pillar 1.3.0 2018-07-14 CRAN (R 3.5.0)
pkgconfig 2.0.2 2018-08-16 CRAN (R 3.5.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.5.0)
purrr 0.2.5 2018-05-29 CRAN (R 3.5.0)
R6 2.3.0 2018-10-04 CRAN (R 3.5.0)
Rcpp 0.12.19 2018-10-01 CRAN (R 3.5.0)
readr * 1.1.1 2017-05-16 CRAN (R 3.5.0)
rlang 0.3.0.1 2018-10-25 cran (@0.3.0.1)
rstudioapi 0.8 2018-10-02 CRAN (R 3.5.0)
scales 1.0.0 2018-08-09 CRAN (R 3.5.0)
splines 3.5.1 2018-07-05 local
stats * 3.5.1 2018-07-05 local
stringi 1.2.4 2018-07-20 CRAN (R 3.5.0)
survival 2.42-3 2018-04-16 CRAN (R 3.5.1)
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyselect 0.2.5 2018-10-11 CRAN (R 3.5.0)
tools 3.5.1 2018-07-05 local
utils * 3.5.1 2018-07-05 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
XML 3.98-1.16 2018-08-19 CRAN (R 3.5.0)
yaml 2.2.0 2018-07-25 CRAN (R 3.5.0)
I also have a related question: where in the XSD files do I find the information about which data types are actually required for the column names and data values in my datasets? As I understand it, the XSD file linked above only refers to the meta features in the dataset description, not to the dataset itself, right?
argh, forgot to fix another thing... It was like https://g.redditmedia.com/N5tZYbhFstt1m6z6zLSg14yiRT3RikeLvp48Z4lp1lo.gif?fm=mp4&mp4-fragmented=false&s=c80db07159d4a388a03844b4cd541e41
Hope it works now. Added a unit test that checks the whole process and it looked ok, i.e.:
iris = mlr::getTaskData(iris.task)
desc = makeOMLDataSetDescription(
  name = "iris",
  description = "iris with ignored features Sepal.Width and Petal.Length",
  ignore.attribute = c("Sepal.Width", "Petal.Length"),
  default.target.attribute = "Species"
)
d = makeOMLDataSet(desc, data = iris)
did = uploadOMLDataSet(d)
d2 = getOMLDataSet(did)
convertOMLDataSetToMlr(d2)
Regarding your other question: did you look at the .arff file that is uploaded? It also contains some further meta-information about the data set itself.
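To illustrate what that meta-information looks like: an ARFF header declares one `@ATTRIBUTE` line per column with its type (numeric, nominal with its levels, string, or date). The following is only a rough sketch of that mapping in base R, using the built-in `iris` data. `arff_type` is a hypothetical helper, not what the upload code actually does, and it skips date columns and the quoting of unusual column names.

```r
# Hypothetical helper: map an R column to an ARFF attribute type.
arff_type = function(x) {
  if (is.numeric(x)) {
    "NUMERIC"
  } else if (is.factor(x)) {
    # Nominal attributes list their allowed values in braces
    paste0("{", paste(levels(x), collapse = ","), "}")
  } else {
    "STRING"
  }
}

# One header line per column of the built-in iris data
header = sprintf("@ATTRIBUTE %s %s",
                 names(iris),
                 vapply(iris, arff_type, character(1)))
print(header)
```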
Thanks a lot for fixing, it works now!
Unfortunately, there is now a problem with downloading datasets when no feature is specified in ignore.attribute:
library(OpenML)
library(data.table)
loadOMLConfig(path = "~/.openml/config", assign = TRUE)
dat = data.table(a = rnorm(100), b = rnorm(100), c = rnorm(100))
desc = makeOMLDataSetDescription(
  name = "test-ignore-bug",
  description = "this is just to test a bug",
  default.target.attribute = "a"
)
# create OML dataset
oml_dat = makeOMLDataSet(
  desc = desc,
  data = dat,
  colnames.old = colnames(dat),
  colnames.new = colnames(dat),
  target.features = "a")
id = uploadOMLDataSet(oml_dat)
dat2 = getOMLDataSet(id)
deleteOMLObject(id, object = "data")
gives:
Error in getOMLDataSet(id) :
Assertion on 'desc$ignore.attribute' failed: Must be a subset of {'a','b','c','NA'}.
OK, I introduced a bug because I thought stricter argument checks would be a good idea. But they cause many other issues, especially with already uploaded data sets (and wrong/missing field names). It should work again now.
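For reference, the assertion above fails because a missing ignore.attribute arrives as `NA`, which is not a column name. One way to keep a subset check without tripping over that case is to normalise the field first. This is only a sketch: `normalize_ignore` is a hypothetical helper and not necessarily how the package handles it.

```r
# Hypothetical helper: turn a missing or empty ignore.attribute field
# into an empty character vector before any subset check.
normalize_ignore = function(ignore) {
  ignore = as.character(ignore)            # NULL becomes character(0)
  ignore[!is.na(ignore) & nzchar(ignore)]  # drop NA and "" entries
}

cols = c("a", "b", "c")
ok_missing = all(normalize_ignore(NA) %in% cols)        # nothing to check
ok_valid = all(normalize_ignore(c("a", "b")) %in% cols)
ok_wrong = all(normalize_ignore("d") %in% cols)
print(c(ok_missing, ok_valid, ok_wrong))
```

An empty vector passes the check trivially, while a genuinely unknown name such as "d" is still caught.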
What is the correct way to specify more than one feature to be ignored by OpenML in modeling?
The documentation of makeOMLDataSetDescription() says:
Using a character vector with length > 1 does not work although the dataset will be uploaded. This can be seen with the following dataset I uploaded, where I used ignore.attribute = c("FL_DATE", "CRS_DEP_TIME"):
dat = getOMLDataSet(41209)
dat$desc$ignore.attribute