How to specify more than one feature in "ignore.attribute" when uploading a dataset? #423

FlorianPargent commented 5 years ago

What is the correct way to specify more than one feature to be ignored by OpenML in modeling?

The documentation of makeOMLDataSetDescription() says:

ignore.attribute [character(1)] Attributes that should be excluded in modelling, such as identifiers and indexes. Optional.

Using a character vector with length > 1 does not work although the dataset will be uploaded. This can be seen with the following dataset I uploaded, where I used ignore.attribute = c("FL_DATE", "CRS_DEP_TIME"):

dat = getOMLDataSet(41209)
dat$desc$ignore.attribute


giuseppec commented 5 years ago

@janvanrijn, @joaquinvanschoren apparently it is not possible to upload a data set with multiple features in the ignore_attribute field? I tried to add the features that should be ignored as a comma-separated list in the XML, e.g.,
<oml:ignore_attribute>index,col_1,col_2,col_3</oml:ignore_attribute>. However, I get the error: Problem validating uploaded description file: XML does not correspond to XSD schema.
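One alternative worth trying is to emit one element per feature instead of a single comma-separated element. This assumes the schema allows ignore_attribute to occur more than once, which is not confirmed in this thread and should be checked against the XSD. A minimal base-R sketch follows; ignore_attribute_nodes is a hypothetical helper, not part of the OpenML R package:

```r
# Hypothetical helper (not an OpenML function): build one
# <oml:ignore_attribute> element per feature name.
# Assumption: the XSD permits the element to be repeated.
ignore_attribute_nodes = function(features) {
  sprintf("<oml:ignore_attribute>%s</oml:ignore_attribute>", features)
}

nodes = ignore_attribute_nodes(c("index", "col_1", "col_2", "col_3"))
writeLines(nodes)
```

Each feature then gets its own element in the description XML, so no separator has to be parsed on the server side.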

Btw., I can't find the XSD schema anymore; it used to be here, but now it's gone.

joaquinvanschoren commented 5 years ago

Hi Giuseppe, the correct XSD is in the API docs:!/data/post_data

Direct link:

janvanrijn commented 5 years ago

Repository switched name. It's now:

joaquinvanschoren commented 5 years ago

I hope that is the exact same XSD file :)

giuseppec commented 5 years ago

@FlorianPargent I think I fixed it, could you check if everything you need works now without errors?

FlorianPargent commented 5 years ago

I reuploaded my dataset but still got the following warning:

In if (! newXMLNode(name, as.character(val), parent = parent,  :
  the condition has length > 1 and only the first element will be used

(Sorry, the original warning message was in German; this is the English equivalent.)

Unfortunately it still does not work:

dat = getOMLDataSet(41242)
dat$desc$ignore.attribute

still returns: [1] "FL_DATECRS_DEP_TIME"
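The returned value is consistent with the two names having been joined without a separator somewhere in the upload or download path. Where exactly that happens is not shown in this thread, so the following base-R sketch only illustrates the symptom, not the package internals:

```r
# The two names originally passed as ignore.attribute.
ignored = c("FL_DATE", "CRS_DEP_TIME")

# Joining them without a separator reproduces the value seen above.
joined = paste0(ignored, collapse = "")
print(joined)

# The joined string no longer matches either original name, so it cannot
# identify a column to ignore.
print(joined %in% ignored)
```

Because the joined string matches no real column name, any code that relies on ignore.attribute to select columns would fail to find them.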


task = convertOMLDataSetToMlr(dat)


Error in makeSupervisedTask("regr", data, target, weights, blocking, coordinates,  : 
  Column names of data doesn't contain target var: ARR_DELAY


"ARR_DELAY" %in% names(dat$data)
[1] TRUE

FlorianPargent commented 5 years ago

I also have a related question: where in the XSD files do I find which data types are actually required for the column names and data values in my datasets? As I understand it, the XSD file linked above only refers to the meta features in the dataset description but not to the dataset itself, right?

giuseppec commented 5 years ago

argh, forgot to fix another thing... It was like

Hope it works now. Added a unit test that checks the whole process and it looked ok, i.e.:

    iris = mlr::getTaskData(iris.task)
    desc = makeOMLDataSetDescription(
      name = "iris",
      description = "iris with ignored features Sepal.Width and Petal.Length",
      ignore.attribute = c("Sepal.Width", "Petal.Length"),
      default.target.attribute = "Species"
    )
    d = makeOMLDataSet(desc, data = iris)
    did = uploadOMLDataSet(d)
    d2 = getOMLDataSet(did)

Regarding your other question: Did you look at the .arff file that is uploaded? It also contains some further meta-info about the data set itself.
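To see which types an .arff file declares for each column, you can read its header and list the @attribute lines. The sketch below uses only base R and a small made-up ARFF file written to a temporary path; the file contents are illustrative, not the actual uploaded dataset:

```r
# Write a tiny, made-up ARFF file to a temporary location.
arff_lines = c(
  "@relation iris",
  "@attribute Sepal.Length numeric",
  "@attribute Species {setosa,versicolor,virginica}",
  "@data",
  "5.1,setosa"
)
path = tempfile(fileext = ".arff")
writeLines(arff_lines, path)

# Keep only the header, i.e. everything before the @data marker.
lines = readLines(path)
data_start = grep("^@data", lines, ignore.case = TRUE)[1]
header = lines[seq_len(data_start - 1)]

# Split each @attribute line into a column name and its declared type.
attrs = grep("^@attribute", header, ignore.case = TRUE, value = TRUE)
parts = regmatches(
  attrs,
  regexec("^@attribute\\s+(\\S+)\\s+(.*)$", attrs, ignore.case = TRUE)
)
attr_table = data.frame(
  name = vapply(parts, `[`, character(1), 2),
  type = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
print(attr_table)
```

This simple pattern assumes attribute names without spaces or quotes; quoted names would need a more careful parser.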

FlorianPargent commented 5 years ago

Thanks a lot for fixing, it works now!

FlorianPargent commented 5 years ago

Unfortunately, there is now a problem with downloading datasets when no feature is specified in ignore.attribute:


library(OpenML)
library(data.table)

loadOMLConfig(path = "~/.openml/config", assign = TRUE)

dat = data.table(a = rnorm(100), b = rnorm(100), c = rnorm(100))

desc = makeOMLDataSetDescription(
  name = "test-ignore-bug",
  description = "this is just to test a bug",
  default.target.attribute = "a"
)

# create OML dataset
oml_dat = makeOMLDataSet(
  desc = desc,
  data = dat,
  colnames.old = colnames(dat),
  colnames.new = colnames(dat),
  target.features = "a")

id = uploadOMLDataSet(oml_dat)

dat2 = getOMLDataSet(id)

deleteOMLObject(id, object = "data")


Error in getOMLDataSet(id) : 
  Assertion on 'desc$ignore.attribute' failed: Must be a subset of {'a','b','c','NA'}.

giuseppec commented 5 years ago

OK, I introduced a bug because I thought it would be a good idea to do stricter argument checks. But this causes many other issues, especially with already uploaded data sets (with wrong or missing field names). It should work again now.
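For illustration, a more lenient check could drop missing or unknown names with a warning instead of failing hard. This is only a sketch under that assumption; clean_ignore_attribute is a hypothetical helper and not what the package actually does:

```r
# Hypothetical helper (not the OpenML package's implementation):
# tolerate NULL/NA and unknown names instead of raising an assertion error.
clean_ignore_attribute = function(ignore, columns) {
  if (is.null(ignore)) return(character(0))
  ignore = ignore[!is.na(ignore)]        # an unset field may arrive as NA
  unknown = setdiff(ignore, columns)
  if (length(unknown) > 0) {
    warning("Dropping unknown ignore attribute(s): ",
            paste(unknown, collapse = ", "))
  }
  intersect(ignore, columns)             # keep only names that exist
}

cols = c("a", "b", "c")
none_set = clean_ignore_attribute(NA_character_, cols)
valid = clean_ignore_attribute(c("a", "b"), cols)
stale = suppressWarnings(clean_ignore_attribute(c("a", "old_name"), cols))
```

With this approach, older data sets whose descriptions contain wrong or missing field names would still download, at the cost of silently skipping names that cannot be matched.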