openml / openml-r

R package to interface with OpenML
http://openml.github.io/openml-r/

a question regarding dataset tag #421

Closed BayanIbra closed 5 years ago

BayanIbra commented 5 years ago

Do the dataset tags shown on each dataset page correspond to the tags used by the function listOMLRunEvaluations? I am trying to get results for tag = "uci" using listOMLRunEvaluations, but nothing is returned. Are tags that are added on a dataset page on the website picked up by the function?

Please advise,

PhilippPro commented 5 years ago

Hi! No, these are different things. You can get the data sets tagged uci with a = listOMLDataSets(tag = "uci")

Runs have their own tags, which you can set when you upload an experiment. For example, you can retrieve the results you uploaded by the tag you gave them at upload time.

Here you can see an example: https://github.com/ja-thomas/OMLbots/blob/master/HowToWriteABot.Rmd

For example, you can get some runs that I uploaded: my_runs = listOMLRunEvaluations(tag = "mysimpleBot")

giuseppec commented 5 years ago

Ok, it is very inconvenient but you could do this:

# get all uci data sets
ds = listOMLDataSets(tag = "uci")
# get all classification tasks where 10-fold CV is used to estimate the performance
tasks = listOMLTasks(task.type = "Supervised Classification", estimation.procedure = "10-fold Crossvalidation")
# subset those tasks so that you only have tasks based on uci data sets
tasks = tasks[tasks$data.id %in% ds$data.id, ]

# note that there can still be multiple tasks for each data set (you probably want only one)
table(tasks$name)

# get results using the task id (increase the total.limit to get more results)
res = chunkOMLlist("listOMLRunEvaluations", 
  task.id = tasks$task.id, 
  evaluation.measure = "predictive_accuracy", 
  total.limit = 100000)
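
The note above about multiple tasks per data set can be handled before fetching results. Here is a minimal sketch that uses a toy data frame in place of the real output of listOMLTasks; the ids and names are made up, and it only assumes the task.id, data.id and name columns used above. It keeps the first task listed for each data set, which is an arbitrary choice:

```r
# toy stand-in for the tasks data frame (made-up ids, not real OpenML tasks)
tasks = data.frame(
  task.id = c(1, 2, 3),
  data.id = c(10, 10, 20),
  name = c("data-a", "data-a", "data-b"))

# duplicated() flags every repeated data.id after its first occurrence,
# so negating it keeps exactly one task per data set
tasks.unique = tasks[!duplicated(tasks$data.id), ]
```

If you prefer a specific task per data set, sort the data frame by your own criterion before dropping the duplicates.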

I've already proposed a server change here https://github.com/openml/OpenML/issues/530

giuseppec commented 5 years ago

Ah, there is a better way: you can use the data.tag argument of listOMLTasks

tasks = listOMLTasks(data.tag = "uci", task.type = "Supervised Classification", 
  estimation.procedure = "10-fold Crossvalidation")

# note that there can still be multiple tasks for each data set (you probably want only one task per data)
table(tasks$name)

# get results using the task id (increase the total.limit to get more results)
res = chunkOMLlist("listOMLRunEvaluations", 
  task.id = tasks$task.id, 
  evaluation.measure = "predictive_accuracy", 
  total.limit = 100000)

BayanIbra commented 5 years ago

That's great, thank you both @PhilippPro, @giuseppec for your responses. But what if I want other evaluation measures, not just predictive accuracy (such as area_under_roc_curve)?
Can I get this using listOMLTasks?

Thanks again, Bayan

giuseppec commented 5 years ago

Ok, I think this requires some clarification (it really is a bit confusing). Here is my attempt to explain it: I wouldn't do this via listOMLTasks, since you would then obtain fewer results if you pick the "wrong" task (see, e.g., table(tasks$evaluation.measures)).
The evaluation.measure stored in a task can be seen as the "default measure" or "suggested measure" on which the performance of competing algorithms should be evaluated (it is simply what the person who created the task selected). For example, look at the number of runs for the two tasks https://www.openml.org/t/2 and https://www.openml.org/t/145952. Both tasks are based on the anneal data. However, the first one uses predictive_accuracy and the second one uses precision. In general, you could simply merge the runs of those two tasks, since OpenML computes all evaluation measures for all runs anyway (though the two tasks may use different train-test splits).

Anyway, here is the answer to your question. I would suggest doing the following separately for each measure you are interested in (and merging the data frames afterwards):

# get results using the task id (increase the total.limit to get more results)
res.acc = chunkOMLlist("listOMLRunEvaluations", 
  task.id = tasks$task.id, 
  evaluation.measure = "predictive_accuracy", 
  total.limit = 1000)
res.auc = chunkOMLlist("listOMLRunEvaluations", 
  task.id = tasks$task.id, 
  evaluation.measure = "area_under_roc_curve", 
  total.limit = 1000)

Then just join res.acc and res.auc.
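
A minimal sketch of that join with base R's merge, using toy data frames in place of the real results. The column names run.id, task.id, predictive.accuracy and area.under.roc.curve are assumptions about what listOMLRunEvaluations returns, and the values are made up, so check names(res.acc) and names(res.auc) before adapting it:

```r
# toy stand-ins for res.acc and res.auc (assumed column names, made-up values)
res.acc = data.frame(
  run.id = c(1, 2, 3),
  task.id = c(10, 10, 11),
  predictive.accuracy = c(0.91, 0.88, 0.75))
res.auc = data.frame(
  run.id = c(1, 2, 4),
  task.id = c(10, 10, 11),
  area.under.roc.curve = c(0.95, 0.90, 0.80))

# inner join on the run and task ids: keeps only runs that have both measures
res = merge(res.acc, res.auc, by = c("run.id", "task.id"))

# full join: keeps every run and fills a missing measure with NA
res.all = merge(res.acc, res.auc, by = c("run.id", "task.id"), all = TRUE)
```

The inner join is the safer default if you want to compare the two measures run by run; the full join is useful for spotting runs where one measure was not computed.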

giuseppec commented 5 years ago

I guess this issue can be closed here. If not, just reopen it.