openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Task splits as parquet files #1162

Open sebffischer opened 2 years ago

sebffischer commented 2 years ago

Are there plans to also provide the task splits as parquet files in the future? This would allow us to remove the arff dependencies (once all the datasets are successfully migrated).

As an example with respect to storage size, here is a comparison of the file sizes of the task splits for the NYC taxi dataset in Parquet and ARFF.

library(mlr3oml)
library(duckdb)
#> Loading required package: DBI

otask = OMLTask$new(359943)
task_splits = otask$task_splits
#> INFO  [12:21:06.213] Retrieving JSON {url: `https://www.openml.org/api/v1/json/task/359943`, authenticated: `TRUE`}
#> INFO  [12:21:06.955] Retrieving ARFF {url: `https://api.openml.org//api_splits/get/359943/Task_359943_splits.arff`, authenticated: `TRUE`}

file_arff = tempfile(fileext = ".arff")
file_parquet = tempfile(fileext = ".parquet")

# Load the splits into an in-memory DuckDB table, then export them as ZSTD-compressed Parquet
con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "tbl", task_splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY tbl TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD') ", file_parquet))
#> [1] 5818350
mlr3oml::write_arff(task_splits, file_arff)

file.size(file_parquet) / file.size(file_arff)
#> [1] 0.1619774

Created on 2022-08-30 by the reprex package (v2.0.1)
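
To complement the size comparison above, here is a minimal sketch of how splits stored as Parquet could be read back without any ARFF parser. It does not contact OpenML: the `splits` data frame is a hypothetical stand-in for `otask$task_splits`, and its column names (`type`, `row_id`, `rep`, `fold`) are assumptions made for illustration, not the actual schema of the splits file.

```r
library(DBI)
library(duckdb)

# Hypothetical stand-in for otask$task_splits; the column names are assumptions.
splits = data.frame(
  type = rep(c("TRAIN", "TEST"), times = c(8, 2)),
  row_id = 0:9,
  rep = 0L,
  fold = 0L,
  stringsAsFactors = FALSE
)

file_parquet = tempfile(fileext = ".parquet")

# Write the splits as ZSTD-compressed Parquet via an in-memory DuckDB database.
con = DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "splits", splits, row.names = FALSE)
DBI::dbExecute(con, sprintf("COPY splits TO '%s' (FORMAT 'PARQUET', CODEC 'ZSTD')", file_parquet))

# Read the Parquet file back; no ARFF reader is involved at any point.
splits_back = DBI::dbGetQuery(
  con,
  sprintf("SELECT * FROM read_parquet('%s') ORDER BY row_id", file_parquet)
)
DBI::dbDisconnect(con, shutdown = TRUE)

# Check that the round trip preserved the values.
roundtrip_ok = isTRUE(all.equal(splits, splits_back, check.attributes = FALSE))
```

If `roundtrip_ok` is `TRUE`, the Parquet file carries the same information as the in-memory splits, which is the property a client would need in order to drop its ARFF dependency for task splits.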

joaquinvanschoren commented 2 years ago

Yes, that is the plan :).