openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Input data in CSV, but it seems I need it in ARFF if I want autoxgboost, mlr3automl, ranger #498

Closed: alanwilter closed this issue 1 year ago

alanwilter commented 1 year ago

Derived from discussion in https://github.com/openml/automlbenchmark/pull/450

I ran the example test first:

yes | python runbenchmark.py autoxgboost:latest example test -m docker -s force

# and

yes | python runbenchmark.py tpot example test -m docker -s force

Both worked and both seem to use the same input data from:

~/cache/openml/org/openml/www/datasets/31/
dataset.arff
dataset.pkl.py3
dataset.pq
dataset_test_0.arff
dataset_test_1.arff
dataset_train_0.arff
dataset_train_1.arff
description.xml
features.xml
features.xml.pkl

My own input data is only in CSV, and I was able to run it successfully against 13 out of 19 frameworks.

The usual command is:

yes | python3 runbenchmark.py autoxgboost automl_config_docker 1h4c -m docker -i . -s force

and it will fail for autoxgboost with:

CalledProcessError: Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/test_data/differentiate_cancer_train.csv…

More specifically:

...
Parse with reader=readr : /input/test_data/differentiate_cancer_train.csv
Error in parseHeader(path) :
  Invalid column specification line found in ARFF header:
f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,f_10,f_11,f_12,f_13,f_14,f_15,f_16,f_17,f_18,f_19,f_20,f_21,f_22,f_23,f_24,f_25,f_26,f_27,f_28,f_29,f_30,f_31,f_32,f_33,f_34,f_35,f_36,f_37,f_38,f_39,f_40,f_41,f_42,f_43,f_44,f_45,f_46,f_47,f_48,f_49,f_50,...

Searching around, I found this: https://machinelearningmastery.com/load-csv-machine-learning-data-weka/

It is about Weka, but it makes me wonder whether my data needs to be converted anyway. Now I'm wondering how I can do that.

BTW, the ranger and mlr3automl frameworks failed in the same way.

My input data is a CSV table with headers like f_1,f_2,...,f_4096,target: 4097 columns and 50 rows of floats.
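
For reference, if the R-based frameworks really do expect ARFF, a file for data shaped like mine would presumably need a header along these lines (just a rough sketch; the relation name and the {0,1} class labels are assumptions, my actual target values may differ):

@RELATION differentiate_cancer

@ATTRIBUTE f_1 NUMERIC
@ATTRIBUTE f_2 NUMERIC
...
@ATTRIBUTE f_4096 NUMERIC
@ATTRIBUTE target {0,1}

@DATA
0.123,0.456,...,0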

automl_config_docker:

---
#for doc purpose using <placeholder:default_value> syntax when it applies.

#FORMAT: global defaults are defined in config.yaml
- name: __dummy-task
  enabled: false # actual default is `true` of course...
  openml_task_id: 0
  metric: # the first metric in the task list will be optimized against and used for the main result, the other ones are optional and purely informative. Only the metrics annotated with (*) can be used as a performance metric.
    -  # classification
    - acc # (*) accuracy
    - auc # (*) area under curve
    - logloss # (*) log loss
    - f1 # F1 score
    -  # regression
    - mae # (*) mean absolute error
    - mse # (*) mean squared error
    - rmse # root mean squared error
    - rmsle # root mean squared log error
    - r2 # R^2 score
  folds: 1
  max_runtime_seconds: 1200
  cores: 1
  max_mem_size_mb: -1
  ec2_instance_type: m5.large

# local defaults (applying only to tasks defined in this file) can be defined in a task named "__defaults__"
- name: __defaults__
  folds: 1
  cores: 4
  max_runtime_seconds: 400

- name: teddata
  dataset:
    train: /input/test_data/differentiate_cancer_train.csv
    test: /input/test_data/differentiate_cancer_test.csv
    type: binary
    target: target
  folds: 1

and the only change in resources/config.yaml is to use python: 3.8.

alanwilter commented 1 year ago

I can confirm this: after converting my input data from CSV to ARFF, all three of these frameworks now work.
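
For anyone else hitting this, here is a minimal sketch of one way to do the CSV-to-ARFF conversion with pandas (the file names and the target column name come from my setup above; the helper itself, including writing the ARFF header by hand, is just an illustration and not part of the benchmark's own tooling):

import pandas as pd

def csv_to_arff(csv_path, arff_path, target="target", relation="differentiate_cancer"):
    # Read the CSV and treat every column except the target as numeric.
    df = pd.read_csv(csv_path)
    classes = sorted(df[target].astype(str).unique())
    with open(arff_path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for col in df.columns:
            if col == target:
                # Nominal attribute listing the observed class labels.
                f.write(f"@ATTRIBUTE {col} {{{','.join(classes)}}}\n")
            else:
                f.write(f"@ATTRIBUTE {col} NUMERIC\n")
        f.write("\n@DATA\n")
        # Data rows are plain CSV, in the same column order as the attributes.
        df.to_csv(f, header=False, index=False)

csv_to_arff("differentiate_cancer_train.csv", "differentiate_cancer_train.arff")
csv_to_arff("differentiate_cancer_test.csv", "differentiate_cancer_test.arff")

After converting, the train/test entries in the benchmark task definition just need to point at the .arff files instead of the .csv ones.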