openml / automlbenchmark

OpenML AutoML Benchmarking Framework
MIT License

Input data in CSV but it seems I need in ARFF if I want autoxgboost, mlr3automl, ranger #498

Closed alanwilter closed 1 year ago

alanwilter commented 1 year ago

Derived from discussion in

I ran the example test first:

yes | python runbenchmark.py autoxgboost:latest example test -m docker -s force

# and

yes | python runbenchmark.py tpot example test -m docker -s force

Both worked and both seem to use the same input data from:


My own input data is only available as CSV, and I was able to run it successfully against 13 out of 19 frameworks.

The command I usually run is:

yes | python3 runbenchmark.py autoxgboost automl_config_docker 1h4c -m docker -i . -s force

and it fails for autoxgboost with:

CalledProcessError: Command 'Rscript --vanilla -e ".libPaths('/bench/frameworks/autoxgboost/lib'); source('/bench/frameworks/autoxgboost/exec.R'); run('/input/test_data/differentiate_cancer_train.csv…

More specifically:

Parse with reader=readr : /input/test_data/differentiate_cancer_train.csv
Error in parseHeader(path) :
  Invalid column specification line found in ARFF header:

Searching around, I found this:

It is about Weka, but it makes me wonder whether my data needs to be converted anyway, and if so, how to do it.

BTW, the ranger and mlr3automl frameworks failed in the same way.
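The error message suggests that these R-based frameworks try to read the input as ARFF even though the file has a .csv extension. As a quick sanity check before launching a run, the sketch below tests whether a file starts with an ARFF-style header (`@relation`, `@attribute`, `@data`). The helper name `looks_like_arff` and the `max_lines` limit are made up for illustration and are not part of automlbenchmark.

```python
def looks_like_arff(path, max_lines=10000):
    """Return True if the file has an ARFF-style header before its data.

    Hypothetical helper: it only inspects the header keywords and does not
    validate attribute types or the data section.
    """
    seen_relation = False
    seen_attribute = False
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line_no, line in enumerate(fh):
            if line_no >= max_lines:
                break
            text = line.strip().lower()
            # Blank lines and '%' comments are allowed anywhere in the header.
            if not text or text.startswith("%"):
                continue
            if text.startswith("@relation"):
                seen_relation = True
            elif text.startswith("@attribute"):
                seen_attribute = True
            elif text.startswith("@data"):
                return seen_relation and seen_attribute
            else:
                # Any other line before @data (e.g. a CSV header row) is not ARFF.
                return False
    return False
```

Calling it on a plain CSV file such as the one used here should return False, since the first non-empty line is the column header rather than an `@relation` declaration.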

My input data is a CSV table with headers like f_1,f_2,...,f_4096,target: 4097 columns and 50 rows of floats.


#for doc purpose using <placeholder:default_value> syntax when it applies.

#FORMAT: global defaults are defined in config.yaml
- name: __dummy-task
  enabled: false # actual default is `true` of course...
  openml_task_id: 0
  metric: # the first metric in the task list will be optimized against and used for the main result, the other ones are optional and purely informative. Only the metrics annotated with (*) can be used as a performance metric.
    -  # classification
    - acc # (*) accuracy
    - auc # (*) area under curve
    - logloss # (*) log loss
    - f1 # F1 score
    -  # regression
    - mae # (*) mean absolute error
    - mse # (*) mean squared error
    - rmse # root mean squared error
    - rmsle # root mean squared log error
    - r2 # R^2 score
  folds: 1
  max_runtime_seconds: 1200
  cores: 1
  max_mem_size_mb: -1
  ec2_instance_type: m5.large

# local defaults (applying only to tasks defined in this file) can be defined in a task named "__defaults__"
- name: __defaults__
  folds: 1
  cores: 4
  max_runtime_seconds: 400

- name: teddata
  dataset:
    train: /input/test_data/differentiate_cancer_train.csv
    test: /input/test_data/differentiate_cancer_test.csv
    type: binary
    target: target
  folds: 1

and the only change in resources/config.yaml is to use python: 3.8.

alanwilter commented 1 year ago

I can confirm this: after converting my input data from CSV to ARFF, all three frameworks worked.
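For anyone hitting the same error, here is a minimal sketch of one way to do such a conversion with only the Python standard library. It is not necessarily the method used above. It assumes a CSV with a header row, simple column names without spaces (like f_1 ... f_4096), all feature columns numeric, and a nominal label column named `target`. The function name `csv_to_arff` and its parameters are made up for illustration.

```python
import csv


def csv_to_arff(csv_path, arff_path, relation="data", target="target",
                class_values=None):
    """Write a dense ARFF file from a CSV file that has a header row.

    Assumptions (not checked): every column except `target` is numeric,
    and column names need no quoting. `target` is declared nominal.
    """
    with open(csv_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = [row for row in reader if row]  # skip empty lines

    target_idx = header.index(target)
    if class_values is None:
        # Fall back to the labels seen in this file, in sorted order.
        class_values = sorted({row[target_idx] for row in rows})
    nominal = ",".join(class_values)

    with open(arff_path, "w", newline="") as dst:
        dst.write("@relation %s\n\n" % relation)
        for i, name in enumerate(header):
            if i == target_idx:
                dst.write("@attribute %s {%s}\n" % (name, nominal))
            else:
                dst.write("@attribute %s numeric\n" % name)
        dst.write("\n@data\n")
        for row in rows:
            dst.write(",".join(row) + "\n")
```

When converting a train/test pair, passing the same explicit `class_values` (for example `["0", "1"]`) to both calls keeps the two ARFF headers identical even if one split happens to miss a class.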