openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License
396 stars 132 forks source link

Keeping object dtype and string dtype #435

Open yinweisu opened 2 years ago

yinweisu commented 2 years ago

AMLB is converting object dtype and string dtype to category dtype: https://github.com/openml/automlbenchmark/blob/master/amlb/datasets/file.py#L310-L312

This would bring trouble when the framework requires a certain dtype to operate properly. For example, autogluon multimodality requires the text and image path column to be an object.

PGijsbers commented 2 years ago

I think instead it should be optional to convert string/object to categorical dtype (it's conceptually similar to how we can also provided encoded data).

mfeurer commented 2 years ago

Hi, just checking: this is only if the dataset is loaded as a CSV, right? Or put in different words: If a dataset is loaded from an arff file, this issue does not kick in?

PGijsbers commented 2 years ago

ARFF file dtypes should be determined by their header (and when they are loading through openml-python they are determined by openml's features).

sebhrusen commented 2 years ago

We started to support string dtype only quite recently in the openml loader, and the file loader hasn't been updated accordingly. But given the problematic raised by @yinweisu, we should probably think about exposing some cherry-picked configurations to the frameworks themselves. In this case, I think it's easy to externalize this dtype conversion map into config.yaml, from simplified dtype as detected by pandas to the dtypes needed by the framework, for example, current behaviour can be described as:

csv.dtype_conversion:
    string: category
    object: category

Then, remains the questions of:

  1. the best default behaviour: no conversion at all? in which case many frameworks may need to override this.
  2. how to make this configurable at framework level (ie. in frameworks.yaml) and not only in config.yaml: this is only a technical issue (loader should be aware of which framework is being used). I can imagine storing the framework currently in use in a thread local so that the config will automatically lookup the framework config namespace before the default one. This opens the door to a great amount of configurability attached to the framework itself, can be seen as a good thing or a bad one :)
Innixma commented 2 years ago

I'd recommend the default be no conversion unless there is an argument for conversion being beneficial. Most AutoML users will simply load their data from CSV with pd.read_csv(data_path) and throw that at AutoML. I think having the loading logic be as similar as possible to that scenario makes sense.

PGijsbers commented 2 years ago

The default of no conversion makes the most sense to me, given that we would add the functionality to configure this per framework (and add the configurations as necessary to stable and latest). But I want to avoid a scenario where only a few frameworks work out of the box, and all others need custom configurations to work.

I think that since we started this project many AutoML frameworks have improved their input handling. It's probably best to test how each framework handles string/object input, but I think a lot of frameworks can handle it these days (and quite a few side-step this by just loading the file directly).