openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/

Automatically load dataset as pandas #1251

Open mfeurer opened 1 year ago

mfeurer commented 1 year ago

This issue proposes that we (1) load datasets as pandas DataFrames by default and (2) rewrite the dataset loader to work with pandas internally, converting to a NumPy array only if the user requests one.

The reasons for this proposal are:

  1. pandas is much more stable than it was a few years ago when we started this project, and it can now also handle strings properly (see #1107).
  2. pandas can properly encode categorical columns, which can make it easier for projects building on OpenML-Python to handle these categories.
  3. We will store files as Parquet in the background anyway, and Parquet has to be accessed through pandas.
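
To make the proposal concrete, here is a minimal sketch of the "pandas first, numpy on request" idea. It is not the actual loader: `to_requested_format` and its `dataset_format` argument are hypothetical names used for illustration only, and the small DataFrame stands in for a loaded dataset. It assumes every column is numeric or categorical; a plain string column would need its own handling before the numpy conversion.

```python
import numpy as np
import pandas as pd


def to_requested_format(df: pd.DataFrame, dataset_format: str = "dataframe"):
    """Hypothetical helper: keep the DataFrame, or convert it to numpy on request."""
    if dataset_format == "dataframe":
        # Default path: return the pandas object untouched, dtypes included.
        return df
    if dataset_format == "array":
        encoded = df.copy()
        for name in encoded.columns:
            if isinstance(encoded[name].dtype, pd.CategoricalDtype):
                # Replace each category by its integer code. pandas uses -1
                # for missing values, which we map to NaN.
                codes = encoded[name].cat.codes.astype("float64")
                encoded[name] = codes.where(codes >= 0, np.nan)
        # Assumes all remaining columns are numeric.
        return encoded.to_numpy(dtype="float64")
    raise ValueError(f"Unknown dataset_format: {dataset_format!r}")


# Stand-in for a loaded dataset with one numeric and one categorical column.
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3],
        "species": pd.Categorical(["setosa", "setosa", "virginica"]),
    }
)

as_frame = to_requested_format(df)
as_array = to_requested_format(df, dataset_format="array")
```

With this shape, the categorical information (reason 2) stays available to anyone using the default DataFrame output, and the numpy path becomes a thin conversion step on top of it.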