openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Proposal: Use pandas str type for str datasets #1107

Open mfeurer opened 3 years ago

mfeurer commented 3 years ago

Since pandas 1.0 there is an explicit string data type which can be used instead of the object dtype for text columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html

I suggest we use it for datasets containing string features, as it is more descriptive and is the suggested way of representing strings in pandas.

This would for example make the Titanic dataset dtypes much more descriptive. Right now they are:

pclass          uint8
survived     category
name           object
sex          category
age           float64
sibsp           uint8
parch           uint8
ticket         object
fare          float64
cabin          object
embarked     category
boat           object
body          float64
home.dest      object

and with the string dtype they would be:

pclass          uint8
survived     category
name           string
sex          category
age           float64
sibsp           uint8
parch           uint8
ticket         string
fare          float64
cabin          string
embarked     category
boat           string
body          float64
home.dest      string
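
For illustration, here is a minimal sketch of the conversion on the pandas side. It is not how openml-python loads data; the tiny frame and the helper name `object_columns_to_string` are made up for this example, and it assumes that every object column really holds text.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic columns; the text columns are
# created with dtype=object explicitly to mirror the current behaviour.
df = pd.DataFrame(
    {
        "pclass": pd.Series([1, 3], dtype="uint8"),
        "name": pd.Series(["Passenger A", "Passenger B"], dtype="object"),
        "sex": pd.Categorical(["female", "male"]),
        "cabin": pd.Series(["B5", None], dtype="object"),
    }
)


def object_columns_to_string(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every object-dtype column uses the string dtype.

    Assumption: all object columns contain text (or missing values).
    """
    object_cols = frame.select_dtypes(include="object").columns
    return frame.astype({col: "string" for col in object_cols})


converted = object_columns_to_string(df)
print(converted.dtypes)
```

Numeric and categorical columns are left untouched, and missing values in the converted columns become `pd.NA`.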

mfeurer commented 2 years ago

I just had a short look into this and it turns out this is harder than originally anticipated due to the following reasons:

  1. The internal data loading currently only distinguishes between categorical and numerical features -> need to extend this to cover strings
  2. We currently cache the boolean array mentioned in 1. -> need to update what's cached and potentially add something like a cache format version number
  3. There's loading from feather, arff and parquet -> feather and arff are just a bunch of work, not sure about parquet. Is parquet support developed further?
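
To make points 1 and 2 more concrete, here is a rough sketch of what a versioned cache entry could look like. Everything in it is hypothetical: `FeatureKind`, `CACHE_FORMAT_VERSION`, the JSON layout and both function names are invented for illustration and do not describe the existing cache. It assumes the old entry is a bare list of booleans (True = categorical), which cannot tell numeric from string features, so old entries are treated as stale.

```python
import json
from enum import Enum
from typing import List, Optional

# Hypothetical version number; a bare boolean list counts as the old format.
CACHE_FORMAT_VERSION = 2


class FeatureKind(Enum):
    NUMERIC = "numeric"
    CATEGORICAL = "categorical"
    STRING = "string"


def dump_feature_kinds(kinds: List[FeatureKind]) -> str:
    """Serialize the per-feature kinds together with a format version."""
    payload = {"version": CACHE_FORMAT_VERSION, "kinds": [k.value for k in kinds]}
    return json.dumps(payload)


def load_feature_kinds(blob: str) -> Optional[List[FeatureKind]]:
    """Return the cached kinds, or None if the entry must be rebuilt.

    Old-format entries and entries with another version are reported as
    stale instead of being guessed at.
    """
    payload = json.loads(blob)
    if not isinstance(payload, dict):
        return None  # old format: bare boolean list
    if payload.get("version") != CACHE_FORMAT_VERSION:
        return None  # written by a different cache format
    return [FeatureKind(value) for value in payload["kinds"]]


blob = dump_feature_kinds([FeatureKind.NUMERIC, FeatureKind.STRING])
print(load_feature_kinds(blob))
print(load_feature_kinds(json.dumps([True, False])))
```

Returning None for stale entries keeps the decision to re-download or re-parse with the caller, rather than silently mapping old booleans onto the new kinds.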

PGijsbers commented 2 years ago

Thanks for looking into this. Re 3: I think parquet should become the preferred format as openml-python matures its parquet usage.