Open mfeurer opened 3 years ago
I just had a short look into this and it turns out this is harder than originally anticipated due to the following reasons:
Thanks for looking into this. Re 3: I think parquet should become the preferred format as openml-python matures its parquet usage.
Since pandas 1.0 there is a dedicated string data type that can be used in place of the object dtype for text columns: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
I suggest we use it for datasets containing string features, as it is more descriptive and is the suggested way of representing strings in pandas.
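To illustrate the difference, here is a minimal sketch on a made-up two-column frame (the column names and values are hypothetical and are not taken from any OpenML dataset). It assumes pandas >= 1.0 and only uses `DataFrame.astype` with the `"string"` dtype alias; how openml-python would actually apply the conversion is left open.

```python
import pandas as pd

# Hypothetical toy data; the text column is stored as object on purpose,
# which is how string features have traditionally been represented.
df = pd.DataFrame(
    {
        "name": pd.Series(["Alice", "Bob"], dtype=object),
        "age": [30, 41],
    }
)
print(df.dtypes)

# Opt in to the dedicated string dtype for the text column only;
# the numeric column keeps its dtype.
df_str = df.astype({"name": "string"})
print(df_str.dtypes)
```

Converting per column like this leaves numeric and categorical features untouched, so only the features that are really free text would change their reported dtype.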
This would for example make the Titanic dataset dtypes much more descriptive. Right now they are:
and they would be