openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/

Unit tests fail due to switch to Parquet #1157

Open · PGijsbers opened 2 years ago

PGijsbers commented 2 years ago

I wanted to investigate which unit tests fail and why (and how to get them working again). One of the changes that broke multiple tests was the switch from ARFF files to Parquet. For example, a dataset which had a boolean value in ARFF would have been a categorical type with two allowed values, while in Parquet it is simply a pandas boolean type. In another example, see below, the Parquet file stored numeric data as float64 but our tests expect uint8 (because there are no missing values and casting the column to uint8 loses no data).

[Screenshot: failing test output comparing the expected uint8 dtypes with the float64 dtypes loaded from Parquet]

How do we proceed with this? I propose updating the unit tests/expected behavior where reasonable (for example, accepting the bool dtype instead of holding on to a two-valued categorical). Sometimes it might make sense to instead expect changes to the Parquet file (as in the example in the image), but at the same time I don't know whether the Parquet readers in all languages can deal with the different types. Either way, we can also update our Parquet loading logic to test whether type conversions are possible, so that our unit tests are robust to these changes.
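
If we go that route, the loading logic could probe for a lossless round trip before downcasting. A minimal sketch of that idea (the helper name is mine, this is not existing openml-python code):

```python
import pandas as pd

def downcast_if_lossless(col: pd.Series, target: str = "uint8") -> pd.Series:
    """Return `col` cast to `target` if the round trip loses no information, else unchanged."""
    if col.isna().any():  # missing values cannot be represented in an unsigned integer column
        return col
    cast = col.astype(target)
    # Cast back and compare: only keep the downcast if every value survives exactly.
    if (cast.astype(col.dtype) == col).all():
        return cast
    return col

# A float64 column as found in the Parquet file; the values fit uint8 without loss.
print(downcast_if_lossless(pd.Series([0.0, 1.0, 255.0])).dtype)   # uint8
print(downcast_if_lossless(pd.Series([0.5, 1.0, 255.0])).dtype)   # float64 (0.5 would be truncated)
```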

side note: I am not entirely sure why the data was stored as float64, when AFAIK the openml-python module was used to convert the data.

mfeurer commented 2 years ago

Hi, I think deciding this on a case-by-case basis makes sense. Answers about the existing cases:

PGijsbers commented 2 years ago

> Since the data was uploaded as arff, we can only guess this I assume.

Yes, but we already converted the stringy "True" and "False" to boolean True and False.

> and depending on how much caching is still needed

I'll look into that

mfeurer commented 2 years ago

> Yes, but we already converted the stringy "True" and "False" to boolean True and False.

You mean in Python? If that's the case, we should continue doing this I guess.
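
For reference, that conversion could look roughly like this when loading. A minimal sketch assuming pandas' nullable boolean dtype is acceptable (the helper is mine, not the actual openml-python implementation):

```python
import pandas as pd

def categorical_to_bool(col: pd.Series) -> pd.Series:
    """Map a {'True', 'False'} categorical column to pandas' nullable boolean dtype."""
    if isinstance(col.dtype, pd.CategoricalDtype) and set(col.cat.categories) <= {"True", "False"}:
        return col.astype(object).map({"True": True, "False": False}).astype("boolean")
    return col

# A column as the ARFF reader would produce it: a two-valued categorical of strings.
arff_style = pd.Series(["True", "False", None], dtype="category")
print(categorical_to_bool(arff_style).dtype)   # boolean, matching the Parquet representation
```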

sebffischer commented 2 years ago

It would be great if this whole conversion could be properly documented somewhere, because it makes migrating somewhat tricky.

Problems I have encountered are:

  1. different data types, such as the character --> bool conversion. I guess this cannot be avoided, but it would be good to have a summary somewhere documenting all the conversions that happen (see the sketch after this list for the kind of comparison I mean)
  2. different column names, which really should not happen I think
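
A minimal sketch of such a summary, assuming local copies of the same dataset in both formats have already been loaded as DataFrames (the helper name and file path are hypothetical, not part of openml-python):

```python
import pandas as pd

def summarize_conversion(df_arff: pd.DataFrame, df_parquet: pd.DataFrame) -> pd.DataFrame:
    """Tabulate column-name and dtype differences between two renditions of the same dataset."""
    cols = sorted(set(df_arff.columns) | set(df_parquet.columns))
    return pd.DataFrame(
        {
            "in_arff": [c in df_arff.columns for c in cols],
            "in_parquet": [c in df_parquet.columns for c in cols],
            "arff_dtype": [df_arff.dtypes.get(c) for c in cols],
            "parquet_dtype": [df_parquet.dtypes.get(c) for c in cols],
        },
        index=cols,
    )

# Hypothetical usage with the same dataset in both formats:
# df_arff = <DataFrame parsed from the ARFF file>
# df_parquet = pd.read_parquet("dataset.pq")
# print(summarize_conversion(df_arff, df_parquet))
```
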
PGijsbers commented 2 years ago

To be clear, are you talking specifically about how openml-python converts ARFF to a pandas.DataFrame? Or about how the conversion of ARFF to Parquet was done server-side?

sebffischer commented 2 years ago

I was talking about the conversion from ARFF to Parquet that was done on the server side, not about the openml-python package. I just mentioned it here because you already had a related discussion going on.

I think the naming bug was on the side of the R API, but I found a case where a column seems to have disappeared in the Parquet version of a dataset: https://github.com/openml/OpenML/issues/1165