openml / openml-data

For tracking issues related to OpenML datasets

Migration ARFF to Parquet on the OpenML server #50

Open PGijsbers opened 2 years ago

PGijsbers commented 2 years ago

This is a centralised discussion about the server-side changes (being) made to the datasets in their conversion from ARFF to Parquet. Related ongoing discussions that reference the server state of different datasets:

Let's keep the relevant information about the migration as it relates to server data in this thread. This is not for connector-specific discussions (for example, how openml-python handles this). @joaquinvanschoren @prabhant @sebffischer

sebffischer commented 2 years ago

A couple of remarks:

sebffischer commented 2 years ago

Also it would be great if the types string and categorical were distinguishable from looking at the data features (if this information is somehow updated using the new parquet files)

sebffischer commented 2 years ago

Also it seems that the parquet URLs from the test server are wrong. By wrong I mean that they point to the parquet URLs of the public server.

Edit: more info

PGijsbers commented 1 year ago

Also it would be great if the types string and categorical were distinguishable from looking at the data features (if this information is somehow updated using the new parquet files)

Do you have an example? We want to see if this issue was with the ARFF file or specifically introduced in the conversion.
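To check the ARFF side, a minimal sketch (not the actual conversion script) that reads the declared attribute types from an ARFF header. In ARFF, a string attribute is declared with the keyword `string`, while a nominal (categorical) attribute is declared with a brace-enclosed list of values, so the two are distinguishable at the source. The function name `arff_attribute_kinds` and the example header are made up for illustration, and the parsing is deliberately simplistic (it does not handle every quoting or escaping case).

```python
import re

# Matches header lines such as: @attribute name type  (the name may be quoted)
_ATTRIBUTE = re.compile(
    r"""^@attribute\s+(?P<name>'[^']*'|"[^"]*"|\S+)\s+(?P<type>.+?)\s*$""",
    re.IGNORECASE,
)


def arff_attribute_kinds(header_text):
    """Map each attribute name to its declared ARFF kind (hypothetical helper)."""
    kinds = {}
    for line in header_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section starts
        match = _ATTRIBUTE.match(line)
        if match is None:
            continue  # comments, @relation, blank lines
        name = match.group("name").strip("'\"")
        declared = match.group("type")
        if declared.startswith("{"):
            kinds[name] = "nominal"  # brace-enclosed value list
        else:
            kinds[name] = declared.split()[0].lower()  # numeric, string, date, ...
    return kinds


# Illustrative header only, not taken from an OpenML dataset.
example_header = """\
@relation example
@attribute id numeric
@attribute comment string
@attribute colour {red, green, blue}
@data
"""
kinds = arff_attribute_kinds(example_header)
print(kinds)
```

If the ARFF header already declares a column as `string` but the feature data reports it as nominal (or the other way round), that would suggest the mismatch comes from the feature metadata or the conversion rather than from the ARFF file.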

PGijsbers commented 1 year ago

Also it seems that the parquet urls from the test server are wrong.

Parquet URLs from the test server have been disabled for now, until we have a separate minio (or bucket) for the test server.

PGijsbers commented 1 year ago

https://github.com/openml/OpenML/issues/1165 this is kind of a weird bug

We'll look into that, and for the conversion scripts we'll have a closer look to preserve the feature data, or encode it into correct data types where ARFF was previously not expressive enough (e.g., boolean, 8-bit integers).
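As a rough illustration of what "encode it into correct data types" could mean, here is a sketch that suggests a narrower type for one column of already-parsed ARFF values. This is an assumption about the general idea, not the rule the conversion script uses: the helper name `suggest_narrow_type`, the returned type labels, and the choice to treat a column containing only "true"/"false" strings as boolean are all hypothetical.

```python
def suggest_narrow_type(values):
    """Suggest a narrower storage type for one column (hypothetical helper).

    `values` holds parsed ARFF cells: floats/ints for numeric columns,
    strings for nominal/string columns, and None for a missing value ('?').
    """
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"  # nothing to infer from

    if any(isinstance(v, str) for v in present):
        # Assumption: only true/false labels means the column is boolean.
        if all(isinstance(v, str) and v.lower() in {"true", "false"} for v in present):
            return "bool"
        return "string"

    if all(float(v).is_integer() for v in present):
        low, high = min(present), max(present)
        # Pick the smallest signed integer width that holds the observed range.
        for bits in (8, 16, 32, 64):
            if -(2 ** (bits - 1)) <= low and high <= 2 ** (bits - 1) - 1:
                return f"int{bits}"

    return "float64"  # non-integer values, or integers too large for int64


print(suggest_narrow_type([0.0, 1.0, 127.0]))
print(suggest_narrow_type(["TRUE", "false", None]))
```

A real conversion would also have to decide how missing values interact with the chosen type (for example, whether a nullable integer type is used), which this sketch sidesteps by ignoring `None`.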

sebffischer commented 1 year ago

This was a confusion from my side, sorry!

PGijsbers commented 1 year ago

The following will be changed for the conversion script:

Additionally, the feature metadata needs to be updated: