openml / openml-python

Python module to interface with OpenML
https://openml.github.io/openml-python/main/

Be more lenient with feature type when determining class labels #1311

Closed PGijsbers closed 6 months ago

PGijsbers commented 6 months ago

For a supervised classification task, the task.class_labels is determined automatically here:

https://github.com/openml/openml-python/blob/326bf0b877696cbb1004a173b0b2fe0e09557e24/openml/datasets/dataset.py#L911-L913

Dataset creators are not always meticulous, and the target's feature type is sometimes listed as string instead of nominal, in which case task.class_labels will be None. A simple workaround would be to add a case for feature.data_type == 'string' that fetches the unique values from the column. It might be worth encouraging users to fix the feature type of the dataset, but unfortunately the only ways to do that are 1) being the dataset owner or 2) creating an entirely new version of the dataset (which then also requires a new task).

We should maybe consider giving a warning, but honestly this should probably be fixed at task creation (i.e., reject the target as invalid for a classification task if its feature type is string rather than nominal).
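
For illustration, the task-creation check suggested above might look roughly like the following. This is only a sketch: `check_classification_target` is a hypothetical helper, not part of openml-python or the OpenML server, and it assumes the feature type is available as a plain string such as "nominal" or "string".

```python
def check_classification_target(target_name: str, data_type: str) -> None:
    """Reject a classification target whose feature type is not nominal.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not an existing OpenML API.
    """
    if data_type != "nominal":
        raise ValueError(
            f"Target '{target_name}' has feature type '{data_type}'; "
            "a classification task requires a nominal target."
        )


# A nominal target passes silently; a string target is rejected.
check_classification_target("class", "nominal")
try:
    check_classification_target("class", "string")
except ValueError as err:
    print(err)
```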

PGijsbers commented 6 months ago

@LennartPurucker it would also be really useful if this could be in the next release. Something simple like:

elif (feature.name == target_name) and (feature.data_type == "string"):
    # Treat the string column as nominal: its unique values are the labels.
    df, *_ = self.get_data()
    return list(df[feature.name].unique())

I don't recall the attribute names off the top of my head, but this is close enough :)
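
A self-contained version of the same idea, for anyone who wants to try it outside the library. Everything here is an assumption made for the sketch: the `Feature` dataclass and `class_labels` function are hypothetical stand-ins rather than openml-python's actual classes, and the target column is passed as a plain iterable so the example runs without pandas (with a DataFrame one would pass something like `df[target_name]`).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass
class Feature:
    # Hypothetical stand-in for a dataset feature description.
    name: str
    data_type: str  # e.g. "nominal", "string", "numeric"
    nominal_values: list[str] | None = None


def class_labels(
    features: list[Feature], target_name: str, target_column: Iterable
) -> list[str] | None:
    """Return the class labels of the target, or None if they cannot be determined."""
    for feature in features:
        if feature.name != target_name:
            continue
        if feature.data_type == "nominal":
            # Labels are declared in the feature metadata.
            return feature.nominal_values
        if feature.data_type == "string":
            # Fallback: interpret the string column as nominal and derive
            # the labels from the distinct non-missing values in the data.
            return sorted({value for value in target_column if value is not None})
    return None


features = [Feature("x", "numeric"), Feature("y", "string")]
print(class_labels(features, "y", ["b", "a", None, "b"]))
```

Sorting the labels is a choice made here to keep the result deterministic; the snippet above instead returns them in order of first appearance.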

LennartPurucker commented 6 months ago

I pivoted to not throwing a warning, as we can safely interpret the string as nominal. Moreover, as you mentioned, this should be fixed at the task creation level / on the server side so that something like this cannot happen anymore or is fixed after the fact.