openml / openml-data

For tracking issues related to OpenML datasets

binarize pm10 is not binarized on the target attribute #58

Open amueller opened 9 months ago

amueller commented 9 months ago

The binarized version of pm10 has binaryClass as its target, which seems to be a discretization of day from version 1 of the dataset. The target in version 1, however, is pm10_concentration, and trying to predict day seems a bit ... strange? At the very least, that should be mentioned in the dataset description.

Btw, scikit-learn currently returns the first active version of a dataset by default, while openml.org seems to show the latest version. I'm not sure if there's currently a way to show which versions are available for a given dataset.
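One possible way to see the available versions is to query the OpenML REST listing endpoint by dataset name. The sketch below is only that: it assumes the dataset is registered under the name "pm10", that the `data/list/data_name/.../status/all` filter path is accepted, and that the JSON response nests entries under `data` and `dataset` with `did`, `version` and `status` fields. The helper names (`versions_url`, `parse_versions`, `list_versions`) are made up for illustration.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the OpenML JSON API.
OPENML_API = "https://www.openml.org/api/v1/json"


def versions_url(name, status="all"):
    """Build the listing URL for every dataset registered under `name`."""
    return f"{OPENML_API}/data/list/data_name/{quote(name)}/status/{status}"


def parse_versions(payload):
    """Turn a listing response into sorted (version, dataset id, status) tuples.

    Assumes the payload shape {"data": {"dataset": [{...}, ...]}}.
    """
    datasets = payload["data"]["dataset"]
    return sorted((int(d["version"]), int(d["did"]), d["status"]) for d in datasets)


def list_versions(name):
    """Fetch and parse the version listing (needs network access)."""
    with urlopen(versions_url(name)) as resp:
        return parse_versions(json.load(resp))


# Usage (not executed here because it hits the network):
#   for version, did, status in list_versions("pm10"):
#       print(version, did, status)
```

Once the version numbers are known, passing an explicit `version=` to scikit-learn's `fetch_openml` avoids depending on whichever version is picked by default.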

ping @joaquinvanschoren who seems to have uploaded the binarized dataset.

It looks to me like the data was collected over two years (only in winter, I think, as there's a gap in the days), and in the binarized dataset the task is to figure out whether a measurement is from the first or the second year of data collection. I don't think that was the intention of the dataset creator.
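A quick way to check which column a binarized target was derived from is to test, for each numeric column, whether a single threshold separates the two classes perfectly. Below is a minimal sketch of that check. The function name `separating_threshold`, the labels "A"/"B" and the toy values are all invented for illustration and are not taken from the actual pm10 data.

```python
def separating_threshold(values, labels):
    """Return a threshold that splits the two labels perfectly, else None.

    After sorting by value, a clean threshold split shows exactly one
    label change, and the two values around that change must differ.
    """
    pairs = sorted(zip(values, labels))
    changes = [i for i in range(1, len(pairs)) if pairs[i][1] != pairs[i - 1][1]]
    if len(changes) != 1:
        return None
    i = changes[0]
    low, high = pairs[i - 1][0], pairs[i][0]
    if low == high:
        return None  # the same value carries both labels
    return (low + high) / 2


# Synthetic toy columns (NOT real pm10 values), mimicking two collection
# periods separated by a gap in `day`.
day = [10, 50, 120, 400, 450, 600]
pm10_concentration = [3.1, 2.0, 4.2, 2.5, 3.9, 1.8]
binary_class = ["A", "A", "A", "B", "B", "B"]

print(separating_threshold(day, binary_class))                 # 260.0
print(separating_threshold(pm10_concentration, binary_class))  # None
```

If the real binaryClass column behaves like the toy one, only day would yield a threshold, which would be consistent with the target having been binarized on day rather than on pm10_concentration.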