openml / openml-data

For tracking issues related to OpenML datasets

cylinder-bands is leaking target #59

Open amueller opened 8 months ago

amueller commented 8 months ago

cylinder-bands is leaking the target via the job_number column. As with #57, I think this column should be ignored, unless the leak is intentional (which seems strange). This dataset is part of the CC-18, so I wonder if there's a way to fix this.
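One rough way to check for this kind of leak is to ask how well the group identifier alone predicts the label. The sketch below is only an illustration: the rows are synthetic, not taken from cylinder-bands, and `group_majority_accuracy` is a hypothetical helper name. It is also an optimistic, in-sample number, so it is meaningful only when most groups contain several rows.

```python
from collections import Counter, defaultdict


def group_majority_accuracy(groups, labels):
    """Fraction of rows whose label equals the majority label of their group."""
    by_group = defaultdict(Counter)
    for g, y in zip(groups, labels):
        by_group[g][y] += 1
    # For each group, count the rows that carry that group's most common label.
    correct = sum(counts.most_common(1)[0][1] for counts in by_group.values())
    return correct / len(labels)


# Synthetic toy rows (assumption: NOT real cylinder-bands data).
job_number = [101, 101, 102, 102, 102, 103, 103]
target = ["band", "band", "noband", "noband", "noband", "band", "band"]

leak_score = group_majority_accuracy(job_number, target)
# Baseline: always predict the overall majority class.
baseline = Counter(target).most_common(1)[0][1] / len(target)
```

If `leak_score` sits far above `baseline`, the identifier column is carrying the label, which is the situation described here.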

Maybe a better way to address this would be to use grouped cross-validation, but that would require downstream benchmarks to be aware of the grouping and use the provided splits.
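For reference, grouped cross-validation means every row sharing a job_number lands on the same side of each split. scikit-learn provides this as `GroupKFold`; the stdlib-only sketch below just illustrates the idea. The function name `grouped_kfold`, the greedy balancing, and the toy ids are assumptions for illustration, not the splits OpenML provides.

```python
from collections import defaultdict


def grouped_kfold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    rows_by_group = defaultdict(list)
    for i, g in enumerate(groups):
        rows_by_group[g].append(i)
    if n_splits > len(rows_by_group):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest rows.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(rows_by_group, key=lambda g: -len(rows_by_group[g])):
        min(folds, key=len).extend(rows_by_group[g])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


# Synthetic group ids (assumption: NOT real cylinder-bands job numbers).
job_number = [101, 101, 102, 102, 102, 103, 103, 104]
splits = list(grouped_kfold(job_number, n_splits=2))
```

A benchmark that ignores the grouping and reshuffles rows would put the same job_number in both train and test, which is why the provided splits would have to be used.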

amueller commented 8 months ago

Hm, the description of CC-18 says "classification tasks on dense data set independent observations". Calling these independent observations seems a bit of a stretch in this case.