openml / openml-data

For tracking issues related to OpenML datasets

Sarcos data (44976) contains duplicates #61

Open sebffischer opened 11 months ago

sebffischer commented 11 months ago

Unfortunately (my mistake), there are two versions of the Sarcos data on OpenML: https://www.openml.org/search?type=data&status=active&id=44976 and https://www.openml.org/search?type=data&status=active&id=43873.

The first (44976) contains the duplicates from the test set, while the latter (43873) does not. Also, the first was accidentally used by the CTR-23 suite.
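
For anyone who wants to check a local copy for repeated rows, here is a minimal sketch using only the standard library. The helper name `count_duplicate_rows` and the toy rows are made up for illustration; they are not Sarcos values, and loading the actual data (for example with the openml Python package) is left out.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row (hypothetical helper)."""
    # Tuples are hashable, so each row can be used as a Counter key.
    counts = Counter(tuple(row) for row in rows)
    # A row seen n times contributes n - 1 duplicates.
    return sum(n - 1 for n in counts.values() if n > 1)


# Toy rows for illustration only, not taken from the Sarcos data.
toy_rows = [
    (0.1, 0.2, 1.0),
    (0.3, 0.4, 2.0),
    (0.1, 0.2, 1.0),  # exact repeat of the first row
]

print(count_duplicate_rows(toy_rows))
```

Note that this only catches exact matches; rows that differ by floating-point rounding would need a tolerance-based comparison.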

PGijsbers commented 11 months ago

What do you suggest doing? You own 44976, so you could deactivate it and link 43873 to CTR-23 instead.

sebffischer commented 11 months ago

No, I think the suite should stay as it is. Matthias suggested that I open an issue here. When we release a new version of CTR-23, this should just be corrected, I guess. Do you think I should close this issue then?

PGijsbers commented 11 months ago

You could also deactivate the dataset but keep it in the suite. Direct downloading is unaffected by deactivation, but it should act as a clear signal to others not to use the dataset, and it won't show up when listing datasets with an active filter. If you do run into issues, deactivating a dataset can be reversed (by administrators, but you know how to reach us :)).