openml / openml-data

For tracking issues related to OpenML datasets

Sarcos data (44976) contains duplicates #61

Open sebffischer opened 7 months ago

sebffischer commented 7 months ago

Unfortunately (my mistake), there are two versions of the sarcos data on OpenML: https://www.openml.org/search?type=data&status=active&id=44976 and https://www.openml.org/search?type=data&status=active&id=43873.

The first contains the duplicates from the test set, while the latter does not. The first was also accidentally used by the CTR-23 suite.
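
For anyone who wants to check the difference between the two versions, here is a minimal sketch. It assumes the `openml` Python package is installed and that `get_data()` returns a pandas DataFrame as its first element. The helper names `count_duplicate_rows` and `load_rows` are made up for illustration, and rows containing NaN values may not be matched as duplicates by this simple approach.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Return how many rows repeat an earlier, identical row."""
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values())


def load_rows(dataset_id):
    """Download an OpenML dataset and return its rows as plain tuples.

    Assumption: openml-python is installed and get_data() yields a
    pandas DataFrame as the first element of its return value.
    """
    import openml  # imported lazily so the counter above has no dependencies

    dataset = openml.datasets.get_dataset(dataset_id)
    frame, *_ = dataset.get_data()
    return list(frame.itertuples(index=False, name=None))


# Example usage (requires network access, so it is left commented out):
# for dataset_id in (44976, 43873):
#     rows = load_rows(dataset_id)
#     print(dataset_id, len(rows), count_duplicate_rows(rows))
```

If the description above holds, the count should be nonzero for 44976 and zero for 43873, but this has not been verified here.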

PGijsbers commented 7 months ago

What do you suggest we do? You own 44976, so you could choose to deactivate it and instead link 43873 to CTR-23.

sebffischer commented 7 months ago

No, I think the suite should stay as it is. Matthias suggested that I open an issue here. When we create a new version of CTR-23, this should simply be corrected, I guess. Do you think I should close this issue then?

PGijsbers commented 7 months ago

You could also choose to deactivate the dataset but keep it in the suite. Direct downloading is unaffected by deactivation, but it should function as a clear signal to others not to use the dataset, and it won't show up when listing datasets with an active filter. If you do experience issues, deactivating a dataset can be reversed (by administrators, but you know how to reach us :)).