[40706] Explanation of parity5_plus_5 and potential issues

amueller commented 9 months ago

Hey! So I'm trying to understand parity5_plus_5 and I'm a bit confused. It has 1124 rows, but the data counts up from 0 to 1023 in binary, so there are 100 duplicate rows. Is that on purpose? I know OpenML contains specific train/test splits and these might account for the duplication but a lot of people use the datasets without the splits, like me and @SamuelGabriel and @noahho.
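A minimal sketch of the row-count arithmetic, run on a synthetic stand-in rather than the real file. It assumes the 100 extra rows are repeats of 100 distinct patterns, which is not confirmed anywhere in this thread; the names `rows` and `all_patterns` are purely illustrative.

```python
from collections import Counter
from itertools import product
import random

# Synthetic stand-in, NOT the real file: all 1024 ten-bit patterns once,
# plus 100 of them repeated. Which rows repeat in the real data is unknown here.
rng = random.Random(0)
all_patterns = list(product((0, 1), repeat=10))
rows = all_patterns + rng.sample(all_patterns, 100)

counts = Counter(rows)
n_rows = len(rows)
n_unique = len(counts)
n_duplicates = n_rows - n_unique  # rows beyond the first copy of each pattern

print(n_rows, n_unique, n_duplicates)
```

On a real pandas DataFrame, `df.duplicated().sum()` should give the same "rows beyond the first copy" count.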

The dataset is marked as "validated" and uploaded by @PGijsbers, so he might know more. It would also be great to have a dataset description.

Thanks!

ps: is there a way to change dataset version in the OpenML website right now? I'm not sure I'm looking at the most recent version cc @joaquinvanschoren

PGijsbers commented 9 months ago

I think the "verified" status only means that the file was processed correctly by the server. I believe it corresponds to "active" on the old website. Assuming you are referring to dataset 40706, I uploaded that dataset from the PMLB. Based on their documentation, they don't have explicit train/test splits. Their Parity5+5 dataset has the same issues and also has no description. Therefore, I would assume it was an error on their end. It would probably be good to open an issue on their repository; hopefully they can address the issue (or give an explanation) and also double-check their other datasets.

@joaquinvanschoren Why was "active" changed to "verified"? I think "verified" might give the impression there is some kind of quality control here.

joaquinvanschoren commented 9 months ago

I kept getting questions from people about what 'active' means, and I always had to explain that it meant it was verified by some automated tests. If you have a better word for it, I'm happy to change it.

PGijsbers commented 9 months ago

If the only statuses are "in processing", "deactivated", or "active", why visibly show "active" (or "verified") on the website at all? When the dataset is "in processing" or "deactivated", the user should be informed, but for the expected status ("active") I don't think we need to show additional confirmation to the user.

joaquinvanschoren commented 9 months ago

It's also a filter option, so I guess we should have an intuitive name for all non-in-preparation, non-deactivated datasets. For 'de-activated' I think 'deprecated' is a better word.

amueller commented 9 months ago

hm maybe "valid"? Though I think active is not so bad. @PGijsbers do you want to follow up with them or should I?

amueller commented 9 months ago

Actually, I'm not sure what the dataset is. It clearly counts binary numbers, but I'm not sure if the left-most bit is 2^0 or if the right-most bit is 2^0. So there are at least two ways to decode it to an integer. But I don't see how to get from that integer to the class.
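To make the two readings concrete, here is a small sketch of both decodings. The function names and the example row are made up for illustration and are not taken from the dataset.

```python
def decode_msb_first(bits):
    """Treat the left-most bit as the most significant (right-most is 2^0)."""
    value = 0
    for b in bits:
        value = value * 2 + b
    return value

def decode_lsb_first(bits):
    """Treat the left-most bit as 2^0 (right-most is the most significant)."""
    return sum(b << i for i, b in enumerate(bits))

# Hypothetical row with only the two right-most bits set.
row = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)

print(decode_msb_first(row))  # 2^1 + 2^0 = 3
print(decode_lsb_first(row))  # 2^8 + 2^9 = 768
```

Both readings enumerate the same set of 1024 patterns, just in a different order.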

PGijsbers commented 9 months ago

Based on the name, I would assume it's two 5-bit integers which then get added? But even then I wouldn't know how to construct the class. If you have the time to follow up with PMLB, I would appreciate that a lot :)

amueller commented 9 months ago

I have a solution to the dataset, but no explanation:

data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2

I opened https://github.com/EpistasisLab/pmlb/issues/179
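A rough sketch of what that expression implies, using plain tuples instead of the real DataFrame. It assumes the columns are named Bit_0 through Bit_9, so that 'Bit_2' is position 2 of a row; that naming is an assumption, and the helpers `parity_label` and `flip` are illustrative only.

```python
from itertools import product

# Indices taken from the expression above. Assumption: the columns are named
# Bit_0 .. Bit_9, so 'Bit_2' is position 2 of each row tuple.
RELEVANT = (2, 3, 4, 6, 8)
IGNORED = tuple(i for i in range(10) if i not in RELEVANT)

def parity_label(bits):
    """Return 1 if an odd number of the relevant bits are set, else 0."""
    return sum(bits[i] for i in RELEVANT) % 2

def flip(bits, i):
    """Return a copy of bits with position i inverted."""
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

rows = list(product((0, 1), repeat=10))  # synthetic: every 10-bit pattern once

# Flipping an ignored bit never changes the label; flipping a relevant bit always does.
ignored_never_matter = all(
    parity_label(flip(r, i)) == parity_label(r) for r in rows for i in IGNORED
)
relevant_always_matter = all(
    parity_label(flip(r, i)) != parity_label(r) for r in rows for i in RELEVANT
)
print(ignored_never_matter, relevant_always_matter)
```

Under this rule the class is the parity of five of the ten bits, and the other five carry no information about it.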

amueller commented 9 months ago

I think the "plus" refers to the fact that there are just 5 bits that are noise and are being ignored. The dataset is solvable (as opposed to parity5, which doesn't seem solvable without extra knowledge) because once the model has figured out which columns to ignore, there are duplicates. So it's a dataset checking feature selection.
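The duplication argument can be sketched as follows, again on synthetic rows and assuming the relevant positions are the ones from the parity expression (with columns named Bit_0 through Bit_9, which is an assumption). The helper `project` is illustrative.

```python
from collections import Counter
from itertools import product

RELEVANT = (2, 3, 4, 6, 8)  # assumed positions, as in the parity expression

def project(bits):
    """Keep only the relevant bits, dropping the five noise columns."""
    return tuple(bits[i] for i in RELEVANT)

rows = list(product((0, 1), repeat=10))  # synthetic: every 10-bit pattern once

pattern_counts = Counter(project(r) for r in rows)
n_patterns = len(pattern_counts)        # distinct 5-bit patterns after projection
copies = set(pattern_counts.values())   # how often each pattern occurs

print(n_patterns, copies)
```

After projection, each of the 32 patterns appears 32 times, so a held-out row very likely has an identical projected copy in the training data. In a plain 5-bit parity table every pattern appears once, so a held-out row has no such copy.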

amueller commented 8 months ago

The original paper apparently doesn't mention the duplicate rows: https://github.com/EpistasisLab/pmlb/issues/179#issuecomment-1775955190

PGijsbers commented 8 months ago

Thanks for getting in touch and letting us know! I guess we can keep this version active with some kind of notice, and make a newer version of this dataset with duplicate rows removed?

amueller commented 8 months ago

That was my plan, though it's not gonna be that useful until dataset versions are visible again: https://github.com/openml/openml.org/issues/95

openml / openml-data

[40706] Explanation of parity5_plus_5 and potential issues #54