openml / openml-data

For tracking issues related to OpenML datasets

Automated checks for incorrect target column types (data quality bot) #34

Open ArlindKadra opened 4 years ago

ArlindKadra commented 4 years ago

Hey guys,

I was looking at the following dataset: https://www.openml.org/d/42175

It seems that the class feature has the wrong type: it should be nominal, not numerical. I tried to create a classification task based on this dataset and it failed. I am guessing that is because the target is not a nominal feature, but no warning or error is given. It just fails.

PGijsbers commented 4 years ago

Were you trying to create the task through the web interface or a programmatic interface (and if so, which)?

ArlindKadra commented 4 years ago

It was through the web interface. Is it possible to change a feature type, or does the dataset have to be reuploaded?

PGijsbers commented 4 years ago

I'm not sure. As we're moving to a new web interface soon, we won't change the old task form now. However, we should update this dataset.

amueller commented 4 years ago

There was a pretty simple query that showed that at least dozens if not hundreds of datasets had the wrong target type. There might be an issue for it in this repo but I can also try to figure out the query if you want to fix them.

amueller commented 4 years ago

This is related: #20 #18

ArlindKadra commented 4 years ago

Yes, that is what I was actually asking: whether this fix could be done on the backend or whether the datasets would have to be reuploaded, at least for the simple cases that only have a binary target feature given as numerical.

Also related to #30

amueller commented 4 years ago

I think to change the target type you need a new version, right? That is sort-of the purpose of versions. Though actually I don't think any clear meaning of versions is defined since you create a new version just by reusing an existing name, so they could also be completely unrelated (and sometimes are).

ArlindKadra commented 4 years ago

Well, yes, although in my opinion a new version means a dataset that has possibly been modified or cleaned, where instances or features may have been removed. In this case, the previous versions, where the target type is numerical and not nominal, are simply not correct.

Also, not sure if I remember correctly, but I think the drop-down button on the right at the dataset page used to show the different versions of that dataset. If so, it seems to be disabled now or maybe not working anymore.

amueller commented 4 years ago

@ArlindKadra that has been broken for a while, I think @joaquinvanschoren said it's not worth fixing for the old website? And we can disable the old broken datasets if we add new, correct versions.

ArlindKadra commented 4 years ago

Ah, I see, thanks for the info @amueller. I reuploaded the dataset with the correct target feature (https://www.openml.org/d/42397) and will do this for a few more datasets. I think this "problem" will keep happening, and most users will not notice it if they use pandas to read CSV files, since bool variables or categorical ones that are not strings will get a numerical type.
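To illustrate the pandas behaviour mentioned above, here is a minimal sketch. The CSV content and column names are made up for the example; the point is only that a 0/1 class column is read as an integer column unless it is converted explicitly.

```python
import io

import pandas as pd

# Hypothetical toy data: "class" is a binary label stored as 0/1.
csv_text = (
    "age,income,class\n"
    "25,40000,0\n"
    "47,52000,1\n"
    "33,61000,1\n"
    "51,38000,0\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# pandas infers an integer dtype for the label column, so nothing in the
# DataFrame marks it as a class label.
inferred = df["class"].dtype

# Converting the column to a categorical dtype makes the intent explicit.
df["class"] = df["class"].astype("category")
converted = df["class"].dtype

print(inferred, "->", converted)
```

Doing a conversion like this before building the upload is one way to avoid the wrong target type; whether a given upload path picks up the categorical dtype automatically is something I would double check rather than assume.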

> There might be an issue for it in this repo but I can also try to figure out the query if you want to fix them.

I guess we could go through each dataset, check whether the target feature is numerical and has only 2 unique values, and if so reupload the same dataset with the target type set to bool. For categorical target features it gets a bit more tricky, since we would need a threshold on the number of unique values that also takes the number of instances into account.
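A rough sketch of the check described above, in plain Python. The function name, the type labels, and both thresholds (`max_classes`, `max_unique_ratio`) are placeholders I made up for illustration, not values anyone has agreed on, and the extra rule that skips targets with non-integer values is my own assumption.

```python
def suspect_target(values, declared_type, max_classes=20, max_unique_ratio=0.05):
    """Flag a numeric target that looks like it should be nominal.

    Returns "binary", "categorical", or None (nothing suspicious).
    All names and thresholds here are hypothetical.
    """
    if declared_type != "numeric":
        return None

    observed = [v for v in values if v is not None]
    if not observed:
        return None

    # Assumption: a target with non-integer values is a real regression target.
    if any(float(v) != int(v) for v in observed):
        return None

    n_unique = len(set(observed))

    # Simple case: exactly two distinct values.
    if n_unique == 2:
        return "binary"

    # Harder case: few distinct values relative to the number of instances.
    if 2 < n_unique <= max_classes and n_unique / len(observed) <= max_unique_ratio:
        return "categorical"

    return None


print(suspect_target([0, 1, 1, 0, 1], "numeric"))
print(suspect_target([i % 3 for i in range(300)], "numeric"))
print(suspect_target([0.5, 1.2, 3.3], "numeric"))
```

The binary case is cheap to automate. The categorical case will produce false positives (for example, integer ratings or counts that really are regression targets), so flagged datasets would still need a manual look before reuploading.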

amueller commented 4 years ago

@mfeurer had a testing repo that checked datasets for validity, I think. I'm not sure if this is still running. Having some form of check on the datasets would certainly be good.

amueller commented 4 years ago

@ArlindKadra did you deactivate the old version?

ArlindKadra commented 4 years ago

> @ArlindKadra did you deactivate the old version?

@amueller I cannot deactivate it because I am not the owner of the dataset. I get this exception: openml.exceptions.OpenMLServerException: Dataset is not owned by you - None

However, you can.

ArlindKadra commented 4 years ago

> @mfeurer had a testing repo that checked datasets for validity, I think. I'm not sure if this is still running. Having some form of check on the datasets would certainly be good.

@amueller Some time ago we worked on the same thing with @janvanrijn, to validate datasets and flag problematic ones on a daily/weekly basis. Not sure if it is still running, but if it is, we could extend it with one more check for these datasets.

mfeurer commented 4 years ago

Hey, the scripts @amueller is referring to are at https://github.com/openml/openml-serverdata-quality-bot

No, I'm not running them any more right now.