Open ArlindKadra opened 4 years ago
Were you trying to create the task through the web interface or a programmatic interface (and if so, which)?
It was through the web interface. Is it possible to change a feature type, or does the dataset have to be reuploaded?
I'm not sure. As we're moving towards a new web interface soon, we won't change the old task form now. However we should update this dataset.
There was a pretty simple query that showed that at least dozens if not hundreds of datasets had the wrong target type. There might be an issue for it in this repo but I can also try to figure out the query if you want to fix them.
This is related: #20 #18
Yes, that is what I was actually asking: whether this fix could be done on the backend or whether the datasets would have to be reuploaded, at least for the simple cases where the only problem is a binary target feature given as numerical.
Also related to #30
I think to change the target type you need a new version, right? That is sort-of the purpose of versions. Though actually I don't think any clear meaning of versions is defined since you create a new version just by reusing an existing name, so they could also be completely unrelated (and sometimes are).
Well, yes, although in my opinion a new version means a dataset that has possibly been modified or cleaned, where instances or features can be removed. In this case, the previous versions, where the target type is numerical and not nominal, are not correct.
Also, not sure if I remember correctly, but I think the drop-down button on the right at the dataset page used to show the different versions of that dataset. If so, it seems to be disabled now or maybe not working anymore.
@ArlindKadra that has been broken for a while, I think @joaquinvanschoren said it's not worth fixing for the old website? And we can disable the old broken datasets if we add new, correct versions.
Ah, I see, thanks for the info @amueller. I reuploaded the dataset with the correct target feature: https://www.openml.org/d/42397 I will do this for a few more datasets. I think this "problem" will keep happening, and most users will not notice it if they are using pandas to read csv files, since bool variables, or categorical ones that are not strings, will be read with a numerical type.
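A minimal sketch of the pandas behaviour described above, using a made-up three-row CSV (the column names `feature` and `class` are assumptions for illustration). It shows that a 0/1 target column is read with a numerical dtype, and one possible way to turn it into a categorical column before uploading:

```python
import io

import pandas as pd

# Hypothetical CSV with a binary target stored as 0/1.
csv_text = "feature,class\n0.5,0\n1.2,1\n0.7,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# pandas infers a numerical dtype for the target column.
print(pd.api.types.is_numeric_dtype(df["class"]))

# Cast to string first so the categories are labels rather than numbers,
# then to the pandas categorical dtype.
df["class"] = df["class"].astype(str).astype("category")
print(df["class"].dtype)
```

Whether the upload path then records the column as nominal depends on the tool used for uploading, so the resulting feature type is still worth checking on the dataset page.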
I guess we could go through each dataset, check whether the target feature is numerical and has only 2 unique values, and if so reupload the same dataset with the target type as bool. For categorical target features it gets a bit more tricky, since we would need a threshold on the number of unique values that also takes the number of instances into account.
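The check proposed above could be sketched roughly as follows. This is only an illustration in plain Python, not an existing OpenML script: the function names and the thresholds `max_unique` and `max_ratio` are hypothetical, and missing values are assumed to be represented as `None`.

```python
def looks_like_binary_target(values):
    """True if all non-missing values are numbers and exactly two are distinct."""
    present = [v for v in values if v is not None]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    return is_numeric and len(set(present)) == 2


def looks_like_nominal_target(values, max_unique=10, max_ratio=0.05):
    """Rough heuristic: few distinct values, both in absolute terms and
    relative to the number of instances. Both thresholds are hypothetical."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    n_unique = len(set(present))
    return n_unique <= max_unique and n_unique / len(present) <= max_ratio


print(looks_like_binary_target([0, 1, 1, 0, None]))
print(looks_like_nominal_target([1, 2, 3] * 100))
```

Datasets flagged this way would still need a manual look, since a numerical target with few distinct values can also be a legitimate regression target.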
@mfeurer had a testing repo that checked datasets for validity, I think. I'm not sure if this is still running. Having some form of check on the datasets would certainly be good.
@ArlindKadra did you deactivate the old version?
@amueller I cannot deactivate it as I am not the owner of the dataset and I get this exception.
openml.exceptions.OpenMLServerException: Dataset is not owned by you - None
However, you can.
@amueller Some time ago we worked on the same thing with @janvanrijn to validate datasets and to flag problematic ones on a daily/weekly basis. Not sure if it is running, but if it is, we could extend it with one more check for these datasets.
Hey, the scripts @amueller is referring to are at https://github.com/openml/openml-serverdata-quality-bot
No, I'm not running them any more right now.
Hey guys,
I was looking at the following dataset: https://www.openml.org/d/42175
It seems that the class feature is not given correctly as it should be nominal and not numerical. I was trying to create a classification task based on that and it was failing. I am guessing it is because it is not a nominal feature, however, no warning or error is given. It just fails.