Closed: mfeurer closed this issue 10 years ago
Before we produce ANY more runs, all of this should be addressed very soon!
1) Wrong class values need to be fixed, and all current tasks and runs removed. molecular-biology_promoters also has the wrong target.
2-3) What is the problem with unique and quasi-unique identifiers? Aren't they just irrelevant attributes?
4) Either document that we are picking a single class for classification, or remove the dataset?
2+3) A classifier like a decision tree can, for example, consider the ID to be the most informative attribute. Of course this results in poor generalization, and the user can find this out by looking at the model. I do not know how to deal with this properly; maybe ignore it and leave it to the people who use the dataset, or somehow mask the attribute.
4) I rather thought of creating three tasks for this dataset.
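(Not from the thread, just to illustrate the point: a minimal sketch with synthetic data showing how a quasi-unique identifier lets a decision tree memorise the training set while generalising poorly.)

```python
# Synthetic illustration: a quasi-unique ID column plus labels that are pure
# noise. The tree splits on the ID, fits the training data perfectly, and
# generalises no better than chance.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 200
X = np.column_stack([np.arange(n),            # quasi-unique identifier
                     rng.normal(size=n)])     # one ordinary numeric feature
y = rng.randint(0, 2, size=n)                 # labels unrelated to either column

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))   # ~1.0: every training row is "identified"
print(tree.score(X_te, y_te))   # ~0.5: no generalisation
```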
2+3) Indeed. Datasets all have varying degrees of 'cleaning'. This would be a case where a feature selection step before modelling would help. I don't think we should try to 'clean' all datasets. You could upload a derived dataset with fewer features if you like?
4) Yes, you are absolutely right :)
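A rough sketch of such a cleaning/feature-selection step (the DataFrame `df`, the helper name, and the 95% threshold are assumptions, not anything OpenML prescribes):

```python
import pandas as pd

def drop_identifier_columns(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop columns that are (quasi-)unique per row, i.e. likely identifiers."""
    keep = [col for col in df.columns if df[col].nunique() / len(df) <= threshold]
    return df[keep]
```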
1) Of course I agree with Joaquin; the data (and everything derived from it) is then clearly wrong.
2-3) I disagree with Joaquin. I haven't checked the datasets, but from your posts I guess we have these two (very common) cases:
a) Some "row-id" or other identifier is in the data. Which identifies the "observation" in that row. Maybe the picture or sound bite we derived the features from. Or the person we made the measurements on. This should stay in the data, because it can clearly be useful and we should not suppress or delete it. But for modelling it MUST be marked so it can be automatically removed.
b) In some cases we have indices indicating whether a row belongs to the "train" or "test" set, e.g. "Train_or_Test" in http://openml.org/d/58. This must be removed / converted into an "Original Train/Test split" in the task (see the sketch after this comment).
Of course I cannot force / decide this, and I know that a) and b) are annoying and mean work that nobody likes to do. But from a user's perspective I would never use or like OpenML if this is not cleaned up / made available in the metadata.
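As mentioned in b), the split column has to become the task's train/test split and must stay out of the features. A minimal sketch with a toy stand-in for the vowel data (column names and the "Train"/"Test" values are assumptions; the real dataset may encode them differently):

```python
import pandas as pd

# Toy stand-in for http://openml.org/d/58; in the real data "Train_or_Test"
# plays exactly this role.
df = pd.DataFrame({
    "Feature_0": [0.1, 0.4, 0.3, 0.9],
    "Speaker_Number": [1, 2, 1, 2],
    "Class": ["hid", "hId", "hid", "hId"],
    "Train_or_Test": ["Train", "Train", "Test", "Test"],
})

# Recover the original split and keep the indicator out of the features.
train = df[df["Train_or_Test"] == "Train"].drop(columns=["Train_or_Test"])
test = df[df["Train_or_Test"] == "Test"].drop(columns=["Train_or_Test"])
```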
That's why I bumped this (sorry for not responding earlier). I really see no use in running experiments if we don't have "reliable" data to run models on.
1) Had a chat with Jan. It will be easier to change the default target feature and then build a new task for that dataset. We'll add an API call with which you can ask for all the 'default' tasks, i.e. all the tasks that use the default target feature. We need this anyway, because users can already create new tasks (with non-default target features) through the website, and you probably don't want all of them.
2+3) Had a chat with Bernd. The idea is to flag these datasets, and all existing tasks and runs on them, as deprecated/suspicious, and to upload new versions of these datasets without the rogue features. We may also need a (manual/automatic) check on newly uploaded datasets and mark trusted datasets as 'verified'.
4) I think we already agree on this one.
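For reference, a hedged sketch of how this could look with today's openml Python client (the exact API call discussed above may have ended up different; the `data_id` filter and the `target_feature` column in the task listing are assumptions that may need checking):

```python
import openml

data_id = 61  # just an example id
dataset = openml.datasets.get_dataset(data_id)

# List all tasks on this dataset and keep those that use the default target
# feature (column name assumed here).
tasks = openml.tasks.list_tasks(data_id=data_id, output_format="dataframe")
default_tasks = tasks[tasks["target_feature"] == dataset.default_target_attribute]
print(len(default_tasks))
```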
vowel (http://openml.org/d/58): like I said, this has "Train_or_Test", which is like an "index" for the splits. Anyway, you understand the problem.
vowel again: I looked into one of the referenced papers. (Maybe what I am commenting on here should really go into a wiki-like comment page for the dataset.) I was unsure whether you are allowed to use Speaker_Number (the identity of the speaker) as a feature. In the paper I looked at: yes. They treated it as a "contextual feature", while the ones derived from sound are the "primary" features, and tried to exploit this extra information.
For the pseudo-identifiers we have a few options:
1) upload new versions of the affected datasets without these features (which means multiple datasets with the same name);
2) flag the problematic features so tools can filter them out automatically.
Preferences?
I'm in favor of option 2. With option 1 I think it would be very confusing to have multiple datasets with the same name; at least I would need some time to figure out which one to use.
OK, I've also been gravitating towards this solution. I'll do this as soon as possible.
Maybe it is best to add a button to the website to 'flag' an untrustworthy feature.
The 'flag' sounds good. I assume that openml.data.features will then return this?
Yes, that seems the best place for it. It could also be part of the dataset description, but I'm in favor of openml.data.features.
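A sketch of how that check could look against the current REST API (field names are as exposed today and may differ from what was first implemented back then):

```python
import requests

# Feature metadata for dataset 58; is_ignore / is_row_identifier are the
# flags a client should honour before modelling (field names hedged).
resp = requests.get("https://www.openml.org/api/v1/json/data/features/58")
features = resp.json()["data_features"]["feature"]
flagged = [f["name"] for f in features
           if f.get("is_ignore") == "true" or f.get("is_row_identifier") == "true"]
print(flagged)
```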
Please just "flag" the features, do not upload multiple versions of the data sets. Although I dislike the name "untrustworthy". Maybe features should simply have a "type"?
E.g. input, output, index?
I also do not think that we currently need buttons to flag stuff like this. If we have
- a feature "type" in the metadata, and
- a free-text comment "box" for each dataset,
that seems enough for now? One can do so much already with that.
We have many discussions already on GitHub that could be in these "boxes". (I am not saying they should be there, we developers can discuss here just fine; I just mean it is obviously needed as a feature, especially for "normal" users.)
OK, sounds good. I was going to label them 'ignore'.
'index' sounds logical, but are they really indexes?
Labeling them as "ignore" is totally OK for me. It would be perfect if there were a short note in the description explaining why they were labeled like this.
And users need to see this state in the feature overview, so they are reminded that it is on them to filter these features out before modelling.
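With today's openml Python client this works roughly as sketched below (parameter names hedged): flagged features are excluded from the modelling data unless explicitly requested.

```python
import openml

dataset = openml.datasets.get_dataset(58)
print(dataset.ignore_attribute, dataset.row_id_attribute)

# Row ids and 'ignore' features are left out unless explicitly asked for.
X, y, _, names = dataset.get_data(
    target=dataset.default_target_attribute,
    include_row_id=False,
    include_ignore_attribute=False,
)
```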
@joaquinvanschoren thanks for all the work on checking the datasets. I double-checked at least the datasets which I had marked as wrong and which are now marked as safe, and found that for 46 and 185 the row_id_attribute is missing. For dataset 164 it's there.
UPDATE: I checked some of the tasks, and for task 2103 the target attribute seems to be wrong. It should be class instead of attribute57.
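A quick way to re-check both points with the Python client (a sketch; the dataset and task ids are the ones mentioned above, attribute names hedged):

```python
import openml

# row_id_attribute should be set for all three datasets.
for did in (46, 164, 185):
    print(did, openml.datasets.get_dataset(did).row_id_attribute)

# The target of task 2103 should be "class", not "attribute57".
print(openml.tasks.get_task(2103).target_name)
```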
It is now possible to mark features that should be ignored as part of the new dataset edit feature. More on this later.
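For completeness, the Python client also offers an edit call for this nowadays; a hedged sketch (function and parameter names as I understand them, requires an API key and edit rights on the dataset):

```python
import openml

# Mark a feature as 'ignore' on an existing dataset (id and feature name are
# placeholders; only the dataset owner / an admin can do this).
openml.datasets.edit_dataset(data_id=58, ignore_attribute=["Train_or_Test"])
```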
To close this issue, here is what I have done:
Hi,
going through the datasets on the beta server, I found some issues with them:
0 in nearly all cases. The other two possible values don't occur often enough to do proper cross-validation.