openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Dataset issues on the beta server #55

Closed mfeurer closed 10 years ago

mfeurer commented 10 years ago

Hi,

going through the datasets on the beta server, I found some issues with them:

berndbischl commented 10 years ago

Before we produce ANY more runs, all of this should be addressed very soon!

joaquinvanschoren commented 10 years ago

1) Wrong class values need to be fixed, and all current tasks and runs removed. molecular-biology_promoters also has the wrong target.

2-3) What's the problem with unique and quasi-unique identifiers? They are just irrelevant attributes?

4) Either document that we are picking a single class for classification, or remove the dataset?

mfeurer commented 10 years ago

2+3) A classifier like a decision tree can, for example, treat the ID as the most informative attribute. Of course this results in poor generalization, and the user can find that out by looking at the model (see the sketch below). I do not know how to deal with this properly; maybe ignore it and leave it to the people who use the dataset, or somehow mask the attribute.

4) I rather thought of making three tasks for this dataset.
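A minimal sketch of this failure mode, using synthetic data and scikit-learn (both are illustrative assumptions, not anything from the datasets discussed here): an unpruned decision tree uses a unique ID column to memorise the training set, which shows up as a large train/test gap.

```python
# Minimal sketch of the failure mode above (synthetic data, scikit-learn; both
# are illustrative assumptions): an unpruned decision tree uses a unique ID
# column to memorise the training set, which does not generalize.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 600
X_real = rng.randint(0, 2, size=(n, 3))                           # three weak binary features
y = np.where(rng.rand(n) < 0.75, X_real[:, 0], 1 - X_real[:, 0])  # noisy target (~75% tied to feature 0)
row_id = np.arange(n).reshape(-1, 1)                               # unique identifier column
X_with_id = np.hstack([row_id, X_real])

X_tr, X_te, y_tr, y_te = train_test_split(X_with_id, y, random_state=0)

with_id = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
without_id = DecisionTreeClassifier(random_state=0).fit(X_tr[:, 1:], y_tr)

# With the ID column the tree reaches ~100% training accuracy by splitting on
# the identifier; the drop on the test set is the bad generalization mentioned above.
print("with ID    train/test:", with_id.score(X_tr, y_tr), with_id.score(X_te, y_te))
print("without ID train/test:", without_id.score(X_tr, y_tr), without_id.score(X_te, y_te))
```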

joaquinvanschoren commented 10 years ago

2+3) Indeed. Datasets all have varying degrees of 'cleaning'. This would be a case where a feature selection step before modelling would help (see the sketch below). I don't think we should try to 'clean' all datasets. You could upload a derived dataset with fewer features if you like?

4) Yes, you are absolutely right :)
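One simple heuristic for such a feature selection step, sketched with pandas (the column names are hypothetical): treat columns whose share of distinct values is close to 1 as candidate (quasi-)unique identifiers and drop them before modelling.

```python
# Heuristic sketch (pandas; column names are hypothetical): columns whose share
# of distinct values is close to 1 are candidate (quasi-)unique identifiers and
# can be dropped before modelling.
import pandas as pd

def candidate_identifiers(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns where the fraction of distinct values exceeds `threshold`."""
    return [col for col in df.columns if df[col].nunique() / len(df) > threshold]

df = pd.DataFrame({
    "instance_id": range(100),                  # unique identifier
    "feature_a": [i % 3 for i in range(100)],   # ordinary feature
    "target": [i % 2 for i in range(100)],
})
to_drop = candidate_identifiers(df.drop(columns=["target"]))
X = df.drop(columns=["target"] + to_drop)       # modelling data without the identifier
print(to_drop)                                  # ['instance_id']
```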

berndbischl commented 10 years ago

1) Of course I agree with Joaquin; the data (and everything derived from it) is then clearly wrong.

2-3) I disagree with Joaquin. I haven't checked the datasets, but from your posts I guess we have these two (very common) cases:

a) Some "row-id" or other identifier is in the data. Which identifies the "observation" in that row. Maybe the picture or sound bite we derived the features from. Or the person we made the measurements on. This should stay in the data, because it can clearly be useful and we should not suppress or delete it. But for modelling it MUST be marked so it can be automatically removed.

b) In some cases we have an indicator of whether a row belongs to "train" or "test", e.g. the "Train_or_Test" attribute in http://openml.org/d/58. This must be removed or converted into the "original train/test split" of the task. (A short sketch of handling both cases follows at the end of this comment.)

Of course I cannot force or decide this, and I know that a) and b) are annoying and mean work that nobody likes to do. But from a user's perspective, I would never use or like OpenML if this is not cleaned up or available in the metadata.

That's why I bumped this (sorry for not responding earlier). I really see no use in running experiments if we don't have "reliable" data to run models on...
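A small pandas sketch of handling cases a) and b); only the attribute name "Train_or_Test" comes from the discussion above, the toy data and the other column names are hypothetical.

```python
# Toy pandas sketch of a) and b). Only the attribute name "Train_or_Test" is
# taken from the comment above; the other columns and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "row_id":        [0, 1, 2, 3],                        # a) identifier: keep in the data, mark for removal
    "Train_or_Test": ["Train", "Train", "Test", "Test"],  # b) split indicator, not a real feature
    "feature_1":     [0.1, 0.4, 0.2, 0.9],
    "class":         ["a", "b", "a", "b"],
})

id_columns = ["row_id"]  # marked so tooling can strip it automatically before modelling

# Convert the indicator into an explicit train/test split and drop it as a feature.
train = df[df["Train_or_Test"] == "Train"].drop(columns=id_columns + ["Train_or_Test"])
test  = df[df["Train_or_Test"] == "Test"].drop(columns=id_columns + ["Train_or_Test"])
print(len(train), len(test))  # 2 2
```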

joaquinvanschoren commented 10 years ago

1) Had a chat with Jan. It will be easier to change the default target feature and then build a new task for that dataset. We'll add an API call with which you can ask for all the 'default' tasks, i.e. all the tasks that use the default target feature (a client-side sketch of such a check follows below). We need this anyway, because users can already create new tasks (with non-default target features) through the website, and you probably don't want all of them.

2+3) Had a chat with Bernd. The idea is to flag these datasets, and all existing tasks and runs on them, as deprecated/suspicious, and to upload new versions of these datasets without the rogue features. We may also need a (manual/automatic) check on newly uploaded datasets, and mark trusted datasets as 'verified'.

4) I think we already agree on this one.
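For point 1), a rough sketch of how a client could make the 'default task' check itself, using the present-day openml-python package (which postdates this thread); the attribute names `target_name`, `dataset_id` and `default_target_attribute` are assumptions about that client, not something defined here.

```python
# Hedged sketch using today's openml-python client (which postdates this thread).
# The attribute names `target_name`, `dataset_id` and `default_target_attribute`
# are assumptions about that client's API.
import openml

def uses_default_target(task_id: int) -> bool:
    """True if the task's target equals the dataset's default target feature."""
    task = openml.tasks.get_task(task_id)
    dataset = openml.datasets.get_dataset(task.dataset_id)
    return task.target_name == dataset.default_target_attribute

# Keep only the 'default' tasks from some candidate list of task ids.
default_tasks = [tid for tid in [1, 2, 3] if uses_default_target(tid)]  # ids are placeholders
```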

berndbischl commented 10 years ago

vowel (http://openml.org/d/58): like I said, this has "Train_or_Test", which is like an "index" for splits. Anyway, you understand the problem.

berndbischl commented 10 years ago

vowel again: I looked into one of the referenced papers. (Maybe comments like this one should really go into a wiki-like comment page for the dataset.) I was unsure whether you are allowed to use Speaker_Number (the identity of the speaker) as a feature. In the paper I looked at: yes. They treated it as a "contextual feature", while the ones derived from sound are the "primary" features, and tried to exploit this extra information.

joaquinvanschoren commented 10 years ago

For the pseudo-identifiers: we have a few options:

Preferences?

mfeurer commented 10 years ago

I'm in favor of option 2. With option 1, I think it would be very confusing to have multiple datasets with the same name; at least I would need some time to figure out which of them to use.

joaquinvanschoren commented 10 years ago

OK, I've also been gravitating towards this solution. I'll do this as soon as possible.

Maybe it is best to add a button to the website to 'flag' an untrustworthy feature.

mfeurer commented 10 years ago

The 'flag' sounds good. I assume that openml.data.features will then return this?

joaquinvanschoren commented 10 years ago

Yes, that seems the best place for it. It could also be part of the dataset description, but I'm in favor of openml.data.features.
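A sketch of what reading that flag could look like against the REST data-features call; the JSON variant of the endpoint and the field names `is_ignore` / `is_row_identifier` reflect the present-day API and should be treated as assumptions relative to this thread.

```python
# Hedged sketch of consuming the flag from the data-features call. The JSON
# endpoint and the field names `is_ignore` / `is_row_identifier` reflect the
# present-day API and are assumptions relative to this thread.
import requests

data_id = 58  # the vowel dataset discussed above (id from the beta server)
resp = requests.get(f"https://www.openml.org/api/v1/json/data/features/{data_id}")
resp.raise_for_status()
features = resp.json()["data_features"]["feature"]

# Features a client should exclude from the model matrix by default.
flagged = [f["name"] for f in features
           if f.get("is_ignore") == "true" or f.get("is_row_identifier") == "true"]
print(flagged)
```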

berndbischl commented 10 years ago

Please just "flag" the features, do not upload multiple versions of the data sets. Although I dislike the name "untrustworthy". Maybe features should simply have a "type"?

Eg.: input, output, index?
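Purely as a hypothetical illustration of the "type" idea (nothing of this shape is defined in OpenML here), the roles could be modelled like this:

```python
# Hypothetical sketch of the "feature type" idea (input / output / index);
# nothing of this shape is defined in OpenML here, it is just one possible model.
from dataclasses import dataclass
from enum import Enum

class FeatureRole(Enum):
    INPUT = "input"    # ordinary predictor
    OUTPUT = "output"  # target / class attribute
    INDEX = "index"    # identifier or split indicator, excluded from modelling

@dataclass
class Feature:
    name: str
    data_type: str
    role: FeatureRole = FeatureRole.INPUT

features = [
    Feature("Train_or_Test", "nominal", FeatureRole.INDEX),
    Feature("Feature_0", "numeric"),
    Feature("Class", "nominal", FeatureRole.OUTPUT),
]
model_inputs = [f.name for f in features if f.role is FeatureRole.INPUT]
print(model_inputs)  # ['Feature_0']
```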

berndbischl commented 10 years ago

I also do not think that we currently need buttons to flag stuff like this. If we have

that seems enough for now? One can do so much already with that.

We already have many discussions on GH that could be in these "boxes". (I am not saying they should be there; we developers can discuss here just fine. I just mean it is obviously needed as a feature, especially for "normal" users.)

joaquinvanschoren commented 10 years ago

OK, sounds good. I was going to label them 'ignore'.

'index' sounds logical, but are they really indexes?

berndbischl commented 10 years ago

Labeling them as "ignore" is totally OK for me. It would be perfect if there were a short note in the description explaining why they were labeled like this...

berndbischl commented 10 years ago

And users need to see this state visually in the feature overview, so they are reminded that it is up to them to filter these features out before modelling.

mfeurer commented 10 years ago

@joaquinvanschoren thanks for all the work on checking the datasets. I double-checked at least the datasets which I had marked as wrong and which are now marked as safe, and found that for 46 and 185 the row_id_attribute is missing. For dataset 164 it's there.

UPDATE: I checked some of the tasks, and for task 2103 the target attribute seems to be wrong: it should be class instead of attribute57.
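For reference, a hedged sketch of how these checks could be re-run with the present-day openml-python client (which postdates this thread); the attribute names, and whether the beta-server ids 46/164/185 and task 2103 still resolve to the same objects, are assumptions.

```python
# Hedged sketch of re-running these checks with the present-day openml-python
# client (which postdates this thread). Attribute names, and whether the
# beta-server ids 46/164/185 and task 2103 still resolve to the same objects,
# are assumptions.
import openml

for did in (46, 164, 185):
    dataset = openml.datasets.get_dataset(did)
    print(did, "row_id_attribute:", dataset.row_id_attribute)

task = openml.tasks.get_task(2103)
print("task 2103 target:", task.target_name)  # expected: "class", not "attribute57"
```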

joaquinvanschoren commented 10 years ago

It is now possible to mark features that should be ignored as part of the new dataset edit feature. More on this later.

To close this issue, here is what I have done: