openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Dataset issues on the beta server #55

Closed mfeurer closed 10 years ago

mfeurer commented 10 years ago

Hi,

going through the datasets on the beta server, I found some issues with them:

berndbischl commented 10 years ago

Before we produce ANY more runs, all of this should be addressed very soon!

joaquinvanschoren commented 10 years ago

1) Wrong class values need to be fixed, and all current tasks and runs removed. molecular-biology_promoters also has the wrong target.

2-3) What's the problem with unique and quasi-unique identifiers? They are just irrelevant attributes?

4) Either document that we are picking a single class for classification, or remove the dataset?

mfeurer commented 10 years ago

2+3) A classifier like a decision tree can, for example, treat the ID as the most informative attribute. Of course this results in poor generalization, and the user can find that out by looking at the model (see the sketch below). I do not know how to deal with this properly; maybe ignore it and leave it to the people who use the dataset, or somehow mask the attribute.

4) I rather thought of making three tasks for this dataset.
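A minimal sketch of this failure mode, using synthetic data and scikit-learn (both are illustrative assumptions, not anything from the datasets discussed here): an unpruned decision tree uses a unique ID column to memorise the training set, which shows up as a large train/test gap.

```python
# Minimal sketch of the failure mode above (synthetic data, scikit-learn; both
# are illustrative assumptions): an unpruned decision tree uses a unique ID
# column to memorise the training set, which does not generalize.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
n = 600
X_real = rng.randint(0, 2, size=(n, 3))                           # three weak binary features
y = np.where(rng.rand(n) < 0.75, X_real[:, 0], 1 - X_real[:, 0])  # noisy target (~75% tied to feature 0)
row_id = np.arange(n).reshape(-1, 1)                               # unique identifier column
X_with_id = np.hstack([row_id, X_real])

X_tr, X_te, y_tr, y_te = train_test_split(X_with_id, y, random_state=0)

with_id = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
without_id = DecisionTreeClassifier(random_state=0).fit(X_tr[:, 1:], y_tr)

# With the ID column the tree reaches ~100% training accuracy by splitting on
# the identifier; the drop on the test set is the bad generalization mentioned above.
print("with ID    train/test:", with_id.score(X_tr, y_tr), with_id.score(X_te, y_te))
print("without ID train/test:", without_id.score(X_tr, y_tr), without_id.score(X_te, y_te))
```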

joaquinvanschoren commented 10 years ago

2+3) Indeed. Datasets all have varying degrees of 'cleaning'. This would be a case where a feature selection step before modelling would help (see the sketch below). I don't think we should try to 'clean' all datasets. You could upload a derived dataset with fewer features if you like?

4) Yes, you are absolutely right :)
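One simple heuristic for such a feature selection step, sketched with pandas (the column names are hypothetical): treat columns whose share of distinct values is close to 1 as candidate (quasi-)unique identifiers and drop them before modelling.

```python
# Heuristic sketch (pandas; column names are hypothetical): columns whose share
# of distinct values is close to 1 are candidate (quasi-)unique identifiers and
# can be dropped before modelling.
import pandas as pd

def candidate_identifiers(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns where the fraction of distinct values exceeds `threshold`."""
    return [col for col in df.columns if df[col].nunique() / len(df) > threshold]

df = pd.DataFrame({
    "instance_id": range(100),                  # unique identifier
    "feature_a": [i % 3 for i in range(100)],   # ordinary feature
    "target": [i % 2 for i in range(100)],
})
to_drop = candidate_identifiers(df.drop(columns=["target"]))
X = df.drop(columns=["target"] + to_drop)       # modelling data without the identifier
print(to_drop)                                  # ['instance_id']
```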

berndbischl commented 10 years ago

1) Of course I agree with Joaquin; the data (and everything derived from it) is then clearly wrong.

2-3) I disagree with Joaquin. I haven't checked the datasets, but from your posts I guess we have these two (very common) cases:

a) Some "row-id" or other identifier is in the data. Which identifies the "observation" in that row. Maybe the picture or sound bite we derived the features from. Or the person we made the measurements on. This should stay in the data, because it can clearly be useful and we should not suppress or delete it. But for modelling it MUST be marked so it can be automatically removed.

b) In some cases we have an indicator of whether a row belongs to "train" or "test", e.g. the "Train_or_Test" attribute in http://openml.org/d/58. This must be removed or converted into the "original train/test split" of the task. (A short sketch of handling both cases follows at the end of this comment.)

Of course I cannot force or decide this, and I know that a) and b) are annoying and mean work that nobody likes to do. But from a user's perspective, I would never use or like OpenML if this is not cleaned up or available in the metadata.

That's why I bumped this (sorry for not responding earlier). I really see no use in running experiments if we don't have "reliable" data to run models on...
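A small pandas sketch of handling cases a) and b); only the attribute name "Train_or_Test" comes from the discussion above, the toy data and the other column names are hypothetical.

```python
# Toy pandas sketch of a) and b). Only the attribute name "Train_or_Test" is
# taken from the comment above; the other columns and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "row_id":        [0, 1, 2, 3],                        # a) identifier: keep in the data, mark for removal
    "Train_or_Test": ["Train", "Train", "Test", "Test"],  # b) split indicator, not a real feature
    "feature_1":     [0.1, 0.4, 0.2, 0.9],
    "class":         ["a", "b", "a", "b"],
})

id_columns = ["row_id"]  # marked so tooling can strip it automatically before modelling

# Convert the indicator into an explicit train/test split and drop it as a feature.
train = df[df["Train_or_Test"] == "Train"].drop(columns=id_columns + ["Train_or_Test"])
test  = df[df["Train_or_Test"] == "Test"].drop(columns=id_columns + ["Train_or_Test"])
print(len(train), len(test))  # 2 2
```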

joaquinvanschoren commented 10 years ago

1) Had a chat with Jan. It will be easier to change the default target feature and then build a new task for that dataset. We'll add an API call with which you can ask for all the 'default' tasks, i.e. all the tasks that use the default target feature (a client-side sketch of such a check follows below). We need this anyway, because users can already create new tasks (with non-default target features) through the website, and you probably don't want all of them.

2+3) Had a chat with Bernd. The idea is to flag these datasets, and all existing tasks and runs on them, as deprecated/suspicious, and to upload new versions of these datasets without the rogue features. We may also need a (manual/automatic) check on newly uploaded datasets, and mark trusted datasets as 'verified'.

4) I think we already agree on this one.
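For point 1), a rough sketch of how a client could make the 'default task' check itself, using the present-day openml-python package (which postdates this thread); the attribute names `target_name`, `dataset_id` and `default_target_attribute` are assumptions about that client, not something defined here.

```python
# Hedged sketch using today's openml-python client (which postdates this thread).
# The attribute names `target_name`, `dataset_id` and `default_target_attribute`
# are assumptions about that client's API.
import openml

def uses_default_target(task_id: int) -> bool:
    """True if the task's target equals the dataset's default target feature."""
    task = openml.tasks.get_task(task_id)
    dataset = openml.datasets.get_dataset(task.dataset_id)
    return task.target_name == dataset.default_target_attribute

# Keep only the 'default' tasks from some candidate list of task ids.
default_tasks = [tid for tid in [1, 2, 3] if uses_default_target(tid)]  # ids are placeholders
```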

berndbischl commented 10 years ago

vowel (http://openml.org/d/58): like I said, this has "Train_or_Test", which is like an "index" for splits. Anyway, you understand the problem.

berndbischl commented 10 years ago

vowel again: I looked into one of the referenced papers. (Maybe comments like this one should really go into a wiki-like comment page for the dataset.) I was unsure whether you are allowed to use Speaker_Number (the identity of the speaker) as a feature. In the paper I looked at: yes. They treated it as a "contextual feature", while the ones derived from sound are the "primary" features, and tried to exploit this extra information.

joaquinvanschoren commented 10 years ago

For the pseudo-identifiers: we have a few options:

Preferences?

mfeurer commented 10 years ago

I'm in favor of option 2. With option 1, I think it would be very confusing to have multiple datasets with the same name; at least I would need some time to figure out which of them to use.

joaquinvanschoren commented 10 years ago

OK, I've also been gravitating towards this solution. I'll do this as soon as possible.

Maybe it is best to add a button to the website to 'flag' an untrustworthy feature.

mfeurer commented 10 years ago

The 'flag' sounds good. I assume that openml.data.features will then return this?

joaquinvanschoren commented 10 years ago

Yes, that seems the best place for it. It could also be part of the dataset description, but I'm in favor of openml.data.features.
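A sketch of what reading that flag could look like against the REST data-features call; the JSON variant of the endpoint and the field names `is_ignore` / `is_row_identifier` reflect the present-day API and should be treated as assumptions relative to this thread.

```python
# Hedged sketch of consuming the flag from the data-features call. The JSON
# endpoint and the field names `is_ignore` / `is_row_identifier` reflect the
# present-day API and are assumptions relative to this thread.
import requests

data_id = 58  # the vowel dataset discussed above (id from the beta server)
resp = requests.get(f"https://www.openml.org/api/v1/json/data/features/{data_id}")
resp.raise_for_status()
features = resp.json()["data_features"]["feature"]

# Features a client should exclude from the model matrix by default.
flagged = [f["name"] for f in features
           if f.get("is_ignore") == "true" or f.get("is_row_identifier") == "true"]
print(flagged)
```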

berndbischl commented 10 years ago

Please just "flag" the features, do not upload multiple versions of the data sets. Although I dislike the name "untrustworthy". Maybe features should simply have a "type"?

Eg.: input, output, index?
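Purely as a hypothetical illustration of the "type" idea (nothing of this shape is defined in OpenML here), the roles could be modelled like this:

```python
# Hypothetical sketch of the "feature type" idea (input / output / index);
# nothing of this shape is defined in OpenML here, it is just one possible model.
from dataclasses import dataclass
from enum import Enum

class FeatureRole(Enum):
    INPUT = "input"    # ordinary predictor
    OUTPUT = "output"  # target / class attribute
    INDEX = "index"    # identifier or split indicator, excluded from modelling

@dataclass
class Feature:
    name: str
    data_type: str
    role: FeatureRole = FeatureRole.INPUT

features = [
    Feature("Train_or_Test", "nominal", FeatureRole.INDEX),
    Feature("Feature_0", "numeric"),
    Feature("Class", "nominal", FeatureRole.OUTPUT),
]
model_inputs = [f.name for f in features if f.role is FeatureRole.INPUT]
print(model_inputs)  # ['Feature_0']
```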

berndbischl commented 10 years ago

I also do not think that we currently need buttons to flag stuff like this. If we have

that seems enough for now? One can do so much already with that.

We already have many discussions on GH that could be in these "boxes". (I am not saying they should be there; we developers can discuss here just fine. I just mean it is obviously needed as a feature, especially for "normal" users.)

joaquinvanschoren commented 10 years ago

OK, sounds good. I was going to label them 'ignore'.

'index' sounds logical, but are they really indexes?

berndbischl commented 10 years ago

Labeling them as "ignore" is totally OK for me. It would be perfect if there were a short note in the description explaining why they were labeled like this...

berndbischl commented 10 years ago

And users need to see this state visually in the feature overview, so they are reminded that it is up to them to filter these features out before modelling.

mfeurer commented 10 years ago

@joaquinvanschoren thanks for all the work on checking the datasets. I double-checked at least the datasets which I had marked as wrong and which are now marked as safe, and found that for 46 and 185 the row_id_attribute is missing. For dataset 164 it's there.

UPDATE: I checked some of the tasks, and for task 2103 the target attribute seems to be wrong: it should be class instead of attribute57.
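For reference, a hedged sketch of how these checks could be re-run with the present-day openml-python client (which postdates this thread); the attribute names, and whether the beta-server ids 46/164/185 and task 2103 still resolve to the same objects, are assumptions.

```python
# Hedged sketch of re-running these checks with the present-day openml-python
# client (which postdates this thread). Attribute names, and whether the
# beta-server ids 46/164/185 and task 2103 still resolve to the same objects,
# are assumptions.
import openml

for did in (46, 164, 185):
    dataset = openml.datasets.get_dataset(did)
    print(did, "row_id_attribute:", dataset.row_id_attribute)

task = openml.tasks.get_task(2103)
print("task 2103 target:", task.target_name)  # expected: "class", not "attribute57"
```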

joaquinvanschoren commented 10 years ago

It is now possible to mark features that should be ignored as part of the new dataset edit feature. More on this later.

To close this issue, here is what I have done: