openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Mistake in kurtosis (and maybe skewness) metafeatures? #297

Open PGijsbers opened 8 years ago

PGijsbers commented 8 years ago

Hi,

I have been working on computing my own meta-features, and when I tried to verify my Kurtosis and Skewness meta-features with those on the website (in particular for the iris dataset: openml.org/d/61) I came up with different results.

For the iris dataset, the mean of the kurtoses (MeanKurtosisOfNumeric) has a value of 32.79, which seems extremely high to me. Using scipy.stats.kurtosis, I get the following kurtosis values: -0.57, 0.24, -1.40, -1.33. The mean of these values is of course -0.77, nowhere close to 32.79. I looked at the code that calculates the kurtosis, but from a quick glance I couldn't find any mistakes. I also checked with Excel, and while it uses a slightly different version of kurtosis, its values were still very similar to mine.
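For reference, a minimal sketch of that check (assuming scikit-learn's bundled copy of iris; as the thread later shows, copies of iris can differ in two rows, so individual column values may deviate slightly):

```python
# Sketch of the verification described above: per-column kurtosis of iris
# via scipy.stats.kurtosis (excess/Fisher kurtosis by default), then the mean.
# Assumes scikit-learn's bundled iris, which may differ from the OpenML/UCI
# upload in two rows (see later in the thread).
from sklearn.datasets import load_iris
from scipy.stats import kurtosis

X = load_iris().data              # 150 x 4 numeric matrix
per_column = kurtosis(X, axis=0)  # excess kurtosis per attribute
print(per_column)                 # roughly -0.57, 0.2, -1.40, -1.34
                                  # (the 2nd value depends on which copy you have)
print(per_column.mean())          # negative, nowhere near 32.79
```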

My skewness results also differ; again I checked with Excel and scipy, and the skewness on OpenML might not be entirely correct either. The difference is smaller there (my/Excel skewness mean is roughly 0.06, while OpenML says 1.6), but it still seems big to me.

Any input is appreciated.

berndbischl commented 8 years ago

the kurtosis can never be < 0?

see here https://en.wikipedia.org/wiki/Kurtosis

this is what R computes

library(moments)
d = iris[, -5]
k = kurtosis(d)
print(k)
print(mean(k))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       2.426        3.181        1.604        1.664 

[1] 2.219

which is of course also completely different from the OML numbers you quoted ;-)

berndbischl commented 8 years ago

ok, scipy computes the excess kurtosis (Fisher's definition), so we have to subtract 3 from the R values....

> k - 3
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
     -0.5736       0.1810      -1.3955      -1.3361 
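The reconciliation above comes down to scipy's `fisher` flag; a quick sketch of the relationship (raw Pearson kurtosis minus 3 equals excess kurtosis):

```python
# scipy.stats.kurtosis returns excess kurtosis (fisher=True) by default;
# fisher=False gives the raw (Pearson) kurtosis, larger by exactly 3.
from scipy.stats import kurtosis

x = [4.9, 3.1, 1.5, 0.2, 4.9, 3.6, 1.4, 0.1]  # any numeric sample
raw = kurtosis(x, fisher=False)    # Pearson: m4 / m2**2, never negative
excess = kurtosis(x, fisher=True)  # Fisher: raw - 3, can be negative
assert abs((raw - 3) - excess) < 1e-9
```

This is why R's `moments::kurtosis` output had to have 3 subtracted before it matched the scipy numbers.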
berndbischl commented 8 years ago

now we (me and @PG-TUe ) still have a problem with the 2nd value. why is that different?

berndbischl commented 8 years ago

cool, apparently the iris data set from R (which i used) and the one on OML are not the same...


> j = c(35, 38)

# from R
> iris[j, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35          4.9         3.1          1.5         0.2  setosa
38          4.9         3.6          1.4         0.1  setosa

# from OML / did = 61
> od$data[j, ]
   sepallength sepalwidth petallength petalwidth       class
34         4.9        3.1         1.5        0.1 Iris-setosa
37         4.9        3.1         1.5        0.1 Iris-setosa

paging @joaquinvanschoren to this horrible thread

PGijsbers commented 8 years ago

Yes indeed, scipy.stats uses Fisher's definition by default; sorry for that omission (link to scipy docs).

Different datasets would make sense! At least for the difference between our results, though not for the OpenML-calculated feature. If the other three values come out right, it shouldn't be anything in the computation (generally speaking, of course).

This still leaves a problem for the actual OpenML-calculated value though, with a mean kurtosis of 32.79.

berndbischl commented 8 years ago

This still leaves a problem for the actual OpenML-calculated value though, with a mean kurtosis of 32.79.

yes and no, it also leaves the 2nd problem: why are the 2 data sets not the same? i don't see any reason at all how they could or should be different. i mean, it's "iris"....

berndbischl commented 8 years ago

ah cool

from wikipedia

https://en.wikipedia.org/wiki/Iris_flower_data_set

(scroll down)

"Fisher's Iris Data". (Contains two errors which are documented). UCI Machine Learning Repository: Iris Data Set.

(click link)

This data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick '@' espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
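Applying the documented errata is mechanical; a sketch using the row/feature positions quoted above (1-based rows 35 and 38, `fix_uci_iris` is a name invented here for illustration):

```python
# Patch the two documented UCI iris errors in place.
# `rows` is assumed to be the dataset as a list of
# [sepal_length, sepal_width, petal_length, petal_width, class] records,
# loaded elsewhere; list indices below are 0-based (the errata are 1-based).
def fix_uci_iris(rows):
    rows[34][3] = 0.2   # 35th sample: fourth feature should be 0.2
    rows[37][1] = 3.6   # 38th sample: second feature should be 3.6
    rows[37][2] = 1.4   # 38th sample: third feature should be 1.4
    return rows
```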

berndbischl commented 8 years ago

OML imported the (erroneous) version from UCI....

PGijsbers commented 8 years ago

You beat me to it! It might be good to upload a corrected version then, as iris is so often used for examples.

berndbischl commented 8 years ago

@joaquinvanschoren

great chance to now try out whether it's possible to use the "problem here, flag a data set as faulty" mechanism....

i would have just never guessed that we need it already for iris .... :(

PS:

things like these are the reason why i requested this so often

PGijsbers commented 8 years ago

I suppose the UCI link was in the description of the dataset... but yeah, I would not have expected that. Also, I did submit an issue on the site, then removed it (I thought I had done something wrong), and now I can't submit an issue anymore (technically it looks like it lets me submit one, but the site doesn't respond).

joaquinvanschoren commented 8 years ago

Interesting! Will upload the new version. From which package did you get the 'correct' iris?

I also noticed a bug in the issue reporting system, will look at this asap (although I'm on holiday :)).

Cheers, Joaquin

On Thu, 14 Jul 2016 at 23:44, Bernd Bischl notifications@github.com wrote:

@joaquinvanschoren

great chance to now try out whether it's possible to use the "problem here, flag a data set as faulty" mechanism....

i would have just never guessed that we need it already for iris .... :(

PGijsbers commented 8 years ago

Good to hear, but don't overwork yourself! Anyhow, this still leaves the issue with the calculated kurtosis open.

PGijsbers commented 8 years ago

Actually, it looks like some other meta-features are not up to snuff either. Looking at dataset 62, the zoo dataset (link), it only has one numerical feature, the number of legs (which probably should have been nominal, but never mind that). This feature only takes the values 0, 2, 4, 5, 6 and 8, yet MeanMeansOfNumericAtts is 100.14.

I also started working on information-theoretic meta-features, and while ClassEntropy seems correct, my MeanAttributeEntropy feature also differs, though I haven't dug in to check whether that is on my end. All of the features that go wrong so far are aggregate statistics; I am not sure if that helps any.
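For what it's worth, the aggregate statistics in question are simple to spell out; a sketch of how I'd read MeanMeansOfNumericAtts and ClassEntropy from their names (an assumption about the definitions, not OpenML's actual code):

```python
# Sketch of two of the meta-features discussed, as read from their names.
# These definitions are assumptions for illustration, not OpenML's code.
import math
from collections import Counter

def mean_means_of_numeric_atts(columns):
    """Mean over attributes of each numeric attribute's mean."""
    return sum(sum(c) / len(c) for c in columns) / len(columns)

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# zoo-style sanity check: if the only numeric attribute takes values like
legs = [0, 2, 4, 5, 6, 8]
print(mean_means_of_numeric_atts([legs]))  # then the aggregate is ~4.17,
                                           # nowhere near 100.14
```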

berndbischl commented 8 years ago

Interesting! Will upload the new version. From which package did you get the 'correct' iris?

http://www.rdocumentation.org/packages/datasets/versions/3.3.1/topics/iris

Will upload the new version.

i would be interested in the process of "flagging" the other data set now: what that implies and what happens. can you please outline this here? this seems like a good use case now.