PGijsbers opened 8 years ago
the kurtosis can never be < 0, can it?
see here: https://en.wikipedia.org/wiki/Kurtosis
this is what R computes:

```r
library(moments)
d = iris[, -5]
k = kurtosis(d)
print(k)
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#>        2.426        3.181        1.604        1.664
print(mean(k))
#> [1] 2.219
```
which is of course also completely different to the OML numbers you quoted ;-)
ok, scipy computes the excess kurtosis by Fisher's definition, so we have to subtract 3....

```r
> k - 3
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
     -0.5736       0.1810      -1.3955      -1.3361
```
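To make the two conventions concrete: a minimal sketch (using made-up sample values, not the actual iris columns) showing that scipy's default is Fisher's excess kurtosis, which is exactly Pearson's kurtosis (what R's `moments::kurtosis` reports) minus 3.

```python
# scipy.stats.kurtosis defaults to Fisher's definition (excess kurtosis).
# Passing fisher=False gives the Pearson value, as in R's moments package.
import numpy as np
from scipy.stats import kurtosis

x = np.array([4.9, 3.1, 1.5, 0.2, 4.9, 3.6, 1.4, 0.1])  # arbitrary sample

excess = kurtosis(x)                 # Fisher (default): Pearson - 3
pearson = kurtosis(x, fisher=False)  # Pearson, comparable to R's output

print(np.isclose(pearson - excess, 3.0))  # True: the two differ by exactly 3
```

So any comparison between R's numbers and scipy's must first add or subtract 3, which is what happened above.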
now we (me and @PG-TUe) still have a problem with the 2nd value: why is that different?
cool, apparently the iris data set from R (which I used) and the one on OML are not the same...

```r
> j = c(35, 38)
# from R
> iris[j, ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
35          4.9         3.1          1.5         0.2  setosa
38          4.9         3.6          1.4         0.1  setosa
# from OML / did = 61
> od$data[j, ]
   sepallength sepalwidth petallength petalwidth       class
34         4.9        3.1         1.5        0.1 Iris-setosa
37         4.9        3.1         1.5        0.1 Iris-setosa
```
paging @joaquinvanschoren to this horrible thread
Yes indeed, scipy.stats uses Fisher by default, sorry for that omission (link to scipy docs).
Different datasets make sense! At least it explains our difference, though not the one in the OpenML calculated feature. It probably wouldn't be anything in the computation if the other three values come out right (generally speaking, of course).
This still leaves a problem for the actual OpenML calculated value though, with a mean kurtosis of 32.79.
yes and no, it also leaves the 2nd problem: why are the 2 data sets not the same? I don't see any reason at all how they could or should be different. I mean, it's "iris"....
ah cool
from Wikipedia (scroll down):
https://en.wikipedia.org/wiki/Iris_flower_data_set
"Fisher's Iris Data". (Contains two errors which are documented). UCI Machine Learning Repository: Iris Data Set.
(click link)
This data differs from the data presented in Fishers article (identified by Steve Chadwick, spchadwick '@' espeedaz.net ). The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features.
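The erratum quoted above can be checked mechanically. A small sketch, using only the two rows from the UCI note (feature numbers are 1-indexed: sepal length, sepal width, petal length, petal width):

```python
# The two documented errors in the UCI iris file, per the erratum above.
uci = {
    35: (4.9, 3.1, 1.5, 0.1),   # UCI value; 4th feature is wrong
    38: (4.9, 3.1, 1.5, 0.1),   # UCI value; 2nd and 3rd features are wrong
}
fisher = {
    35: (4.9, 3.1, 1.5, 0.2),   # as stated in Fisher's article
    38: (4.9, 3.6, 1.4, 0.1),
}

for row in (35, 38):
    diffs = [i + 1 for i, (a, b) in enumerate(zip(uci[row], fisher[row])) if a != b]
    print(f"sample {row}: feature(s) {diffs} differ")
```

This prints feature 4 for sample 35 and features 2 and 3 for sample 38, matching both the erratum and the R-vs-OML diff shown earlier in the thread.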
OML imported the (erroneous) version from UCI....
You beat me to it! It might be good to upload a corrected version then, as it is so often used in examples.
@joaquinvanschoren
great chance to now try out whether it's possible to use the "problem here, flag a data set as faulty" mechanism....
I would just never have guessed that we'd need it for iris already .... :(
PS:
things like this are the reason why I requested this so often
I suppose the UCI link was in the description of the dataset... but yeah, I would not have expected that. Also, I did submit an issue on the site, then removed it (thought I did something wrong), and now I can't submit an issue anymore (well, technically it looks like it lets me submit one, but the site doesn't respond).
Interesting! Will upload the new version. From which package did you get the 'correct' iris?
I also noticed a bug in the issue reporting system, will look at this asap (although I'm on holiday :)).
Cheers, Joaquin
On Thu, 14 Jul 2016 at 23:44, Bernd Bischl notifications@github.com wrote:
> @joaquinvanschoren
> great chance to now try out whether its possible to use the "problem here, flag a data set as faulty" mechanism....
> i would have just never guessed that we need it already for iris .... :(
Good to hear, but don't overwork yourself! Anyhow, this still leaves open the issue with calculated kurtosis.
Actually, it looks like some other meta-features are not up to snuff either. When we look at dataset 62, the zoo dataset (link), we see it has only one numerical feature, the number of legs (it probably should have been nominal, but never mind that). This feature only takes the values 0, 2, 4, 5, 6 and 8, yet MeanMeansOfNumericAtts is 100.14.
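A quick bound check makes the point, using only the legs values listed above: the mean of a feature can never exceed its maximum value, so the mean of means over a single numeric attribute bounded by 8 cannot be 100.14.

```python
# Sanity check for the zoo dataset's single numeric attribute ("legs"):
# the mean of a feature is bounded by its min and max values, so
# MeanMeansOfNumericAtts can never exceed 8 here.
legs_values = [0, 2, 4, 5, 6, 8]   # the only values this feature takes

upper_bound = max(legs_values)
reported = 100.14                  # value shown on OpenML for dataset 62

print(reported <= upper_bound)     # False -> the reported value is impossible
```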
I also started working on information-theoretic meta-features, and while the ClassEntropy seems correct, my MeanAttributeEntropy feature also differs, though I haven't dug in yet to check whether that is on my end. All the features that go wrong so far are aggregate statistics; I'm not sure if that helps.
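For reference, here is one common way these two features are defined; this is a sketch of the usual Shannon-entropy formulation (base 2, over value frequencies), not necessarily OpenML's exact implementation, and the sample data is made up:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# ClassEntropy: entropy of the target column.
# MeanAttributeEntropy: mean entropy over the nominal attribute columns.
target = ["a", "a", "b", "b"]
attributes = [["x", "x", "x", "y"], ["p", "q", "p", "q"]]

class_entropy = shannon_entropy(target)  # 1.0 bit for a 50/50 split
mean_attr_entropy = sum(map(shannon_entropy, attributes)) / len(attributes)
```

If two implementations disagree here, the usual suspects are the log base (bits vs nats) and how numeric attributes are discretized before computing entropy.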
> From which package did you get the 'correct' iris?

http://www.rdocumentation.org/packages/datasets/versions/3.3.1/topics/iris
> Will upload the new version.

I would be interested in the process of "flagging" the other data set now: what does that imply, and what happens? Can you please outline this here? This seems like a good use case.
Hi,
I have been working on computing my own meta-features, and when I tried to verify my Kurtosis and Skewness meta-features with those on the website (in particular for the iris dataset: openml.org/d/61) I came up with different results.
For the iris dataset, the mean of the kurtosis values (MeanKurtosisOfNumeric) is 32.79, which seems extremely high to me. Using scipy.stats.kurtosis, I get the following kurtosis values: -0.57, 0.24, -1.40, -1.33. Their mean is of course -0.77, nowhere close to 32.79. I looked at the code that calculates the kurtosis, but I couldn't find any mistakes from a quick glance. I also checked with Excel, and while it uses a slightly different definition of kurtosis, its values were very similar to mine.
My skewness results also differ; again I checked with Excel and scipy, and the skewness on OpenML might not be entirely correct either. The values differ less (my/Excel skewness mean is roughly 0.06, while OpenML says 1.6), but it still seems like a big difference to me.
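One likely source of small skewness disagreements, as with kurtosis, is the estimator convention: scipy's default is the biased (population) moment estimator, while Excel's SKEW() applies a sample-size adjustment. A sketch with made-up values (not the iris data):

```python
# scipy.stats.skew defaults to the biased (population) estimator g1.
# bias=False applies the sample-size correction, closer to Excel's SKEW().
import numpy as np
from scipy.stats import skew

x = np.array([1.0, 2.0, 2.0, 3.0, 9.0])  # arbitrary right-skewed sample

g1 = skew(x)               # biased estimator, scipy default
G1 = skew(x, bias=False)   # sample-adjusted estimator

print(g1, G1)              # same sign, slightly different magnitudes
```

A convention mismatch like this would explain small differences between tools, but not a gap the size of 0.06 vs 1.6.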
Any input is appreciated.