vega / datalib

JavaScript data utility library.
http://vega.github.io/datalib/
BSD 3-Clause "New" or "Revised" License
732 stars 133 forks source link

The wrong skew measure? #90

Open mcorrell opened 6 years ago

mcorrell commented 6 years ago

datalib currently calculates what is supposed to be Pearson's mode skewness, but does so as: (avg - med) / std

This is non-parametric skewness, rather than:

Pearson's mode skewness which is (avg - mode) / std

Or Pearson's median skewness, which is 3 * (avg - med) / std

So this is only giving correct values when med == mode.

Should I: 1) Rename the function to reflect that it's nonparametric skew. 2) Add a mode function, and then fix the modeskew function by calling mode. 3) Add a factor of 3 and rename the function to reflect that it's median skew. 4) Some combination of those things. 5) None of those things.

domoritz commented 6 years ago

I'd say mode skewness should compute the correct thing so 2. But we could keep the existing function around but with a different name, which is 1. And while you're at it, why not add Pearson's median skewness (3.)?

mcorrell commented 6 years ago

mode is a little tricky, and I think the current datalib map function might be more useful in 99% of cases. For instance, mode([1,2]) should, in principle, return the unordered set {1,2}, since the distribution has two modes. But it seems silly for mode([1,1]) to return {1} instead of 1. And then there's the question of nulls and undefineds and so on. I think it's useful for somebody to know that the mode of [1,null,null] is null, but that's a judgment call. In either event, we'd have a lot of cases where we would not have a single numerical mode, and so mode skewness would be undefined.

I'm now leaning towards an option not on my list, computing skewness using the third standardized moment, rather than having to mess with a mode function. But I can submit a PR with a mode function (that would just call map and skim off the elements with the highest count) and see what people think.

jheer commented 6 years ago

I would recommend renaming the function to something more appropriate (npskew?), but leave a deprecated reference to the existing function name modeskew to support backwards compatibility. I can add appropriate warnings / caveats to the documentation. (FWIW, this method was removed from Vega in version 3 onward, though obviously persists here in datalib.)

I don't think inclusion of mode is needed, given existing functionality. A new skew method for the third standardized moment seems reasonable to add.

mcorrell commented 6 years ago

Okay, I made a PR #91 that adds sample skew. I wasn't sure how to sunset modeskew exactly so I just punted on that.