Open mcorrell opened 6 years ago
I'd say mode skewness should compute the correct thing so 2. But we could keep the existing function around but with a different name, which is 1. And while you're at it, why not add Pearson's median skewness (3.)?
mode
is a little tricky, and I think the current datalib map
function might be more useful in 99% of cases. For instance, mode([1,2])
should, in principle, return the unordered set {1,2}
, since the distribution has two modes. But it seems silly for mode([1,1])
to return {1}
instead of 1
. And then there's the question of null
s and undefined
s and so on. I think it's useful for somebody to know that the mode of [1,null,null]
is null,
but that's a judgment call. In either event, we'd have a lot of cases where we would not have a single numerical mode, and so mode skewness would be undefined.
I'm now leaning towards an option not on my list, computing skewness using the third standardized moment, rather than having to mess with a mode
function. But I can submit a PR with a mode
function (that would just call map
and skim off the elements with the highest count) and see what people think.
I would recommend renaming the function to something more appropriate (npskew
?), but leave a deprecated reference to the existing function name modeskew
to support backwards compatibility. I can add appropriate warnings / caveats to the documentation. (FWIW, this method was removed from Vega in version 3 onward, though obviously persists here in datalib.)
I don't think inclusion of mode
is needed, given existing functionality. A new skew
method for the third standardized moment seems reasonable to add.
Okay, I made a PR #91 that adds sample skew. I wasn't sure how to sunset modeskew
exactly so I just punted on that.
datalib currently calculates what is supposed to be Pearson's mode skewness, but does so as:
(avg - med) / std
This is non-parametric skewness, rather than:
Pearson's mode skewness which is
(avg - mode) / std
Or Pearson's median skewness, which is
3 * (avg - med) / std
So this is only giving correct values when
med == mode.
Should I: 1) Rename the function to reflect that it's nonparametric skew. 2) Add a
mode
function, and then fix themodeskew
function by callingmode.
3) Add a factor of3
and rename the function to reflect that it's median skew. 4) Some combination of those things. 5) None of those things.