harrelfe / Hmisc

Harrell Miscellaneous
Other
208 stars 81 forks source link

Zero weights are not equivalent to omitting a value with wtd.quantile() #81

Closed nalimilan closed 6 years ago

nalimilan commented 6 years ago

I'm not sure this is actually a bug, but I would naively have expected that frequency weights passed to wtd.quantile were equivalent to repeating a value the corresponding number of times. As shown below, this assumption works for non-zero weights, but not for zero weights:

> quantile(c(2, 3, 3, 4))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> wtd.quantile(c(2,3,3,4), c(1,1,1,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> wtd.quantile(c(2,3,4), c(1,2,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
# Gives a different result
> wtd.quantile(c(1,2,3,4,5), c(0,1,2,1,0), seq(0, 1, by=.25))
   0%   25%   50%   75%  100% 
2.000 2.750 3.000 3.375 4.500 

This also happens with other values of the type argument.

Is there a particular reason for this? Would you have a reference about the implemented algorithm?

harrelfe commented 6 years ago

It makes sense to me that those with zero or negative weight should be removed from the dataset before doing anything. Let me know if you see a problem with that strategy, otherwise I'll make the change for the next release.

nalimilan commented 6 years ago

I don't see any problem with removing them beforehand. But I'm surprised the algorithm isn't equivalent to removing them explicitly, since it's equivalent to repeating each value a number of times corresponding to its weight (for non-zero integer weights at least).

nalimilan commented 6 years ago

BTW, this sounds surprising to me:

> wtd.quantile(c(7, 1, 2, 4, 10, 15), c(1, 1/3, 1/3, 1/3, 1, 1))
   0%   25%   50%   75%  100% 
 4.00  6.25  8.50 11.25 15.00 
> wtd.quantile(c(7, 1, 2, 4, 10, 15), c(1, 1/3, 1/3, 1/3, 1, 1), normwt=T)
   0%   25%   50%   75%  100% 
 2.00  7.00  8.50 13.75 15.00 

Shouldn't the 0th quantile always be equal to the minimum?

harrelfe commented 6 years ago

Good question. Sorry I don't have time to diagnosis this. Hope you can.

nalimilan commented 6 years ago

FWIW we're working on an implementation for Julia at https://github.com/JuliaStats/StatsBase.jl/pull/316/, so you could take inspiration from there (though I'm still not sure what's the best approach but @matthieugomez knows better).

harrelfe commented 6 years ago

For the upcoming release on CRAN I am excluding zero weighted observations up front in wtd.quantile.

nalimilan commented 6 years ago

Thanks! I confirm that the examples above work fine now.