harrelfe / Hmisc

Harrell Miscellaneous

wtd.quantile misses the min #90

Open NicolasWoloszko opened 6 years ago

NicolasWoloszko commented 6 years ago

Using wtd.quantile on a weighted column (N = 90000), we get this inconsistency:

```
> min(imp$DI2000eq)
[1] -3993960
> centile_cut=wtd.quantile(imp$DI2000eq, weights = imp$HW0010, probs = seq(0, 1, 0.01), normwt=F)
> centile_cut
           0%            1%            2%            3%            4%            5%            6%            7%            8%            9%
-3.494166e+06  6.534453e+01  1.420000e+03  2.400000e+03  3.095238e+03  3.593949e+03  4.000000e+03  4.420000e+03  4.800000e+03  5.169554e+03
          10%           11%           12%           13%           14%           15%           16%           17%           18%           19%
 5.531415e+03  5.861055e+03  6.194661e+03  6.533333e+03  6.833333e+03  7.157366e+03  7.465169e+03  7.800000e+03  8.092000e+03  8.400000e+03
          20%           21%           22%           23%           24%           25%           26%           27%           28%           29%
 8.652000e+03  8.974359e+03  9.260000e+03  9.533202e+03  9.800000e+03  1.002000e+04  1.036558e+04  1.070524e+04  1.100000e+04  1.126897e+04
          30%           31%           32%           33%           34%           35%           36%           37%           38%           39%
 1.157366e+04  1.191000e+04  1.218000e+04  1.250000e+04  1.280000e+04  1.308200e+04  1.337662e+04  1.366667e+04  1.399100e+04  1.427554e+04
          40%           41%           42%           43%           44%           45%           46%           47%           48%           49%
 1.460000e+04  1.491000e+04  1.520000e+04  1.556667e+04  1.590094e+04  1.622066e+04  1.657819e+04  1.693794e+04  1.727152e+04  1.759000e+04
          50%           51%           52%           53%           54%           55%           56%           57%           58%           59%
 1.798000e+04  1.830000e+04  1.866667e+04  1.901000e+04  1.944444e+04  1.984000e+04  2.017000e+04  2.060800e+04  2.100000e+04  2.139018e+04
          60%           61%           62%           63%           64%           65%           66%           67%           68%           69%
 2.177872e+04  2.216673e+04  2.260000e+04  2.303905e+04  2.346100e+04  2.392479e+04  2.434667e+04  2.488650e+04  2.539000e+04  2.588889e+04
          70%           71%           72%           73%           74%           75%           76%           77%           78%           79%
 2.641667e+04  2.686667e+04  2.746667e+04  2.800000e+04  2.866762e+04  2.937000e+04  3.000000e+04  3.077163e+04  3.148000e+04  3.228000e+04
          80%           81%           82%           83%           84%           85%           86%           87%           88%           89%
 3.300000e+04  3.372222e+04  3.470000e+04  3.572800e+04  3.670000e+04  3.774667e+04  3.891133e+04  4.021566e+04  4.173913e+04  4.363000e+04
          90%           91%           92%           93%           94%           95%           96%           97%           98%           99%
 4.523000e+04  4.728000e+04  4.925000e+04  5.190000e+04  5.500000e+04  5.869498e+04  6.343019e+04  7.052509e+04  8.160000e+04  1.051400e+05
         100%
 5.088842e+06
```

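As you can see, the actual min is -3993960, whereas the 0% quantile returned by wtd.quantile is -3.494166e+06. This creates problems, for instance, when the quantiles are used as breaks in cut(): observations below the lowest break are not assigned to any bin.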
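To illustrate the cut() problem and one possible workaround, here is a minimal base R sketch. The data, the breaks, and the helper `pad_breaks()` are hypothetical and only stand in for `imp$DI2000eq` and `centile_cut`; this works around the symptom and is not a fix for wtd.quantile itself.

```r
# Hypothetical data and breaks, chosen so the outer breaks miss the extremes.
x <- c(-10, 1, 2, 3, 4, 5, 20)
breaks <- c(-5, 2, 4, 15)

# With the raw breaks, the extremes fall outside every interval and become NA.
binned_raw <- cut(x, breaks = breaks, include.lowest = TRUE)

# pad_breaks() is a hypothetical helper: it stretches the outer breaks
# so that they cover the full range of x.
pad_breaks <- function(breaks, x) {
  rng <- range(x, na.rm = TRUE)
  breaks[1] <- min(breaks[1], rng[1])
  breaks[length(breaks)] <- max(breaks[length(breaks)], rng[2])
  unique(breaks)  # cut() rejects duplicated breaks
}

binned_padded <- cut(x, breaks = pad_breaks(breaks, x), include.lowest = TRUE)

sum(is.na(binned_raw))     # 2
sum(is.na(binned_padded))  # 0
```

Padding changes the meaning of the first and last bins slightly, since they now extend past the estimated 0% and 100% quantiles, so it may not suit every analysis.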

harrelfe commented 6 years ago

I will need help with a code fix for this. It's best to work from the GitHub repository and then have the system send me a pull request.

joblolabinette commented 5 years ago

I haven't had the chance to look at the code, but the 0% quantile seems to be defined as the value at which the cumulative weight equals one. For the 100% quantile it's not clear, but it's not the max either.

Example code:

```
> test_wq <- as.data.frame(list(
+   values = c(1,2,2,2,3,3,3,3,3,3,3,4,4,4,5,5,5,5,6,6,8,8,8,8,8,9),
+   wt = c(0.1,.2,.2,.2,.3,.3,.3,.3,.3,.3,.3,.4,.4,.4,.1,.1,.1,.1,.6,.6,.1,.1,.1,.1,.1,.1)))
> wtd.quantile(test_wq$values, weights = test_wq$wt)
  0%  25%  50%  75% 100%
 3.0  3.3  4.0  5.8  8.2
```

While without weights:

```
> wtd.quantile(test_wq$values)
  0%  25%  50%  75% 100%
   1    3    4    6    9
```
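The cumulative-weight reading of the 0% quantile can be checked by hand in base R. The sketch below only tabulates the running weight for the example data and looks up where it first reaches one. The helper `first_reaching()` is hypothetical, and whether wtd.quantile computes its result this way internally is an assumption that this thread does not confirm.

```r
values <- c(1,2,2,2,3,3,3,3,3,3,3,4,4,4,5,5,5,5,6,6,8,8,8,8,8,9)
wt     <- c(0.1,.2,.2,.2,.3,.3,.3,.3,.3,.3,.3,.4,.4,.4,.1,.1,.1,.1,.6,.6,.1,.1,.1,.1,.1,.1)

# Total weight per distinct value, then the running total.
vals  <- sort(unique(values))
w_sum <- sapply(vals, function(v) sum(wt[values == v]))
cum_w <- cumsum(w_sum)
print(data.frame(value = vals, cum_weight = cum_w))

# first_reaching() is a hypothetical helper: the smallest value whose
# cumulative weight is at least `target`.
first_reaching <- function(vals, cum_w, target) {
  vals[which(cum_w >= target)[1]]
}

first_reaching(vals, cum_w, 1)               # 3, matching the reported 0% quantile
first_reaching(vals, cumsum(w_sum * 10), 1)  # 1, the true minimum
```

Under this reading the answer depends on the scale of the weights: with the original weights, the values 1 and 2 together carry less than one unit of weight and are skipped, while multiplying every weight by 10 makes the same rule land on the minimum.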