spatstat / spatstat.geom

Sub-package of spatstat defining spatial data and spatial/geometrical operations
7 stars 4 forks source link

`weighted.median` not in line with unweighted median on repeated entries #19

Closed SlowMo24 closed 5 months ago

SlowMo24 commented 5 months ago

The documentation of the weighted.median states that the resulting value will be

...a value m such that the total weight of data less than or equal to m is equal to half the total weight.

The documentation though does not state a source for that statement. To my understanding the equal to is questionable, the median should 'strictly' halve the dataset. A different algorithm specifically states that the equals to (if applicable) is applicable in both direction (up and down): https://rdrr.io/cran/matrixStats/man/weightedMedian.html (although the wording also is not 100% clear).

I know that this question is similar to the ones asked here https://github.com/spatstat/spatstat/issues/128 and here https://github.com/spatstat/spatstat.geom/issues/5 but the given solution does not fully answer my question.

In the following example, I expect the median and the weighted median to be equal. Where am I having a misunderstanding?

library(spatstat)
#> Lade nötiges Paket: spatstat.data
#> Lade nötiges Paket: spatstat.geom
#> spatstat.geom 3.2-9
#> Lade nötiges Paket: spatstat.random
#> spatstat.random 3.2-3
#> Lade nötiges Paket: spatstat.explore
#> Lade nötiges Paket: nlme
#> spatstat.explore 3.2-7
#> Lade nötiges Paket: spatstat.model
#> Lade nötiges Paket: rpart
#> spatstat.model 3.2-11
#> Lade nötiges Paket: spatstat.linnet
#> spatstat.linnet 3.1-5
#> 
#> spatstat 3.0-8 
#> For an introduction to spatstat, type 'beginner'
weighted.median(c(1,2,3), c(1,1,1))
#> [1] 1.5
median(c(1,2,3))
#> [1] 2

weighted.median(c(1,2,3), c(2,1,1))
#> [1] 1
median(c(1,1,2,3))
#> [1] 1.5

weighted.median(c(1,2,3,4), c(1,1,1,1))
#> [1] 2
median(c(1,2,3,4))
#> [1] 2.5

Created on 2024-04-29 with reprex v2.1.0

rolfTurner commented 5 months ago

The documentation of the weighted.median states that the resulting value will be

...a value m such that the total weight of data less than or equal to m is equal to half the total weight.

The documentation though does not state a source for that statement.


This is the definition used by the software. It is a defiinition, not a theorem. No "source" is needed.

The documentation goes on to say: "If there is no such value, then ...."

First example: weighted.median(c(1,2,3), c(1,1,1)) (Total weight)/2 = 3/2 = 1.5 First weight = 1; i.e. total weight of data less than or equal to 1 is 1, which is not equal to (is less than) 1.5 Sum of first two weights = 2; i.e. total weight of data less than or equal to 2 is 2 , which is not equal to (is greater than) 1.5. There is no m such that the total weight of data less than or equal to m is equal to 1.5. The two surrounding values are 1 and 2. When type=2 (the default) the average of these, i.e. 1.5 is returned.

Second example: weighted.median(c(1,2,3), c(2,1,1)) (Total weight)/2 = 4/2 = 2 First weight = 2 , i.e. [total weight of data less than or equal to 1] is 2 = (total weight)/2. The m, such that [the total weight of data less than or equal to m is equal to (total weight)/2], is 1.

Third example: weighted.median(c(1,2,3,4), c(1,1,1,1)) (Total weight)/2 = 4/2 = 2 First weight = 1; sum of first two weights = 2 , i.e. [total weight of data less than or equal to 2] is 2 = (total weight)/2. The m, such that [the total weight of data less than or equal to m is equal to (total weight)/2], is 2.

Thus weighted.median is performing exactly as advertised.

See also fortunes::fortune(85).

baddstats commented 5 months ago

Short answer: when checking for consistency you need to consider the argument type.

Long answer:

The important thing to understand is that there are several possible definitions of the sample median and sample quantiles, depending on the desired properties. See the article R. Hyndman and Y. Fan (1996) Sample quantiles in statistical packages. The American Statistician 50 (4) 361-365. Many people seem unwilling to accept this fact and insist that there must be a unique definition, which is simply false.

In base R, the median function stats::median.default calculates the sample median of numerical values according to the classical definition (going back to the 16th century), namely, if the number n of data values is odd (n=2k+1) then the median is the _k_th smallest value, while if n is even (n=2k) then the median is the average of the _k_th and _k+1_st smallest values. It does not have an option to compute other versions of the median.

In base R, the quantile function stats::quantile.default is able to calculate any of the 9 definitions or types of sample quantile that were documented by Hyndman and Fan. In particular you can get 9 different versions of the sample median, by setting probs=0.5 and choosing different values of type. This agrees with median.default if you set type = 4.

For the weighted median and weighted quantiles, again there are several different definitions. In package spatstat.geom the functions weighted.median and weighted.quantile use the same designation of types of quantiles as in stats::quantile.default. Types 1, 2 and 4 are currently implemented. The default is type=2 for weighted.median and type=4 for weighted.quantile.

Thus, when checking consistency, you would need to set the argument type, in each of the functions listed above (except median.default which implicitly has type=4).

SlowMo24 commented 5 months ago

Thank you for the clarification @baddstats . I was aware of the different types but misinterpreted which type I wanted.

I was looking for the behaviour of type=2 in stats::quantile but that seems to be different from the one spatstat::weighted.quantile has implemented:

for (data in list(list(data = c(1,2,3), weights = c(1,1,1), resolved_data=c(1,2,3)),
               list(data = c(1,2,3), weights = c(2,1,1), resolved_data=c(1,1,2,3)),
               list(data = c(1,2,3,4), weights = c(1,1,1,1), resolved_data=c(1,2,3,4)))){
  for (type in c(1,2,4)){
    sp <- spatstat.geom::weighted.quantile(data$data, data$weights, probs=c(0.5),type=type)
    ba <- stats::quantile(data$resolved_data, probs=c(0.5), type=type)
    ma <- matrixStats::weightedMedian(data$data, data$weights, interpolate = FALSE, ties="mean")
    cat("For type ", type, " and data ",data$data, " with weights ",data$weights, " spatstat gives ", sp, " matrixStats gives ", ma, " and base R gives ", ba, " as median\n")
  }
}
#> For type  1  and data  1 2 3  with weights  1 1 1  spatstat gives  2  matrixStats gives  2  and base R gives  2  as median
#> For type  2  and data  1 2 3  with weights  1 1 1  spatstat gives  1.5  matrixStats gives  2  and base R gives  2  as median
#> For type  4  and data  1 2 3  with weights  1 1 1  spatstat gives  1.5  matrixStats gives  2  and base R gives  1.5  as median
#> For type  1  and data  1 2 3  with weights  2 1 1  spatstat gives  1  matrixStats gives  1.5  and base R gives  1  as median
#> For type  2  and data  1 2 3  with weights  2 1 1  spatstat gives  1  matrixStats gives  1.5  and base R gives  1.5  as median
#> For type  4  and data  1 2 3  with weights  2 1 1  spatstat gives  1  matrixStats gives  1.5  and base R gives  1  as median
#> For type  1  and data  1 2 3 4  with weights  1 1 1 1  spatstat gives  2  matrixStats gives  2.5  and base R gives  2  as median
#> For type  2  and data  1 2 3 4  with weights  1 1 1 1  spatstat gives  2  matrixStats gives  2.5  and base R gives  2.5  as median
#> For type  4  and data  1 2 3 4  with weights  1 1 1 1  spatstat gives  2  matrixStats gives  2.5  and base R gives  2  as median

Created on 2024-05-03 with reprex v2.1.0

I therefore moved to https://rdrr.io/cran/matrixStats/man/weightedMedian.html which exposes my desired functionality.

Thank you for your work on this open library!

baddstats commented 5 months ago

Thank you for this example. It shows that there is indeed a bug in the implementation of weighted.quantile for the case type=2.

I have now fixed the bug so that the result of weighted.quantile(1:3, rep(1,3), probs=0.5, type=2) is 2.

Note that weighted.quantile has now moved from spatstat.geom to the new package spatstat.univar.

SlowMo24 commented 5 months ago

thank you, I can confirm that the issue is fixed in the latest snapshot of https://github.com/spatstat/spatstat.univar though https://github.com/spatstat/spatstat.univar/commit/0d98cd929d4aa2770b5a620bdb618196854e8ce9 : the values now align for all three approaches on type=2 .