tnagler / wdm-r

Weighted Dependence Measures

Spearman bug with weights? #8

Open AndriSignorell opened 1 year ago

AndriSignorell commented 1 year ago

Hi Thomas, I suspect there's a bug in your Spearman code when using weights. From my understanding, the following should hold:

# Example taken from: http://www.math.wpi.edu/saspdf/stat/chap28.pdf
# pp. 1349

library(DescTools)

Pain <- as.table(matrix(c(26, 26, 23, 18, 9, 6, 7, 9, 14, 23), 
                        nrow=5, 
                        dimnames = list(Dose = c("0", "1", "2", "3", "4"), 
                                        Adverse = c("No", "Yes")))) 

Desc(Pain, verb=3)

# consistent with the SAS results
with(Untable(Pain), cor(N(Adverse), N(Dose), method = "spearman"))
DescTools::SpearmanRho(Pain)

# correct:
with(Untable(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "spearman"))

# ******************************
# wrong:
with(as.data.frame(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "spearman", weights = Freq))
# ***********************

# all correct (and consistent with SAS):
with(Untable(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "pearson"))
with(as.data.frame(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "pearson", weights = Freq))

with(Untable(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "kendall"))
with(as.data.frame(Pain),
     wdm::indep_test(as.numeric(Adverse), as.numeric(Dose), 
                     method = "kendall", weights = Freq))

tnagler commented 1 year ago

This is a conceptual discrepancy. In general, weights for observations do not necessarily correspond to frequency counts. For example, when estimating a conditional Spearman's rho, you would upweight observations in some neighborhood of the covariate value and downweight all others. The weights can be any positive real number, not just integers, so they can't in general be interpreted as repeated observations. In contrast, the FREQ procedure in SAS is specifically intended for frequency tables.
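To illustrate the kind of non-integer weights meant here, the following is a minimal sketch of a kernel-weighted (conditional) Spearman's rho. The simulated data, the Gaussian kernel, the covariate value `z0`, and the bandwidth `h` are all assumptions made for illustration; only the call `wdm::wdm(x, y, method = "spearman", weights = w)` comes from the package, and it is skipped if wdm is not installed.

```r
set.seed(1)
n <- 500
z <- runif(n)                            # covariate
x <- rnorm(n)
y <- z * x + sqrt(1 - z^2) * rnorm(n)    # dependence between x and y grows with z

# Assumed settings: condition on z being close to z0, with bandwidth h.
z0 <- 0.8
h  <- 0.1
w  <- dnorm((z - z0) / h)                # positive, real-valued kernel weights

# The weights are not integers, so they cannot be read as frequency counts.
if (requireNamespace("wdm", quietly = TRUE)) {
  print(wdm::wdm(x, y, method = "spearman", weights = w))
}
```

Observations with `z` near `z0` dominate the estimate, while distant ones contribute almost nothing; no row is ever "repeated".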

In your specific example, SAS interprets the frequency table as a larger data set with repeated observations (and, hence, many ties). wdm takes the frequency table as just ten observations (one per table cell) with some weight assigned; none of them is repeated. For Spearman's rho, we get a different result because the "mid-rank" is computed differently (repeated observations in SAS vs. none in wdm). Also, wdm's independence test isn't useful here because it is based on an asymptotic approximation with just ten observations (see the p-values, also for Kendall's tau and Pearson correlation).
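The "weight = count" reading can be sketched in base R without touching wdm. The helper `freq_midrank()` below is hypothetical (it is not part of wdm or DescTools): it assigns each table cell the mid-rank its value would receive if the cell were repeated `Freq` times, and a frequency-weighted Pearson correlation of those mid-ranks then reproduces Spearman's rho on the expanded data.

```r
# Frequency table from the issue: 5 dose levels x 2 outcomes.
pain <- as.table(matrix(c(26, 26, 23, 18, 9, 6, 7, 9, 14, 23),
                        nrow = 5,
                        dimnames = list(Dose = c("0", "1", "2", "3", "4"),
                                        Adverse = c("No", "Yes"))))
cells <- as.data.frame(pain)             # one row per cell, counts in Freq
x <- as.numeric(cells$Dose)
y <- as.numeric(cells$Adverse)
w <- cells$Freq

# Hypothetical helper: mid-rank of each value if every row were repeated w times.
freq_midrank <- function(v, w) {
  vals  <- sort(unique(v))
  tot   <- vapply(vals, function(u) sum(w[v == u]), numeric(1))
  below <- cumsum(tot) - tot             # observations strictly below each value
  mid   <- below + (tot + 1) / 2         # average rank within each tie group
  mid[match(v, vals)]
}

rx <- freq_midrank(x, w)
ry <- freq_midrank(y, w)

# Frequency-weighted Pearson correlation of the mid-ranks.
rho_freq <- cov.wt(cbind(rx, ry), wt = w / sum(w), cor = TRUE)$cor[1, 2]

# Reference: expand the table into individual observations.
expanded <- cells[rep(seq_len(nrow(cells)), times = w), ]
rho_expanded <- cor(as.numeric(expanded$Dose), as.numeric(expanded$Adverse),
                    method = "spearman")

all.equal(rho_freq, rho_expanded)
```

The two values agree because ranking the expanded data with average ties yields exactly these mid-ranks. How wdm computes its weighted ranks internally is not shown here.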

I see why the "weight = count" perspective can be useful though and might work this into the package at some point.

AndriSignorell commented 1 year ago

Thanks for the clarification! The option would indeed be a welcome addition.