mastoffel / partR2

R package to partition R2 among predictors in Generalized linear mixed models
21 stars 3 forks source link

Inconsistent results between performance and partR2 #6

Closed tobadia closed 2 years ago

tobadia commented 2 years ago

Hey there,

An ongoing project has led me to attempt quantifying relative importance of covariates in a sort of strctured data analysis (in short, a binomial outcome with multiple layers of variations, so this involves a lot of nesting in some explanatory covariates). This led me to attempt to decompose the marginal R-squared measure of a final model. Upon investigating various questions, this led me to naïvely compare the outputs of performance::r2() and partR2::partR2(), and these seem to disagree even on the datasets from your vignette. Hence, this issue/discussion.

Replicable example :

## partR2:: and r2:: inconsistencies
library(partR2)
library(rptR)
library(lme4)
library(performance)

# Data
set.seed(12345)
data("biomass")
biomass$YearC <- biomass$Year - mean(biomass$Year)
biomass$TemperatureS <- scale(biomass$Temperature)
biomass$PrecipitationS <- scale(biomass$Precipitation)
biomass$SpeciesDiversityC <- biomass$SpeciesDiversity - mean(biomass$SpeciesDiversity)

# Model
modRecov <- glmer(cbind(Recovered, NotRecovered) ~ YearC + 
                    TemperatureS + PrecipitationS + SpeciesDiversityC + (1|Population),
                  data=biomass, family="binomial")

# Compare R2, ICC...
R2_Recov <- partR2(modRecov, partvars = c("TemperatureS", "PrecipitationS"),
                   R2_type = "marginal")
ICC_Recov <- rpt(cbind(Recovered, NotRecovered) ~ YearC + 
                   TemperatureS + PrecipitationS + SpeciesDiversityC + (1|Population),
                 data = biomass, 
                 datatype = "Binary", 
                 grname = c("Population"), 
                 nboot = 0, npermut = 0)

print(R2_Recov); print(r2(modRecov))
print(ICC_Recov); print(icc(modRecov))

Though both supposedly calculate R2 as per the 2017 paper from @itchyshin, the results differ for ICC and R2. The same goes for conditional and marginal values.

Any chance you'd have an idea where this originates from ?

Cheers Thomas

mastoffel commented 2 years ago

Hi Thomas,

thanks for raising this and for the reproducible example. There are a few things to unpack here, I'll try to go through them step by step:

We want to know the following variance components: var_fixed, var_random, var_dist (distribution-specific variance) to calculate R2 (and similarly R) like this:

R2_marginal = var_fixed / (var_fixed + var_random + var_dist)

1) The first key difference is that var_dist for binomial models with logit link is fixed at (pi^2)/3 in performance. For partR2 and rptR you can set this using expct='liability' but the default is expct=meanobs, which is just a slightly different way of calculating this variance component (see ?partR2 for details). So this is the first small difference.

2) For partR2 and rptR, we also decided to fit an observation-level random effect olre=TRUE by default for Poisson and binomial models with non-binary response variables. The OLRE accounts for overdispersion, and will change the results slightly too.

So, this means that performance::r2() and partR2::partR2() should give the same marginal R2 value when we adjust some parameters, but they are still not the same, and this seems to be because performance extracts or computes a different random effect variance component than partR2 (and indeed, lme4 itself) and I'm not sure what's going on there.

## partR2:: and r2:: inconsistencies
library(partR2)
library(lme4)
library(performance)
library(insight)

# Data
set.seed(12345)
data("biomass")
biomass$YearC <- biomass$Year - mean(biomass$Year)
biomass$TemperatureS <- scale(biomass$Temperature)
biomass$PrecipitationS <- scale(biomass$Precipitation)
biomass$SpeciesDiversityC <- biomass$SpeciesDiversity - mean(biomass$SpeciesDiversity)

# Model
modRecov <- glmer(cbind(Recovered, NotRecovered) ~ YearC + 
                      TemperatureS + PrecipitationS + SpeciesDiversityC + (1|Population),
                      data=biomass, family="binomial")

# Compare R2
# Set expct to (pi^2/3) and remove observation-level random effect
R2_Recov <- partR2(modRecov, partvars = c("TemperatureS", "PrecipitationS"),
                   R2_type = "marginal", expct = "liability", olre=FALSE)

# still not the same
print(R2_Recov); print(r2(modRecov))

# seems to be because the estimate for random effect variance is different in performance
# performance calculates R2 based on variance components from the insight package

# performance
insight::get_variance_random(modRecov) # 0.011

# partR2
partR2:::get_ran_var(modRecov) # 0.588

# lme4 VarCorr matches partR2 but not insight
as.numeric(lme4::VarCorr(modRecov)) # 0.588

# let's calculate R2s from its components 
vc_ins <- insight::get_variance(modRecov)
r2_ins <- vc_ins$var.fixed / (vc_ins$var.fixed + vc_ins$var.random + vc_ins$var.distribution) # 0.133

vc_part <- partR2:::get_var_comps(modRecov, overdisp_name = "none", expct = "liability")
r2_part <- vc_part$var_fix / (vc_part$var_fix + vc_part$var_ran + vc_part$var_res) # 0.116
tobadia commented 2 years ago

Wow, amazing. You've basically done what I had in mind (unpack the code and go check under the hood), but in no time. I guess knowing your own code definitely helps speeding this up!

Thanks for pointing out the two sources of difference. The pi^2/3 vs. residual variance in the random-effects compartment does make sense and I had missed that subtle tweak in the documentation. Why insight:: extracts a different variance than lme4:: however is an interesting question. Not aimed at you, though.

itchyshin commented 2 years ago

My quick thinking is that you need to add the observation-level random effect for the glmer model

This paper may help to know what is the observation level random effect

https://peerj.com/articles/1114/