Test reliability - Githubissues

Pozdniakov commented 5 years ago

Any ideas on how to calculate the reliability of test scores?

IanEisenberg commented 5 years ago

You can get the entire chain for each parameter.

# assume fitted_model is the TIRT_model$fit
components = rstan::extract(fitted_model)
# the eta estimate is just the posterior mean. This is the same thing you get from calling "predict"
# the first dimension indexes the posterior draws (chains are merged by default)
eta = apply(components$eta, c(2, 3), mean)
# you can also calculate the standard deviation
# (stored under a separate name so the means above are not overwritten)
eta_sd = apply(components$eta, c(2, 3), sd)

I'm sure there are better ways to do this, but they are all likely a function of interacting with the chains!
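
As a rough sketch of what "interacting with the chains" can look like, the snippet below summarises a simulated array that stands in for components$eta. The array layout (draws x persons x traits) and all names such as eta_draws are assumptions made for illustration. A real array would come from rstan::extract.

```r
# Simulated stand-in for components$eta: draws x persons x traits.
# (Hypothetical data, not output from a fitted TIRT model.)
set.seed(1)
n_draws <- 200
n_persons <- 5
n_traits <- 3
true_eta <- matrix(rnorm(n_persons * n_traits), n_persons, n_traits)
eta_draws <- array(NA_real_, dim = c(n_draws, n_persons, n_traits))
for (d in seq_len(n_draws)) {
  eta_draws[d, , ] <- true_eta + rnorm(n_persons * n_traits, sd = 0.3)
}

# Collapse the draw dimension: one summary per person-trait cell.
eta_mean <- apply(eta_draws, c(2, 3), mean)
eta_sd <- apply(eta_draws, c(2, 3), sd)
# 95% posterior interval; the result has dimensions 2 x persons x traits.
eta_ci <- apply(eta_draws, c(2, 3), quantile, probs = c(0.025, 0.975))
```

Any other per-person summary (median, interval width, and so on) can be obtained the same way by swapping the function passed to apply.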

paul-buerkner commented 5 years ago

You can use the predict function to extract estimates and then compute the SD on that basis, which is an estimate of the RMSE and well calibrated (at least for the Stan implementation) according to our paper (doi:10.1177/0013164419832063).

simon-buettner commented 5 years ago

If you're looking for an X reliability value (something more psychologists could work with), empirical reliability might be of use. It's sometimes used in IRT contexts and is implemented in the 'mirt' package by Phil Chalmers as empirical_rxx. For some context, you could look here, where he explains the difference between marginal and empirical reliability: https://stats.stackexchange.com/questions/427631/difference-between-empirical-and-marginal-reliability-of-an-irt-model/428054#428054

Adapting this, you would need the Thetas and SEs as provided by predict, and then calculate this: reliability <- var(Theta)/(var(Theta) + colMeans(SE^2))
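
A minimal sketch of that calculation, done per trait, might look like the following. It assumes the scores are in a long data frame with columns named trait, estimate and se. Those column names, the helper empirical_reliability, and the simulated data are all illustrative assumptions, so check them against what predict actually returns for your model.

```r
# Empirical reliability per trait: var(theta) / (var(theta) + mean(se^2)).
# `scores` is assumed to be a long data frame with columns
# `trait`, `estimate`, and `se` (assumed names, not a documented API).
empirical_reliability <- function(scores) {
  by_trait <- split(scores, scores$trait)
  vapply(by_trait, function(d) {
    var_theta <- var(d$estimate)
    var_theta / (var_theta + mean(d$se^2))
  }, numeric(1))
}

# Simulated example (hypothetical data, not real TIRT output):
# trait "A" is measured more precisely than trait "B".
set.seed(42)
n <- 500
scores <- data.frame(
  id = rep(seq_len(n), times = 2),
  trait = rep(c("A", "B"), each = n),
  estimate = rnorm(2 * n),
  se = rep(c(0.3, 0.6), each = n)
)
rel <- empirical_reliability(scores)
```

Here rel is a named vector with one value per trait. Traits with smaller standard errors relative to the spread of the estimates get values closer to 1.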

However, I'm not sure whether this is really applicable in a TIRT context, especially in a Bayesian framework. Still, since it is just the ratio of the score variance to the score variance plus the mean squared standard error, I don't see why it shouldn't be useful for determining how well a TIRT test can differentiate between persons. If this doesn't make sense, I'd be glad to be corrected, though!

paul-buerkner commented 4 years ago

I am closing this issue since there seem to be several reasonable proposals put out here that can be used if required.

paul-buerkner / thurstonianIRT

Test reliability #16