rbcavanaugh / pnt

Beta version of an R Shiny implementation of the computer-adaptive Philadelphia Naming Test (Roach et al., 1996), using item response theory.
https://william-hula.shinyapps.io/pnt-cat/
GNU General Public License v3.0

Improve scoring method due to shrinkage #28

Open william-hula opened 2 years ago

william-hula commented 2 years ago

Rob, Gerasimos, and Alex,

In sussing out the CI stuff, I discovered that the Bayesian shrinkage we're getting with extreme score estimates may be too large to tolerate, especially in the context of the precision added by the SD = 10 prior. The concern is that a clinician might administer a 10-item short form in the acute setting, obtain 10 incorrect responses, and get a T-score (SEM) of 26.6 (5.01), then administer the full test a month later, on which the person also gets zero correct, leading to an estimate (SEM) of 17.7 (4.01). If they apply the correct math (which they probably won't, unless we do it for them), this will suggest a one-tailed 92% probability that the patient has gotten worse: the difference is 26.6 - 17.7 = 8.9 with a standard error of sqrt(5.01^2 + 4.01^2) ≈ 6.42, and a z of 8.9/6.42 ≈ 1.39 gives a one-tailed probability of about 0.92. Now arguably, that might be an appropriate conclusion under normal circumstances, but Gerasimos convinced me this is something to be concerned about. As I write this, I'm thinking that a quick Monte Carlo simulation with a constant, extremely low generating theta, comparing score estimates for the CAT-10 and the full test, might be in order.

In any case, if this shrinkage is too much to bear, options for addressing it include EAP with a uniform prior or ML estimation with fences, where the fences are two dummy items at the extremes of the ability range that are always administered and scored correct (for the low one) and incorrect (for the high one), to put some bounds on the ML procedure. Apparently it produces less error than EAP (https://journals.sagepub.com/doi/full/10.1177/0146621616631317). catR will implement this easily, and I'm currently playing around with it.
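For what it's worth, here's roughly what the fences approach looks like in catR; the bank, fence locations, and discriminations below are placeholders rather than our actual PNT values:

```r
# Minimal sketch of ML estimation with fences in catR (placeholder item
# parameters, not the actual PNT bank). Assuming T = 10*theta + 50, fences
# at T = 5 and 95 sit at theta = -4.5 and 4.5.
library(catR)

# hypothetical operational items: columns a, b, c, d
bank <- cbind(a = rep(1.2, 10), b = seq(-2, 2, length.out = 10), c = 0, d = 1)

# two dummy "fence" items at the extremes of the ability range, with
# discrimination set to a multiple (here 3x) of the bank's
fences <- cbind(a = 3 * 1.2, b = c(-4.5, 4.5), c = 0, d = 1)

# an all-incorrect pattern on the 10 real items, plus the low fence always
# scored correct (1) and the high fence always scored incorrect (0)
x <- c(rep(0, 10), 1, 0)

theta_mlef <- thetaEst(rbind(bank, fences), x, method = "ML", range = c(-5, 5))
10 * theta_mlef + 50  # back to the T-score metric
```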

rbcavanaugh commented 2 years ago

Hi Will - glad you're working this out. Just to clarify: the 10-item PNT that is currently in the app is just for testing purposes, and I had planned on removing it before the final release. Are you saying that you want to keep it in for situations like acute-care assessment?

william-hula commented 2 years ago

I'm open to others' thoughts on whether to keep a 10-item option. Given the existence of the 15-item BNT forms, I think letting people do 15 items is a good idea, especially since we have some evidence from our recent papers that our confidence intervals are reasonably accurate. I take your point about the limited utility of a score with a CI width of 20 points, but even that is probably good enough to tell you whether someone is more likely mild, moderate, or severe. And bear in mind, these extremely wide CIs will only occur for people at either extreme end of the ability spectrum, so if you get a score of 20 (95% CI: 10, 30) or even 30 (95% CI: 20, 40), it's a pretty safe bet the person has severe anomia. I will work on a blurb about marginal reliability.

I wonder if Gerasimos has any objections to using marginal reliability?
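For the blurb, one common empirical formulation (a sketch, not necessarily the form we'd report) is 1 - mean(SEM^2) / var(theta):

```r
# Sketch of marginal reliability from a batch of SEMs. With the normal(50, 10)
# prior, the latent variance on the T-score metric is 10^2 = 100.
marginal_reliability <- function(sem, latent_var = 100) {
  1 - mean(sem^2) / latent_var
}

# e.g., with hypothetical SEMs from a set of administrations:
# marginal_reliability(c(5.01, 4.01, 4.5))
```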

william-hula commented 2 years ago

Here are a couple of plots of preliminary results from 10- and 30-item CAT simulations comparing EAP scoring with MLE with fences (MLEF). EAP used a normal(50, 10) prior; the fences were at 5 and 95 with discrimination set to 3x the constant estimated discrimination. The theta-generating distribution was uniform(5, 95) with 1800 simulees, originally intended to give 200 each in 9 theta bands, but then I realized that catR divides the distribution up into deciles automatically, so that's what's plotted. A rough sketch of the setup is below.
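```r
# Rough, self-contained sketch of a fixed-length version of this comparison
# (the real runs used catR's CAT machinery; parameters here are placeholders).
# Internally theta is on the standard metric, with T = 10*theta + 50.
library(catR)

set.seed(1)
n_items <- 30
bank   <- cbind(a = 1.2, b = rnorm(n_items), c = 0, d = 1)   # placeholder bank
fences <- cbind(a = 3 * 1.2, b = c(-4.5, 4.5), c = 0, d = 1) # fences at T = 5, 95

thetas <- runif(1800, -4.5, 4.5)  # uniform(5, 95) on the T metric

res <- t(sapply(thetas, function(th) {
  x <- genPattern(th, bank)  # simulated item responses
  c(eap  = thetaEst(bank, x, method = "EAP",
                    priorDist = "norm", priorPar = c(0, 1),  # normal(50, 10) on T
                    parInt = c(-5, 5, 41)),
    mlef = thetaEst(rbind(bank, fences), c(x, 1, 0),
                    method = "ML", range = c(-5, 5)))
}))

# bias and RMSE of the estimates by decile of the generating theta
decile <- cut(thetas, quantile(thetas, 0:10 / 10), include.lowest = TRUE)
bias <- aggregate(as.data.frame(res) - thetas, list(decile = decile), mean)
rmse <- aggregate((as.data.frame(res) - thetas)^2, list(decile = decile),
                  function(e) sqrt(mean(e)))
```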

The upshot is that EAP is superior in the range of about 30 to 70, while MLEF is better in the tails, which is not unexpected.

[Figures: bias_plot and rmse_plot - bias and RMSE of the score estimates by generating-theta decile]

I'm currently running another set of simulations parallel to this one, but using a skew-normal theta distribution based on the empirical estimates from the latest sample of 335 MAPPD and R03 subjects. I'm running that one with 1000 simulees total.
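In case anyone wants to poke at it, drawing simulees from a skew-normal can be done with the sn package; the parameter values below are placeholders, not the empirical MAPPD/R03 estimates:

```r
# Placeholder sketch: skew-normal generating distribution via the sn package
# (xi/omega/alpha values are illustrative, not the empirical estimates)
library(sn)

set.seed(2)
thetas_t <- rsn(n = 1000, xi = 60, omega = 15, alpha = -4)  # T-score metric
thetas   <- (thetas_t - 50) / 10                            # back to theta
```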

I think the question here will be whether the worse performance of MLEF in the center of the theta distribution is tolerable, given the shrinkage issue we've identified with EAP, which I think remains an issue even with a 30-item CAT and extreme response strings. Or are we worrying too much about edge cases that will affect a small number of users? I should be able to post the results from the empirically derived theta distribution later today. Given the higher density in the middle of the theta distribution there, I expect they'll show a larger average advantage for EAP.

Let me know if you'd like to see other conditions as well.

william-hula commented 2 years ago

OK, sorry folks, we need to hold up on interpreting those charts I sent. I took a deeper look at the simulation results and something is weird: the correlations between the generating and estimated thetas are in the 0.70-0.75 range, which seems super low, and the scatter plots of estimated theta over generating theta look strange for both MLEF and EAP:

[Figure: scatter plot of estimated theta over generating theta]

I need some time to see what's going on here. It might have something to do with the T-score transformation (I doubt it, but maybe).
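For anyone following along, the check is essentially this (using `res` and `thetas` as in the simulation sketch above; illustrative only):

```r
# Recovery diagnostics: correlations and scatter of estimated vs. generating
# theta (res and thetas as in the earlier simulation sketch)
cor(thetas, res[, "eap"])
cor(thetas, res[, "mlef"])

plot(thetas, res[, "eap"],
     xlab = "Generating theta", ylab = "Estimated theta (EAP)")
abline(0, 1, lty = 2)  # identity line; good recovery should hug this
```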