saudiwin / idealstan

idealstan offers item-response theory (IRT) ideal-point estimation for binary, ordinal, counts and continuous responses with time-varying and missing-data inference. Latent space model also included. Full and approximate Bayesian sampling with 'Stan' (www.mc-stan.org).
https://cran.r-project.org/web/packages/idealstan/index.html
GNU General Public License v2.0

GRM summary error finding item parameters + multiple difficulty parameters? #26

Open PatrickJEdwards opened 11 months ago

PatrickJEdwards commented 11 months ago

Hello Robert Kubinec,

I'm using your package to run a dynamic ordinal ideal point model. Note that I'm using the development version of your package. I have two issues to discuss with you:

  1. While trying to access the item discrimination and difficulty parameters with the summary function, I get the error "Can't find the following variable(s) in the output: steps_votes".
  2. Why is there only one difficulty parameter per item when graded response models should have multiple difficulty parameters for each item (usually equal to the number of response categories - 1)?

I'll first provide some background on my data and the idealstan code I'm using. Then I'll provide more details on the two aforementioned issues.

My data consists of ordinal Likert-type items over four time periods (2011, 2015, 2019, and 2022). Items from the 2011 and 2015 periods have 5 response categories, while items from the 2019 and 2022 periods have 4 response categories. For all items, I'm using model_type = 5, the ordinal IRT (graded response) ideal point model with no missing-data inflation.

Here's my code:

idealstan_object <- id_make(
  score_data = Nat_DAN_IRT_Data_long_GALTAN,
  person_id = "person_id",
  item_id = "VAA_item_questions",
  time_id = "year",
  group_id = "party_abbr",
  model_id = "idealstan_model",
  unbounded = FALSE,
  outcome_disc = "VAA_item_answers",
  ordered_id = "n_responses"
)
DAN_GALTAN_idealstan_output_PREFIX <- id_estimate(
  idealdata = idealstan_object,
  model_type = 5,
  vary_ideal_pts = "random_walk",
  fixtype = "prefix",
  const_type = "items",
  restrict_ind_high = "Alt2022_Q3",
  restrict_ind_low = "Alt2019_Q26",
  ncores = parallel::detectCores(),
  grainsize = 1,
  nchains = 4,
  id_refresh = 10,
  time_var = 0.5,
  restrict_var = FALSE
)

I'll now provide more information about the aforementioned issues.

Issue 1: Inaccessible Item Parameters

This is the code I use to access the item parameters:

summary(DAN_GALTAN_idealstan_output_PREFIX, pars = "items")

That code produces the following error: Error: Can't find the following variable(s) in the output: steps_votes

After looking through the code myself, I found that this line of code in the .item_plot_ord_grm function in the Helpers.R file is causing the error:

total_cat <- length(as_draws_df(object@stan_samples$draws('steps_votes')))

I tried to fix the issue myself by modifying that code line and an additional line with the following:

# figure out how many categories we need
## I still don't really know what this is doing. I'm currently interpreting it as the number of unique response categories in the dataframe. I changed it so that it obtains this number from the score_matrix
total_cat <- unique(object@score_data@score_matrix[object@score_data@score_matrix$item_id == param_name,]$ordered_id)

# Obtain the cuts. I modified this significantly despite not being sure if this is how it works. Notice that it subtracts 1 from total_cat. This is because it includes the 'missing' category
cuts <- as_draws_df(object@stan_samples$draws(paste0('steps_votes_grm', total_cat, '[', param_num, ',', 1:(total_cat - 1), ']')))

Did I identify the cause of the error? And do you think that my modifications adequately fixed the problem?
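To make the naming question concrete, here is a minimal base-R sketch of the lookup the modified lines perform. It assumes, based only on the paste0() call above, that cutpoints are stored under names of the form steps_votes_grm<K>[item,cut]; the vector `available` and the helper `cut_names` are hypothetical and stand in for whatever the fitted object really contains. The length check is there because unique() could return more than one value if an item appears with different category counts.

```r
# Hypothetical variable names standing in for a fitted model's output.
# The "steps_votes_grm<K>" pattern is taken from the modified line above;
# the specific indices are invented for illustration.
available <- c(
  "steps_votes_grm4[1,1]", "steps_votes_grm4[1,2]", "steps_votes_grm4[1,3]",
  "steps_votes_grm5[2,1]", "steps_votes_grm5[2,2]",
  "steps_votes_grm5[2,3]", "steps_votes_grm5[2,4]"
)

# Build the cutpoint names for one item: K categories give K - 1 cutpoints.
cut_names <- function(param_num, total_cat) {
  # Guard against unique() having returned zero or several category counts.
  stopifnot(length(total_cat) == 1, total_cat >= 2)
  paste0("steps_votes_grm", total_cat, "[", param_num, ",",
         seq_len(total_cat - 1), "]")
}

wanted <- cut_names(param_num = 2, total_cat = 5)

# Names that would trigger a "Can't find the following variable(s)" error.
missing_vars <- setdiff(wanted, available)

# The bare name requested by the original line is not among the stored names.
bare_name_present <- "steps_votes" %in% available
```

If `missing_vars` is empty, every requested cutpoint exists under the constructed names; a non-empty result would point to a mismatch in either the category count or the item index.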

Issue 2: Too Few Difficulty Parameters

Secondly, it seems like the command summary(DAN_GALTAN_idealstan_output_PREFIX, pars = "items") only produces estimates for one difficulty parameter per item. Yet graded response models have several difficulty parameters per item, one for each boundary between adjacent ordered response categories. Do you know why this is happening? How can I obtain these category-specific difficulty parameters?

PatrickJEdwards commented 11 months ago

After some additional thought and internet searching on the second issue, I realized that the cutpoints are the response category-specific difficulty parameters. Is this true for the graded response models?
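As a small illustration of why cutpoints play the role of category-specific difficulties, the sketch below computes category probabilities for a single item under a generic ordered-logit model. The function `grm_probs`, the sign convention, and all numeric values are assumptions made for illustration and are not taken from the package.

```r
# Category probabilities for one item under an ordered-logit (graded response)
# model. All numbers are made up; the sign convention is an assumption.
grm_probs <- function(theta, discrim, cuts) {
  stopifnot(!is.unsorted(cuts))           # cutpoints must be increasing
  eta <- discrim * theta                  # linear predictor for this person
  # P(Y <= c) at each cutpoint, padded with 0 and 1 at the two ends
  cum <- c(0, plogis(cuts - eta), 1)
  diff(cum)                               # P(Y = c) for each category
}

cuts_5cat <- c(-1.5, -0.5, 0.4, 1.2)      # 4 cutpoints -> 5 categories
p <- grm_probs(theta = 0.3, discrim = 1.1, cuts = cuts_5cat)
```

An item with 5 response categories needs 4 cutpoints, and each cutpoint marks where the probability of responding above it crosses 0.5, which is what a difficulty parameter does in the binary case.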

If so, then that raises an additional question for you. Why do you use reg_diff in the following line of code to calculate the non-missing inflated midpoints (lines of indifference)?

reg_mid <- (reg_diff+cuts[[c]])/reg_discrim

In fact, I'm struggling to understand the role of the reg_diff parameter generally in the context of the graded response model. Any clarification on this additional point would be much appreciated.

saudiwin commented 11 months ago

hi @PatrickJEdwards - thanks for all these Qs, and I'll do my best to help you understand the package. the GRM does add some complexity as there are varying intercepts/cutpoints for each item (difficulties). When I wrote the package, this was not easy to code because Stan did not have varying-length arrays. It does now, and I could refactor, but that's a project for the future.

In any case, this model does work and it should have all the info you need. What would help me though is if you could share some subset of your data so I can see what you are doing and run it myself. Or perhaps you can generate/replace with fake data if you can't share it.
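For anyone who cannot share real data, a minimal base-R sketch of a fake long-format data set might look like the following. The column names mirror the id_make() arguments used in this thread; the item labels, party labels, and sample size are invented, and the responses are drawn uniformly at random, so they carry no latent structure. The model_id column is left out because its expected coding is not stated in this thread.

```r
set.seed(1)
n_person <- 20

# One row per item; category counts follow the 5/5/4/4 pattern described above.
items <- data.frame(
  item_id    = c("A1", "A2", "B1", "B2"),   # hypothetical item labels
  time_id    = c(2011, 2015, 2019, 2022),
  ordered_id = c(5, 5, 4, 4)                # number of response categories
)

# merge() with no shared columns returns every person-item combination.
fake <- merge(data.frame(person_id = seq_len(n_person)), items)

# Hypothetical party labels assigned by person.
fake$group_id <- paste0("party_", (fake$person_id %% 3) + 1)

# Uniform random responses between 1 and the item's category count.
fake$outcome_disc <- vapply(
  fake$ordered_id,
  function(k) sample.int(k, 1),
  integer(1)
)
```

A data frame like this is only useful for checking that the code path runs and reproduces the error, not for judging the estimates.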

In response to your second Q, the GRM is parameterized by having C cutpoints for J items, so there is an array/matrix of dimension C x J. Each item also has an intercept, which is essentially the difficulty mentioned above. The actual difficulty is the result of intercept - cutpoint c for level c, which the code above is inverting to get the midpoint. (of course if you find a bug in the code, am happy to fix, but it looks correct).
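A quick numeric check of the midpoint line quoted above can be sketched in base R. It assumes the cumulative probability is P(Y <= c) = plogis(cut_c - (discrim * theta - diff)); that sign convention is chosen here only because it makes the quoted formula come out, and it may differ from what the Stan code actually does. All values are made up.

```r
# Made-up item parameters for one item with 4 response categories.
reg_discrim <- 1.2
reg_diff    <- 0.3
cuts        <- c(-1.0, 0.2, 1.5)

# Assumed model: P(Y <= k | theta) = plogis(cuts[k] - (discrim * theta - diff)).
p_at_or_below <- function(theta, k) {
  plogis(cuts[k] - (reg_discrim * theta - reg_diff))
}

# Midpoints as in the quoted line, computed for all cutpoints at once.
reg_mid <- (reg_diff + cuts) / reg_discrim

# Evaluate each cumulative probability at its own midpoint.
p_mid <- mapply(p_at_or_below, theta = reg_mid, k = seq_along(cuts))
```

Under that assumption every element of `p_mid` equals 0.5, so each midpoint is the ideal point at which a person is indifferent between answering at or below category c and answering above it.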

PatrickJEdwards commented 11 months ago

Hi @saudiwin, thank you for the prompt reply! Thank you for answering my questions and clearing things up for me.

Attached is an anonymized subset of my data: idealstan_GithubExampleData_subset.csv

I believe the following code would reproduce the error:

#Create Danish GAL-TAN `idealstan` object:
idealstan_object <- id_make(
  score_data = Nat_DAN_IRT_Data_long_GALTAN,
  person_id = "person_id",
  item_id = "item_id",
  time_id = "time_id",
  group_id = "group_id",
  model_id = "model_id",
  unbounded = FALSE,
  outcome_disc = "outcome_disc",
  ordered_id = "ordered_id"
)

#Specify `idealstan` model with random-walk prior and PRE-FIXED CONSTANTS:
idealstan_output_PREFIX <- id_estimate(
  idealdata = idealstan_object,
  model_type = 5,
  vary_ideal_pts = "random_walk",
  fixtype = "prefix",
  const_type = "items",
  restrict_ind_high = "61",
  restrict_ind_low = "34",
  ncores = parallel::detectCores(),
  grainsize = 1,
  nchains = 4,
  id_refresh = 10,
  time_var = 1, #Default=10. Lower values = LESS conservative prior.
  restrict_var = FALSE
)

#Estimate item parameters:
items_summary <- summary(idealstan_output_PREFIX, pars = "items")

#Should produce error: 
#`Error: Can't find the following variable(s) in the output: steps_votes`