tmatta / lsasim

Simulate large scale assessment data
6 stars 5 forks source link

irt_gen function #40

Closed Sinan-Yavuz closed 3 years ago

Sinan-Yavuz commented 3 years ago

Hello,

Thank you very much for creating "lsasim".

I have found a bug and want to share it with you.

irt_gen function doesn't produce a 0-1 matrix based on the parameters. I have changed the following to produce a proper dataset.

irt_gen2 <- function (theta, a_par = 1, b_par, c_par = 0, D = 1)
{
  response_pr <- c_par + (1 - c_par) * (exp(a_par*(theta - b_par)) / (1 + exp(a_par*(theta - b_par))))
  y <- rbinom(1, size = 1, prob = response_pr)
  return(y)
}

Let me know if you need me to provide you some examples, why irt_gen doesn't work.

wleoncio commented 3 years ago

Hi @Sinan-Yavuz, thank you for your contribution! We're gonna take a look at the issue and implement the necessary changes ASAP.

wleoncio commented 3 years ago

In the meantime, it'd be great if you could provide us with some examples why the current implementation of the function doesn't work, so we can add them to our unit test battery.

Sinan-Yavuz commented 3 years ago

Hi @wleoncio, thank you very much for your quick reply.

Here is the example;


response_gen2 <- function (subject, item, theta, a_par = NULL, b_par, c_par = NULL, 
                           d_par = NULL, item_no = NULL, ogive = "Logistic") 
{
  if (length(subject) != length(item)) 
    stop("subject and item vectors must be equal length.", 
         call. = FALSE)
  if (is.null(a_par)) 
    a_par <- rep(1, length(unique(item)))
  if (is.null(c_par)) 
    c_par <- rep(0, length(unique(item)))
  if (ogive == "Logistic") 
    DD <- 1
  if (ogive == "Normal") 
    DD <- 1.7
  if (is.null(item_no)) 
    item_no <- seq(length(unique(item)))
  if (is.null(d_par)) 
    b_pars <- split(b_par, seq(length(b_par)))
  if (!is.null(d_par)) {
    d_mat <- do.call("cbind", d_par)
    b_pars <- list()
    for (i in 1:length(b_par)) {
      if (sum(abs(d_mat[i, ])) != 0) {
        b_list <- list()
        for (j in 1:length(d_mat[i, ])) b_list[[j]] <- b_par[i] + 
            d_mat[i, j]
        b_pars[[i]] <- unlist(b_list)
      }
      if (sum(abs(d_mat[i, ])) == 0) 
        b_pars[[i]] <- b_par[i]
    }
  }
  names(b_pars) <- item_no

  y <- numeric(length(subject))
  for (n in 1:length(subject)) {
    y[n] <- irt_gen2(theta = theta[subject[n]], a_par = a_par[which(item_no == 
                                                                      item[n])], b_par = b_pars[[which(item_no == item[n])]], 
                     c_par = c_par[which(item_no == item[n])], D = DD)
  }
  df_l <- data.frame(item = item, subject = subject, response = y)
  df_w <- reshape(df_l, timevar = "item", idvar = "subject", 
                  direction = "wide")
  df_item_old <- colnames(df_w)[2:length(df_w)]
  df_item_num <- gsub("[^[:digit:]]", "", df_item_old)
  df_item_new <- ifelse(nchar(df_item_num) == 1, paste0("i00", 
                                                        df_item_num), ifelse(nchar(df_item_num) == 2, paste0("i0", 
                                                                                                             df_item_num), paste0("i", df_item_num)))
  colnames(df_w)[2:length(df_w)] <- df_item_new
  df_w <- df_w[, order(names(df_w))]
  rownames(df_w) <- NULL
  return(y = df_w)
}

irt_gen2 <- function (theta, a_par = 1, b_par, c_par = 0, D = 1)
{
  response_pr <- c_par + (1 - c_par) * (exp(a_par*(theta - b_par)) / (1 + exp(a_par*(theta - b_par))))
  y <- rbinom(1, size = 1, prob = response_pr)
  return(y)
}

set.seed(345)

#this is a normal example
subject <- 1:100
theta <- rnorm(100,0,1)
student_dt <- data.frame(subject, theta)

item <- as.integer(c(1:n_items))
a <- runif(n_items, 0.5, 1.5)
b <- runif(n_items, -3, 3)
c <- runif(n_items, 0, .15)
item_par <- data.frame(item, a, b, c)

resp_matrix <- response_gen(subject = sort(rep(subject, 30)), item = rep(item,100), theta = theta, a_par = a, b_par = b, c_par = c)

#item means make sense and everything looks fine
colMeans(resp_matrix[item])

lm(colMeans(resp_matrix[item]) ~ b)

#Let's make it very extreme, student mean theta is -20, I don't expect them to solve anything. But guessing parameters are super high. So, regardles if their theta 80%-90% of them should solve these items.
subject <- 1:100
theta <- rnorm(100, -20 ,1)
student_dt <- data.frame(subject, theta)

item <- as.integer(c(1:n_items))
a <- runif(n_items, 0.5, 1.5)
b <- runif(n_items, -3, 3)

#chance is super high
c <- runif(n_items, 0.8, .9)
item_par <- data.frame(item, a, b, c)

resp_matrix <- response_gen(subject = sort(rep(subject, 30)), item = rep(item,100), theta = theta, a_par = a, b_par = b, c_par = c)

colMeans(resp_matrix[item]) #look at the results, they are around 50%, what happened to .8-.9 guessing parameter
lm(colMeans(resp_matrix[item]) ~ b)

#let's check after the fix - but it only works for dichotomous variables
resp_matrix2 <- response_gen2(subject = sort(rep(subject, 30)), item = rep(item,100), theta = theta, a_par = a, b_par = b, c_par = c)
colMeans(resp_matrix2[item]) #look at the results, they are around 50%, what happened to .8-.9 guessing parameter
lm(colMeans(resp_matrix2[item]) ~ b)
pdbailey0 commented 3 years ago

Hi, I worked to identify this bug with @Sinan-Yavuz and the way you can tell this is a bug is that the intercept of the lm should move up as the guessing parameter increases.

Your formula works so long as the c parameter is 0. When it is not zero you make the wrong probability of a zero

The problem is that the binary is being treated like the polytomous case. In the binary case there is a guessing parameter and you do not want to divide by the denominator, the second element of the numerator is already the probability of a 1. c_par decreases, not increases the probability of a zero, but your formula increases it.

mollyolaf commented 3 years ago

Thanks, Paul and Sinan - that's super helpful. We'll sort this out. Glad you are using lsasim!

wleoncio commented 3 years ago

Checking the MRE

OK, so I've created a markdown document based on the code above. Please check it here and let me know if the output conforms to your expectations.

Fixing the bug

As for implementing the changes, I am not sure how to proceed. The irt_gen2() function proposed in the OP doesn't break any of the current unit tests, but I understand it only works for dichotomous variables, so I am not sure if it can simply replace irt_gen() and call it a day. I'd also like to understand a bit better the differences between lsasim::response_gen() and the response_gen2() function from your MRE.

pdbailey0 commented 3 years ago

I think you may want to break up dichotomous from polytomous cases. I've never seen a polytomous case with a c parameter.

I think you could add this line right before you sample

  response_pr[1] <- ifelse(c_par > 0, 1 - response_pr[2], response_pr[1])

so irt_get would be:

irt_gen <- function(theta, a_par = 1, b_par, c_par = 0, D = 1) {
  unsummed <- c(0, D * a_par * (theta - b_par))
  num <- exp(cumsum(unsummed))
  den <- sum(num)
  response_pr <- c_par + (1 - c_par) * (num / den)
  response_pr[1] <- ifelse(c_par > 0, 1 - response_pr[2], response_pr[1])
  y <- sample(1:length(response_pr) - 1, size = 1, prob = response_pr)
  return(y)
}

but I haven't tested this and I'm still not entirely sure what is vectorized here.

Another workaround would be to use Sinan's code when c_par > 0 and your existing code otherwise.

wleoncio commented 3 years ago

Good ideas, @pdbailey0, we'll work on that ASAP. 👍🏽

wleoncio commented 3 years ago

Changes implemented and will be included on the next CRAN release of the package. @Sinan-Yavuz and @pdbailey0, thank you once again for your contribution; we have added you as package contributors, please let us know if the information below needs any modification (e.g. typo or addition of a field such as e-mail):

https://github.com/tmatta/lsasim/blob/97934b455b6e2fc6520f38dc58d85f3fca862af9/DESCRIPTION#L17-L18

Sinan-Yavuz commented 3 years ago

@wleoncio, I am honored to be mentioned as a contributor though it is not necessary. I will make a modification to my name and email. At some point, if you change your mind and think my contribution is no longer valid or important, feel free to remove me.

wleoncio commented 3 years ago

Looking forward to the Pull Request containing the modifications, @Sinan-Yavuz!

Your and Paul's mentions were based on the documentation for the person() function (see extract below) and your authorship of the code on commits 5f3396fdd5075a6ac3181fcd2cb4252db0ca9086 and dfb145087bd12673d59730fbb899e10a38cae891.

‘"ctb"’ (Contributor) Use for authors who have made smaller contributions (such as code patches etc.) but should not show up in the package citation.

By "package citation", they mean the contents of citation("lsasim").