ryentes / careless

Other
25 stars 8 forks source link

Resampled Individual Reliability Function Code #39

Open vittorio-g opened 7 months ago

vittorio-g commented 7 months ago

As previously discussed with @ryentes I'm also posting the code to compute the Resasmpled Individual Reliability Index. This should provide a more reliable way to assess the individual reliability compared to the evenodd index. In fact, as it is shown in the following graph (showing the value of the mean index value against the number of iterations involed in the computation), the variance of the mean is quite high when computng only one iteration (as it's the case for the "evenodd" function). The value becomes more stable around 10 iterations.

image

The following is the code I used to plot the graph:

library(abind)

rir <- function(x,factors,iterations){
  #objects
  df1_mean <- matrix(nrow = nrow(x),
                     ncol = length(factors)) %>% data.frame()

  df2_mean <- matrix(nrow = nrow(x),
                     ncol = length(factors)) %>% data.frame()

  ir <- matrix(nrow = nrow(x),
               ncol = iterations) %>% data.frame()

  for (l in 1:iterations){

    #Objects
    a <- 0
    j <- 1
    for (i in factors){

      #section selection
      a <- a+i
      sec <- x[,(a-i+1):a]

      #random sampling
      sel <- srswor(round(ncol(sec)/2),ncol(sec)) %>% as.logical()

      #mean calculation
      df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
      df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

      j <- j+1  
    }

    #Correlation
    ir[,l] <- sapply(1:nrow(df1_mean), function(x){
      cor(
        as.numeric(df1_mean[x,]),
        as.numeric(df2_mean[x,]),
        use="complete.obs"
      )
    }
    )
  }

  meanIr <- apply(ir,1,mean)
  riri <- -2*meanIr/(1 + meanIr)
  riri
}

itMeans <- c()
totMeans <- c()

for (l in c(1,5,10,15,20,25)){

  for (i in 1:10){
    itMeans[i] <- mean(rir(careless_dataset2,rep(10,10),l))
  }

  totMeans <- rbind(totMeans,
                    cbind(rep(l,10),itMeans))

}

plot(totMeans, main="Distribution of the mean each number of iterations")

The following is the code I'm proposing for a "Resampled Individual Reliability" function:

rir <- function(x,factors){

iterations <- 20

#objects
  df1_mean <- data.frame(matrix(nrow = nrow(x),
                     ncol = length(factors)))

  df2_mean <- data.frame(matrix(nrow = nrow(x),
                     ncol = length(factors)))

  ir <- data.frame(matrix(nrow = nrow(x),
               ncol = iterations))

  for (l in 1:iterations){
    ##Cycle for mean calculation
    #Objects
    a <- 0
    j <- 1

     for (i in factors){

        #section selection
        a <- a+i
        sec <- x[,(a-i+1):a]

        #random sampling
        sel <- as.logical(srswor(round(ncol(sec)/2),ncol(sec))) 

        #means
        df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
        df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

        j <- j+1  
    }

    #Correlation
    ir[,l] <- sapply(1:nrow(df1_mean), function(x){
      cor(
        as.numeric(df1_mean[x,]),
        as.numeric(df2_mean[x,]),
        use="complete.obs"
      )
    }
    )
  }

  meanIr <- apply(ir,1,mean)
  riri <- -2*meanIr/(1 + meanIr)
  riri
}
franciscowilhelm commented 7 months ago

Hi, I just checked the code and it looks interesting. Sorry I could not attend the meeting between you guys, I was still on easter vacation. I am able to reproduce your example with a convergence around 10 iterations, very interesting! Vittorio, would it be possible to adapt the code to work with NA data? The current irv() function supports this.

For instance,

#with NA
careless_dataset2_na <- careless_dataset2
careless_dataset2_na[c(5:8),] <- NA #entire observation NA
careless_dataset2_na[1,3] <- NA #single value of obs is NA

riroutput_na <- rir(careless_dataset2_na, rep(10,10))

throws an error; I believe this is due to "use="complete.obs" in the cor() call. Excuse me if you went over this already in the meeting.

vittorio-g commented 7 months ago

Hi! Nice to meet you.

You are absolutely right, when there is a full row of NA the function gives an error. And you were spot on on the cause, I just changed "use" to "everything" and now it works.

I also changed the number of iterations to 10, because frankly there was no reason to keep them at 20.

I'll leave the new code below.

library(sampling)

rir <- function(x,factors){

  iterations <- 10

  #objects
  df1_mean <- data.frame(matrix(nrow = nrow(x),
                     ncol = length(factors)))

  df2_mean <- data.frame(matrix(nrow = nrow(x),
                     ncol = length(factors)))

  ir <- data.frame(matrix(nrow = nrow(x),
               ncol = iterations))

  for (l in 1:iterations){

    #Objects
    a <- 0
    j <- 1

    for (i in factors){

      #section selection
      a <- a+i
      sec <- x[,(a-i+1):a]

      #random sampling
      sel <- as.logical(srswor(round(ncol(sec)/2),ncol(sec))) 

      #means
      df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
      df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

      j <- j+1  
    }

    #Correlation
    ir[,l] <- sapply(1:nrow(df1_mean), function(x){
      cor(
        as.numeric(df1_mean[x,]),
        as.numeric(df2_mean[x,]),
        use="everything"
      )
    }
    )
  }

  meanIr <- apply(ir,1,mean)
  riri <- -2*meanIr/(1 + meanIr)
  riri
}
franciscowilhelm commented 6 months ago

Ciao again, @vittorio-g thank your for the updated function! Do you have a description of the function for the package documentation? It should briefly outline what it does, why it could be preferrable to evenodd and how it differs, and what the numeric ranges are and indicate. I compared it against evenodd and it correlates highly and not redundantly (>.70). Unlike evenodd its not restricted to -1 to +1, but can have values above 1. With use = "everything", it calculates the within-person correlation coefficient for persons who do not have any NA. Could it be calculated also with use = ""pairwise.complete.obs" for persons who have at least some available data? Sorry for the many questions 😄.

Below is the slightly modified code with the documentation skeleton structure.

#' Resampled Individual Reliability / Even-Odd
#'
#' @param x a matrix of data (e.g. survey responses)
#' @param factors factors a vector of integers specifying the length of each
#' factor in the dataset
#' @param iterations Number of iterations of resampling
#' @return
#' @export
#' @author Vittorio G.

#' @examples
rir <- function(x,factors, iterations=10){

    #objects
    df1_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    df2_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    ir <- data.frame(matrix(nrow = nrow(x),
                            ncol = iterations))

    for (l in 1:iterations){

        #Objects
        a <- 0
        j <- 1

        for (i in factors){

            #section selection
            a <- a+i
            sec <- x[,(a-i+1):a]

            #random sampling

            #srswor function from sampling package, by Yves Tille and Alina Matei
            srswor_local <- function (n, N)
            {
                s <- rep(0, times = N)
                s[sample.int(N, n)] <- 1
                s
            }

            sel <- as.logical(srswor_local(round(ncol(sec)/2),ncol(sec)))

            #means
            df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
            df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

            j <- j+1
        }

        #Correlation
        ir[,l] <- sapply(1:nrow(df1_mean), function(x){
            cor(
                as.numeric(df1_mean[x,]),
                as.numeric(df2_mean[x,]),
                use="everything"
            )
        }
        )
    }

    meanIr <- apply(ir,1,mean)
    riri <- -2*meanIr/(1 + meanIr)
    riri
}
vittorio-g commented 6 months ago

Hi again,

For the description of the function I'd say we can use something like this (I used the Even-Odd function phrasing when possible): "rir computes the Resampled Individual Reliability index. It divides unidimensional scales randomly sampling their items; two scores, one for each half, are then computed as the average response across subscale items. A within-person correlation is computed based on the two sets of subscale scores for each scale. Finally, the computation is repeated a number of times equal to the “iterations” value (default of 10), the result is the mean value of all iterations. The Resampled Individual Reliability index is a more reliable version of the Even-Odd index as it reduces the error due to the selection of odd subgroups thanks to the averaging of multiple draws. Its values range from -1 to +∞. High values indicate a low respondent’s reliability, low values implicate a good reliability.”

If reference is needed, I based the index on the work of Curran (2016).

The values above 1 should be due to the Spearman-Brown profecy correction. Do you think they are a problem?

I think the correlation could be better calcuated using use = "pairwise.complete.obs" too. I'll leave the updated code below, I also added my e-mail and an example to the documentation skeleton structure.

Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. In Journal of Experimental Social Psychology (Vol. 66, pp. 4–19). Elsevier BV. https://doi.org/10.1016/j.jesp.2015.07.006

#' Resampled Individual Reliability / Even-Odd
#'
#' @param x a matrix of data (e.g. survey responses)
#' @param factors factors a vector of integers specifying the length of each
#' factor in the dataset
#' @param iterations Number of iterations of resampling
#' @return
#' @export
#' @author Vittorio Guerrieri \email{vittoguerri@gmail.com}
#' @examples
#' careless_rir <- rir(careless_dataset, rep(5,10))

rir <- function(x,factors, iterations=10){

    #objects
    df1_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    df2_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    ir <- data.frame(matrix(nrow = nrow(x),
                            ncol = iterations))

    for (l in 1:iterations){

        #Objects
        a <- 0
        j <- 1

        for (i in factors){

            #section selection
            a <- a+i
            sec <- x[,(a-i+1):a]

            #random sampling

            #srswor function from sampling package, by Yves Tille and Alina Matei
            srswor_local <- function (n, N)
            {
                s <- rep(0, times = N)
                s[sample.int(N, n)] <- 1
                s
            }

            sel <- as.logical(srswor_local(round(ncol(sec)/2),ncol(sec)))

            #means
            df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
            df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

            j <- j+1
        }

        #Correlation
        ir[,l] <- sapply(1:nrow(df1_mean), function(x){
            cor(
                as.numeric(df1_mean[x,]),
                as.numeric(df2_mean[x,]),
                use="pairwise.complete.obs"
            )
        }
        )
    }

    meanIr <- apply(ir,1,mean)

    #Spearman-Brown profecy correction and negation of scores for compatibility with other indices
    riri <- -2*meanIr/(1 + meanIr)

    riri
}
franciscowilhelm commented 6 months ago

Great! I applied the same adjustment to the spearman-brown correction as in evenodd, now values range from -1 to 1. The function is in the evenodd_update branch. Also given below for convenience.

Tests comparing it to the even-odd function show correlations between .68-.78 with the sample dataset, varying depending on the iterations.

#' Resampled Individual Reliability / Even-Odd
#'
#' computes the Resampled Individual Reliability index. It divides unidimensional scales into two halves by randomly sampling their items; two scores, one for each half, are then computed as the average response across subscale items. A within-person correlation is computed based on the two sets of subscale scores for each scale. Finally, the computation is repeated a number of times equal to the “iterations” value (default of 10), the result is the mean value of all iterations.
#' The Resampled Individual Reliability index is a more reliable version of the Even-Odd index as it reduces the error due to the selection of odd subgroups thanks to the averaging of multiple draws.
#' Its values range from -1 to +1. High values indicate careless responding, low values indicate effortful responding.

#' @param x a matrix of data (e.g. survey responses)
#' @param factors factors a vector of integers specifying the length of each factor in the dataset
#' @param iterations Number of iterations of resampling. Set to 10 by default.
#' @details Iterations of resampling are set to 10 by default, providing a good tradeoff between speed and accuracy based on tests. However, further validation is needed.
#' @return
#' @export
#' @author Vittorio Guerrieri \email{vittoguerri@gmail.com}
#' @references Based on Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. In Journal of Experimental Social Psychology (Vol. 66, pp. 4–19). Elsevier BV. https://doi.org/10.1016/j.jesp.2015.07.006
#' @examples
#' careless_rir <- rir(careless_dataset, rep(5,10))

rir <- function(x,factors, iterations=10){

    #objects
    df1_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    df2_mean <- data.frame(matrix(nrow = nrow(x),
                                  ncol = length(factors)))

    ir <- data.frame(matrix(nrow = nrow(x),
                            ncol = iterations))

    for (l in 1:iterations){

        #Objects
        a <- 0
        j <- 1

        for (i in factors){

            #section selection
            a <- a+i
            sec <- x[,(a-i+1):a]

            #random sampling

            #srswor function from sampling package, by Yves Tille and Alina Matei
            srswor_local <- function (n, N)
            {
                s <- rep(0, times = N)
                s[sample.int(N, n)] <- 1
                s
            }

            sel <- as.logical(srswor_local(round(ncol(sec)/2),ncol(sec)))

            #means
            df1_mean[,j] <- apply(sec[sel],1,mean,na.rm=T)
            df2_mean[,j] <- apply(sec[!sel],1,mean,na.rm=T)

            j <- j+1
        }

        #Correlation
        ir[,l] <- sapply(1:nrow(df1_mean), function(x){
            cor(
                as.numeric(df1_mean[x,]),
                as.numeric(df2_mean[x,]),
                use="pairwise.complete.obs"
            )
        }
        )
    }

    meanIr <- apply(ir,1,mean)

    #Spearman-Brown profecy correction and negation of scores for compatibility with other indices
    riri <- -2*meanIr/(1 + abs(meanIr))

    riri
}
vittorio-g commented 6 months ago

Fantastic! If you need anything else don't hesitate to ask :)