tomaszaba / ipccheckr

Toolkit for Performing IPC Acute Malnutrition-related Data Checks
GNU General Public License v3.0

Probit approach for when SD is problematic #16

Closed tomaszaba closed 2 months ago

tomaszaba commented 2 months ago

Hi Ernest,

I have a question about how to approach the calculation of prevalence using the Probit approach. In SMART, when the standard deviation is outside the 0.8-1.2 range (problematic), the observed prevalence is ignored and a new one is calculated assuming a standard deviation of 1.0. I initially thought this could be achieved by applying the formula (zscore_value - mean of the z-score variable) / standard deviation of the z-score variable, which yields normalised z-scores with mean = 0 and a standard deviation of 1.0. So I wrote this function:

normalize_zscore <- function(x) {

  ## Get mean zscore ----
  mean_x <- mean(remove_flags(x, "zscore"), na.rm = TRUE)
  ## Get zscore's standard deviation ----
  std_x <- sd(remove_flags(x, "zscore"), na.rm = TRUE)
  ## Empty vector ----
  norm_x <- numeric(length(x))
  ## Normalize each element in x ----
  for (i in seq_along(x)) {
    norm_x[i] <- (x[i] - mean_x) / std_x
  }
  norm_x
}
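
Just to illustrate what the helper does, here is a quick check on simulated z-scores (the values are made up, and plain mean()/sd() stand in for remove_flags(), which is an internal helper):

## Simulated WFHZ values (hypothetical, for illustration only)
set.seed(123)
z <- rnorm(500, mean = -1.2, sd = 1.4)

## Same formula as in normalize_zscore()
z_norm <- (z - mean(z, na.rm = TRUE)) / sd(z, na.rm = TRUE)

mean(z_norm)  # ~ 0
sd(z_norm)    # 1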

which I then use as a helper in this other, bigger/main function:

compute_wfhz_prevalence <- function(df, .wt = NULL, .edema = NULL, .summary_by) {
  ## Get and classify standard deviation ----
  x <- df[["wfhz"]]
  std <- classify_sd(sd(remove_flags(x, "zscore"), na.rm = TRUE))

  if (std != "Problematic") {
    ### Compute observed prevalence ----
    df <- with(
      df,
      define_wasting(df,
        zscore = .data$wfhz,
        edema = {{ .edema }},
        base = "wfhz"
      )
    )
    #### Create a survey object ----
    if (!is.null(.wt)) {
      srvy <- df |>
        as_survey_design(
          ids = .data$cluster,
          pps = "brewer",
          variance = "YG",
          weights = {{ .wt }}
        )
    } else {
      srvy <- df |>
        as_survey_design(
          ids = .data$cluster,
          pps = "brewer",
          variance = "YG"
        )
    }

    #### Summarise prevalence ----
    p <- srvy |>
      group_by({{ .summary_by }}) |>
      filter(.data$flag_wfhz == 0) |>
      summarise(
        across(
          c(.data$gam:.data$mam),
          list(
            n = \(.)sum(., na.rm = TRUE),
            p = \(.)survey_mean(.,
              vartype = "ci",
              level = 0.95,
              deff = TRUE,
              na.rm = TRUE
            )
          )
        ),
        wt_pop = round(sum(srvyr::cur_svy_wts()))
      )
    p
  }

  if (std == "Problematic") {
    ### Compute prevalence with a normalized zscores ----
    df <- df |>
      mutate(wfhz = normalize_zscore(.data$wfhz)) |>
      define_wasting(zscore = .data$wfhz, edema = {{ .edema }}, base = "wfhz")

    if (!is.null(.wt)) {
      srvy <- df |>
        as_survey_design(
          ids = .data$cluster,
          pps = "brewer",
          variance = "YG",
          weights = {{ .wt }}
        )
    } else {
      srvy <- df |>
        as_survey_design(
          ids = .data$cluster,
          pps = "brewer",
          variance = "YG"
        )
    }
    #### Summarise prevalence ----
    p <- srvy |>
      group_by({{ .summary_by }}) |>
      filter(.data$flag_wfhz == 0) |>
      summarise(
        across(
          c(.data$gam:.data$mam),
          list(
            n = \(.)sum(., na.rm = TRUE),
            p = \(.)survey_mean(.,
              vartype = "ci",
              level = 0.95,
              deff = TRUE,
              na.rm = TRUE
            )
          )
        ),
        wt_pop = sum(srvyr::cur_svy_wts())
      )
  }
  p
}

but the result I get is different from what ENA for SMART returns when the standard deviation is problematic. The rest of the code blocks work well.

So then I realised that they use the Probit approach. I wonder how I should approach this in ipccheckr?

I also wish to add Green's Index statistics (GI and its p-value) to the summarised table returned by the above function. As I find it difficult to work with nipnTK::greensIndex(), I wrote to Mark today, as you had advised back in April.

Thanks.

tomaszaba commented 2 months ago

Hi @ernestguevarra, I have addressed this using stats::pnorm(). I modified the helper function that was meant to normalise the z-scores into the following:

normalize_prevalence <- function(x, .status = c("gam", "sam")) {
  .status <- match.arg(.status)
  mean_x <- mean(remove_flags(x, "zscore"), na.rm = TRUE)
  ## Return GAM and SAM prevalence with a SD = 1
  switch(
    .status,
    "gam" = {pnorm(q = -2, mean = mean_x, sd = 1, lower.tail = TRUE, log.p = FALSE)},
    "sam" = {pnorm(q = -3, mean = mean_x, sd = 1, lower.tail = TRUE, log.p = FALSE)}
  )
}
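
For example, with a hypothetical mean WFHZ of -1.45 (a made-up value, not from any survey), the same quantities can be read straight off pnorm():

mu <- -1.45  # hypothetical mean of the flag-free WFHZ

pnorm(q = -2, mean = mu, sd = 1)  # GAM with SD fixed at 1: ~0.29
pnorm(q = -3, mean = mu, sd = 1)  # SAM with SD fixed at 1: ~0.06

## MAM is then the difference between the two
pnorm(q = -2, mean = mu, sd = 1) - pnorm(q = -3, mean = mu, sd = 1)  # ~0.23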

Then I use it in the main function, in the code block for when the SD is problematic:

compute_wfhz_prevalence <- function(df,
                                    .wt = NULL,
                                    .edema = NULL,
                                    .summary_by) {
  ## Get and classify standard deviation ----
  x <- df[["wfhz"]]
  std <- classify_sd(sd(remove_flags(x, "zscore"), na.rm = TRUE))

  if (std != "Problematic") {
    ### Compute observed prevalence ----
    p <- df |>
      get_wfhz_prevalence_estimates(
        .wt = {{ .wt }},
        .edema = {{ .edema }},
        .summary_by = {{ .summary_by }}
      )
  } else {
    ### Compute prevalence with a standard deviation of 1 ----
    p <- dplyr::tibble(
      gam = normalize_prevalence(x, .status = "gam"),
      sam = normalize_prevalence(x, .status = "sam"),
      mam = gam - sam
    )
  }
  p
}
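
For reference, a call would then look like this (anthro_df and the column names are placeholders for whatever the wrangling step produces):

## `anthro_df` stands for a processed survey data frame with
## wfhz, flag_wfhz, cluster and edema columns
p <- compute_wfhz_prevalence(
  df = anthro_df,
  .wt = NULL,
  .edema = edema,
  .summary_by = province
)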

This leads to the same results as ENA when those conditions are met.

The only pending issue, on this and all the other prevalence functions, is the one I opened a few days ago, where I think I need to write a for loop to iterate over an intermediate data frame of multiple survey areas so that the prevalence is computed on a case-by-case basis.
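
Roughly what I have in mind (the `area` column and the other names are placeholders):

## Split the data by survey area so that the SD check, and the fallback
## to SD = 1 when it is problematic, are applied to each area separately
areas <- split(df, df$area)

p_list <- vector("list", length(areas))
for (i in seq_along(areas)) {
  p_list[[i]] <- compute_wfhz_prevalence(
    df = areas[[i]],
    .edema = edema,
    .summary_by = province
  )
}
names(p_list) <- names(areas)

## Stack the per-area results into a single summary table
dplyr::bind_rows(p_list, .id = "area")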

What do you think of the above approach?