rubenarslan / codebook

Cook rmarkdown codebooks from metadata on R data frames
https://rubenarslan.github.io/codebook/
Other
142 stars 16 forks source link

`detect_missing` does not work with integer columns #47

Closed chengchou closed 4 years ago

chengchou commented 4 years ago

I was trying to use detect_missing to clean the missing data in a dataset, of which a few columns are integers. detect_missing cannot correctly label the missing values.

Consider the following dataset rd1:

rd1 <- tibble(
  x1 = haven::labelled(x = c(32L, 996L, 40L),
    labels = c("Refused to answer" = 996), label = "x1 variable (integer)"),
  x2 = haven::labelled(x = c(32, 996, 40),
    labels = c("Refused to answer" = 996), label = "x1 variable (double)")
)
# Here is the output of `rd1`:
# A tibble: 3 x 2
#                      x1                      x2
#               <int+lbl>               <dbl+lbl>
#  32                      32  
# 996 [Refused to answer] 996 [Refused to answer]
#  40                      40  

The only difference between x1 and x2 is that x1 has only integers. Applying detect_missing will only affect x2, but 996 in x1 remains unchanged.

detect_missing(rd1, missing = c(996))
# # A tibble: 3 x 2
#                        x1                              x2
#                 <int+lbl>                       <dbl+lbl>
#    32                        32  
#   996 [Refused to answer] NA(a) [[996] Refused to answer]
#    40                        40  
# Warning message:
# In detect_missing(rd1, missing = c(996)) :
#   Cannot label missings for integers in variable x1

I looked into the codes of detect_missing and found that the problem was that the function haven::tagged_na does not work with vectors of integers. So you include the condition is.double in a few if statements.

If I modified these these lines (below) by removing the check for is.double, detect_missing will work for columns of integers by converting these columns of integers to column of double. I understand that converting integer to double could cause problems later, but it might not be a bad idea to add an option of letting users allow for the conversion so that missing values can be labelled correctly for integer columns.

detect_missing2 function below is a simple modification of the current detect_missing by adding an extra option force_integer = TRUE or FALSE. (The changes are highlighted.) When force_integer = TRUE, the integer columns will be converted to double and missing values will be labelled.

detect_missing2 <- function (data, only_labelled = TRUE, negative_values_are_missing = TRUE, 
    ninety_nine_problems = TRUE, learn_from_labels = TRUE, missing = c(), 
    non_missing = c(), vars = names(data), use_labelled_spss = FALSE, force_integer = FALSE) 
{
    for (i in seq_along(vars)) {
        var <- vars[i]
        if (is.numeric(data[[var]]) && any(!is.na(data[[var]]))) {
            potential_missing_values <- c()
            if (negative_values_are_missing) {
                potential_missing_values <- unique(data[[var]][data[[var]] < 
                  0])
            }
            labels <- attributes(data[[var]])$labels
            if (learn_from_labels && length(labels)) {
                numeric_representations <- as.numeric(stringr::str_match(names(labels), 
                  "\\[([0-9-]+)\\]")[, 2])
                potentially_untagged <- numeric_representations[is.na(labels)]
                potential_tags <- labels[is.na(labels)]
                if (is.double(data[[var]]) && !all(is.na(haven::na_tag(data[[var]]))) && 
                  length(intersect(potentially_untagged, data[[var]]))) {
                  # For integer vectors, their missing values cannot be tagged,
                  # so we don't need to modify the above if condition for
                  # integer vectors.
                  warning("Missing values were already tagged in ", 
                    var, ". Although", "there were further potential missing values as indicated by", 
                    "missing labels, this was not changed.")
                } else {
                  for (e in seq_along(potentially_untagged)) {
                    pot <- potentially_untagged[e]
                    data[[var]][data[[var]] == pot] <- potential_tags[e]
                  }
                }
            }
            if (ninety_nine_problems) {
                if (any(!is.na(data[[var]])) && (stats::median(data[[var]], 
                  na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) * 
                  5) < 99) {
                  potential_missing_values <- c(potential_missing_values, 
                    99)
                }
                if (any(!is.na(data[[var]])) && (stats::median(data[[var]], 
                  na.rm = TRUE) + stats::mad(data[[var]], na.rm = TRUE) * 
                  5) < 999) {
                  potential_missing_values <- c(potential_missing_values, 
                    999)
                }
            }
            potential_missing_values <- union(setdiff(potential_missing_values, 
                non_missing), missing)
            if ((!only_labelled || haven::is.labelled(data[[var]])) && 
                length(potential_missing_values) > 0) {
                if (only_labelled) {
                  potential_missing_values <- potential_missing_values[potential_missing_values %in% 
                    labels]
                  potential_missing_values <- union(potential_missing_values, 
                    setdiff(labels[is.na(labels)], data[[var]]))
                }
                potential_missing_values <- sort(potential_missing_values)
                with_tagged_na <- data[[var]]
                if (is.double(data[[var]])) {
                  free_na_tags <- setdiff(letters, haven::na_tag(with_tagged_na))
                } else {
                  free_na_tags <- letters
                }
                for (i in seq_along(potential_missing_values)) {
                  miss <- potential_missing_values[i]
                  if (!use_labelled_spss && !all(potential_missing_values %in% 
                    free_na_tags)) {
                    new_miss <- free_na_tags[i]
                  } else {
                    new_miss <- potential_missing_values[i]
                  }
                  that_label <- which(labels == miss)
################################################################################
                  # I replaced `is.double(data[[var]])` with `(force_integer |
                  # is.double(data[[var]]))` below
                  if (length(which(with_tagged_na == miss)) && 
                    (force_integer | is.double(data[[var]])) && !use_labelled_spss) {
                    with_tagged_na[which(with_tagged_na == miss)] <- haven::tagged_na(new_miss)
                  } else if (!force_integer & is.integer(data[[var]])) {
                    warning("Cannot label missings for integers in variable ", 
                      var, " let force_integer = TRUE if you want to label misssings for integers.")
                  }
                  if ((force_integer | is.double(data[[var]])) &&
                    length(that_label) && !use_labelled_spss) {
                    labels[that_label] <- haven::tagged_na(new_miss)
                    names(labels)[that_label] <- paste0("[", 
                      potential_missing_values[i], "] ", names(labels)[that_label])
                  }
################################################################################
                }
                if (use_labelled_spss) {
                  labels <- attributes(data[[var]])$labels
                  if (is.null(labels)) {
                    labels <- potential_missing_values
                    names(labels) <- "autodetected unlabelled missing"
                  }
                  data[[var]] <- haven::labelled_spss(data[[var]], 
                    label = attr(data[[var]], "label", TRUE), 
                    labels = labels, na_values = potential_missing_values, 
                    na_range = attr(data[[var]], "na_range", 
                      TRUE))
                } else if (haven::is.labelled(data[[var]])) {
                  data[[var]] <- haven::labelled(with_tagged_na, 
                    label = attr(data[[var]], "label", TRUE), 
                    labels = labels)
                } else {
                  data[[var]] <- with_tagged_na
                }
            }
        }
    }
    data
}
rubenarslan commented 4 years ago

hi @chengchou Here's some background on why labelled missing values don't work for integers. I haven't checked your code yet, but I would name the argument something like coerce_integer_to_double and I would have to add tests that integers without labelled missing values don't get coerced. I don't consider it very urgent though, as people could simply convert integers beforehand. Have you ever sent a pull request?

chengchou commented 4 years ago

@rubenarslan "tests that integers without labelled missing values don't get coerced" is a good point. coerce_integer_to_double is self-explaining---good name. No I haven't sent a pull request before.

chengchou commented 4 years ago

@rubenarslan Sorry, where does detect_missing locate with R directory?

rubenarslan commented 4 years ago

https://github.com/rubenarslan/codebook/blob/master/R/correct_attributes.R