sfirke / janitor

simple tools for data cleaning in R
http://sfirke.github.io/janitor/

Feature Request: Find the nearest value within a set #243

Closed billdenney closed 6 years ago

billdenney commented 6 years ago

In my data cleaning workflow, I often receive or extract data that should come from a limited set of values. The challenge is that the observed values are close to, but not exactly equal to, the members of that set. Most commonly, this occurs when extracting data from an image (using https://automeris.io/WebPlotDigitizer/ or similar).

Values may come in like: 1.01, 0.99, 2.04, 2.08, 4.2, 4.1

When they should come from the set: 1, 2, 4

I have a function (below) that is able to clean these data to match the closest value within the set. Is this of interest for janitor?

#' Find the nearest value within a set of choices.
#'
#' @details When cleaning dirty input data, it can be useful to find the value nearest a value known to be relevant.  \code{find_nearest} finds the nearest value from \code{choices} for values in \code{x}.
#'
#' \code{tie} is interpreted as:
#' * "first", "last": Use the first (or last) choice in the order provided if there is a tie.
#' * "median-first", "median-last": When more than 2 choices are equally distant, use the first (or last) of the middle by vector index.
#'
#' @param x A vector of values to search for the nearest match within \code{choices}.
#' @param choices A vector of values that \code{x} should match.
#' @param tie How to handle a tie (see details).
#' @param none What to return if no \code{choices} are provided.
#' @return A vector the same length as \code{x} with the nearest choice from \code{choices}.
#' @export
find_nearest <- function(x, choices, tie=c("first", "last", "median-first", "median-last"), none=NA) {
  tie <- match.arg(tie)
  if (length(choices) > 0) {
    choices <- sort(choices)
    # One row per choice, one column per element of x.  outer() returns a
    # matrix even when there is only one choice, which sapply() would not.
    distances <- abs(outer(choices, x, FUN="-"))
    apply(X=distances,
          MARGIN=2,
          FUN=function(d, tie) {
            # Indices of every choice tied for the minimum distance
            ret <- which(d == min(d))
            n <- length(ret)
            if (n > 1) {
              if (tie == "first") {
                ret <- ret[1]
              } else if (tie == "last") {
                ret <- ret[n]
              } else if (tie == "median-first") {
                # Lower middle index (n = 2 -> 1, n = 3 -> 2, n = 4 -> 2)
                ret <- ret[ceiling(n/2)]
              } else if (tie == "median-last") {
                # Upper middle index (n = 2 -> 2, n = 3 -> 2, n = 4 -> 3)
                ret <- ret[floor(n/2) + 1]
              } else {
                stop("Invalid value for 'tie' argument")
              }
            }
            choices[ret]
          },
          tie=tie)
  } else {
    warning("No choices given, returning 'none'")
    rep(none, length(x))
  }
}
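
As a quick illustration of the idea on the example values above, here is a minimal base R sketch. It is not the function proposed for janitor: `nearest_sketch` is a hypothetical name, it assumes numeric input with no missing values, and it only covers the case where ties go to the smaller choice.

```r
# Hypothetical helper (not part of janitor): snap each value to its nearest choice.
# Assumes x and choices are numeric and contain no NA values.
nearest_sketch <- function(x, choices) {
  choices <- sort(choices)
  # which.min() returns the first minimum, so a tie goes to the smaller choice
  idx <- vapply(x, function(v) which.min(abs(v - choices)), integer(1))
  choices[idx]
}

digitized <- c(1.01, 0.99, 2.04, 2.08, 4.2, 4.1)
nearest_sketch(digitized, c(1, 2, 4))
# [1] 1 1 2 2 4 4
```

The full function above adds the other tie-breaking rules and the handling of an empty set of choices.
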
sfirke commented 6 years ago

I think this is too niche for janitor. I'm trying to think of where it would belong. A package for a specific domain of work? MALDIquant::match.closest looks very similar:

c(1, 2, 4)[MALDIquant::match.closest(c(1.01, 0.99, 2.04, 2.08, 4.2, 4.1), c(1, 2, 4))]
[1] 1 1 2 2 4 4

https://www.rdocumentation.org/packages/MALDIquant/versions/1.18/topics/match.closest

billdenney commented 6 years ago

That's perfectly fair. Thanks for finding match.closest()!