spsanderson / TidyDensity

Create tidy probability/density tibbles and plots of randomly generated and empirical data.
https://www.spsanderson.com/TidyDensity
Other
34 stars 1 forks source link

Error thrown by tidy_distribution_comparison #335

Closed rwhitt221 closed 1 year ago

rwhitt221 commented 1 year ago

Good morning.

First off, beautiful package. Does exactly what I want and I found it exactly when I needed it. Very batman.

Anyway there seems to be a slight issue with tidy_distribution_comparison. In the below instance it fails and throws an error.

vec <- c(4.21038180822507, 7.29227965464815, 4.44546814076602, 9.97021094430238,

  12.7220453182235, 12.2547338320874, 17.084160503, 16.9771589315496,

  13.9196691871621, 28.1079112994485, 21.9923905748874, 25.5631629657,

  29.2080506961793, 29.3103481503204, 32.8031855775043, 36.9184419861995,

  33.5645967372693, 39.6332762087695, 39.9071664153598, 33.1843491038308,

  44.3890771060251, 42.9038463486359, 45.1238870527595, 43.0803714995272,

  41.7760691861622, 54.1996862320229, 55.885584554635, 57.9292631917633,

  59.5848086406477, 52.1744495350868, 51.2742422847077, 57.6243086578324,

  63.3894142811187, 69.5936238579452, 69.1801660903729, 68.9220451354049,

  72.1439804229885, 74.8792171361856, 73.1369992066175, 75.2890477189794,

  71.6463644034229, 72.9336590087041, 82.5781239755452, 80.4327263310552,

  81.6488912375644, 87.7530873077922, 88.1146588502452, 97.3822132404894,

  90.5776990670711, 98.9856081269681, 104.934460432269, 109.452040330507,

  103.491765907966, 101.148026892915, 100.944263930432, 111.179752380121,

  113.427576716058, 113.864866825752, 126.105331936851, 124.856045653578,

  123.718750711996, 133.255937143695, 135.044311392121, 138.813832551241,

  137.335359374993, 149.589461353607, 141.588302822784, 142.527900461573,

  155.065253486391, 155.120701580308, 154.577696875203, 154.987382919062,

  169.435900163371, 165.575539364945, 162.548232614063, 175.468045480084,

  170.432634640019, 180.921393528115, 181.758784719277, 186.942011879291,

  191.366305502597, 198.260854685213, 205.45332423877, 206.720861082431,

  217.84482129151, 211.437231055461, 216.511908348184, 225.026640975848,

  224.402389789466, 238.559981840663, 237.435712569859, 234.596182941459,

  243.078954303637, 251.120868842117, 255.885675693862, 269.139523848426,

  262.087598224171, 273.004066818394, 276.28544652136, 285.043469499797,

  282.939362509642, 297.03274406027, 291.905623725615, 304.240023142193,

  307.524585318752, 318.777152581606, 313.089858465828, 325.841427897103,

  335.631692002062, 341.801408070605, 359.816143654753, 369.648197756615,

  378.400738651399, 384.350667928811, 397.20551763894, 405.518638305366,

  411.719402759336, 423.184550385922, 432.092252275907, 449.351194554474,

  460, 461.031961357221, 472.708591001574, 490, 493.59456238104,

  502.864125107881, 518.203699132428, 523.632218681742, 549.89218067145,

  560, 570, 580, 590, 610, 620, 630, 640, 650, 670, 680, 690, 700,

  710, 750, 760, 770, 780, 790, 810, 820, 850, 900, 920, 940, 1010,

  1020, 1040, 1090, 1130, 1150, 1200, 1250, 1310, 1470, 1490, 1650

)

tidy_distribution_comparison(vec, .distribution_type = "continuous")

This throws the error... Error in dplyr::mutate():

! Problem while computing d = list(...).

i The error occurred in group 1: sim_number = 1.

Caused by error in density.default():

! 'x' contains missing values

This seems to be coming from small values produced by util_exponential_param_estimate. Output below.

A tibble: 1 x 3

dist_type param_1 param_2

1 Exponential 0.00297 NA This fun doesn't produce an error so is not caught by the try and is then being rounded to 0. ``` e <- try(util_exponential_param_estimate(x_term)$parameter_tbl %>% dplyr::select(dist_type, rate) %>% dplyr::mutate(param_2 = NA) %>% purrr::set_names("dist_type", "param_1", "param_2")) if (!inherits(e, "try-error")) { te <- tidy_exponential(.n = n, .rate = round(e[[2]], 2)) } ``` Changing the round to a larger number cleared the error for me. Not sure if it's acceptable though! Thanks again.
spsanderson commented 1 year ago

I'm glad you find the package useful, I will try to recreate your error on my machine tomorrow and see what happens.

spsanderson commented 1 year ago

So I was able to reproduce, and if I change round to 3 the error in the function util_exponential_param_estimate() goes away, I will have to see if I can make the same change to all others as I would like to try and keep them uniform, or maybe I can add a parameter to the function(s) to allow an integer to be passed to round()

spsanderson commented 1 year ago

I think this will solve the issue:

#' Compare Empirical Data to Distributions
#'
#' @family Empirical
#'
#' @author Steven P. Sanderson II, MPH
#'
#' @details The purpose of this function is to take some data set provided and
#' to try to find a distribution that may fit the best. A parameter of
#' `.distribution_type` must be set to either `continuous` or `discrete` in order
#' for this the function to try the appropriate types of distributions.
#'
#' The following distributions are used:
#'
#' Continuous:
#' -  tidy_beta
#' -  tidy_cauchy
#' -  tidy_exponential
#' -  tidy_gamma
#' -  tidy_logistic
#' -  tidy_lognormal
#' -  tidy_normal
#' -  tidy_pareto
#' -  tidy_uniform
#' -  tidy_weibull
#'
#' Discrete:
#' -  tidy_binomial
#' -  tidy_geometric
#' -  tidy_hypergeometric
#' -  tidy_poisson
#'
#' The function itself returns a list output of tibbles. Here are the tibbles that
#' are returned:
#' -  comparison_tbl
#' -  deviance_tbl
#' -  total_deviance_tbl
#' -  aic_tbl
#' -  kolmogorov_smirnov_tbl
#' -  multi_metric_tbl
#'
#' The `comparison_tbl` is a long `tibble` that lists the values of the `density`
#' function against the given data.
#'
#' The `deviance_tbl` and the `total_deviance_tbl` just give the simple difference
#' from the actual density to the estimated density for the given estimated distribution.
#'
#' The `aic_tbl` will provide the `AIC` for a `lm` model of the estimated density
#' against the emprical density.
#'
#' The `kolmogorov_smirnov_tbl` for now provides a `two.sided` estimate of the
#' `ks.test` of the estimated density against the empirical.
#'
#' The `multi_metric_tbl` will summarise all of these metrics into a single tibble.
#'
#'
#' @description Compare some empirical data set against different distributions
#' to help find the distribution that could be the best fit.
#'
#' @param .x The data set being passed to the function
#' @param .distribution_type What kind of data is it, can be one of `continuous`
#' or `discrete`
#' @param .round_to_place How many decimal places should the parameter estimates
#' be rounded off to for distibution construction. The default is 3
#'
#' @examples
#' xc <- mtcars$mpg
#' output_c <- tidy_distribution_comparison(xc, "continuous")
#'
#' xd <- trunc(xc)
#' output_d <- tidy_distribution_comparison(xd, "discrete")
#'
#' output_c
#'
#' @return
#' An invisible list object. A tibble is printed.
#'
#' @export
#'

tidy_distribution_comparison <- function(.x, .distribution_type = "continuous",
                                         .round_to_place = 3) {

  # Tidyeval ----
  x_term <- as.numeric(.x)
  n <- length(x_term)
  dist_type <- tolower(as.character(.distribution_type))
  rt <- as.numeric(.round_to_place)

  if (!dist_type %in% c("continuous", "discrete")) {
    rlang::abort(
      message = "The '.distribution_type' parameter must be either 'continuous'
      or 'discrete'.",
      use_cli_format = TRUE
    )
  }

  # Get parameter estimates for distributions
  if (dist_type == "continuous") {
    b <- try(util_beta_param_estimate(x_term)$parameter_tbl %>%
               dplyr::filter(method == "NIST_MME") %>%
               dplyr::select(dist_type, shape1, shape2) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(b, "try-error")) {
      tb <- tidy_beta(.n = n, .shape1 = round(b[[2]], rt), .shape2 = round(b[[3]], rt))
    }

    c <- try(util_cauchy_param_estimate(x_term)$parameter_tbl %>%
               dplyr::select(dist_type, location, scale) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(c, "try-error")) {
      tc <- tidy_cauchy(.n = n, .location = round(c[[2]], rt), .scale = round(c[[3]], rt))
    }

    e <- try(util_exponential_param_estimate(x_term)$parameter_tbl %>%
               dplyr::select(dist_type, rate) %>%
               dplyr::mutate(param_2 = NA) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(e, "try-error")) {
      te <- tidy_exponential(.n = n, .rate = round(e[[2]], rt))
    }

    g <- try(util_gamma_param_estimate(x_term)$parameter_tbl %>%
               dplyr::filter(method == "NIST_MME") %>%
               dplyr::select(dist_type, shape, scale) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(g, "try-error")) {
      tg <- tidy_gamma(.n = n, .shape = round(g[[2]], rt), .scale = round(g[[3]], rt))
    }

    l <- try(util_logistic_param_estimate(x_term)$parameter_tbl %>%
               dplyr::filter(method == "EnvStats_MME") %>%
               dplyr::select(dist_type, location, scale) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(l, "try-error")) {
      tl <- tidy_logistic(.n = n, .location = round(l[[2]], rt), .scale = round(l[[3]], rt))
    }

    ln <- try(util_lognormal_param_estimate(x_term)$parameter_tbl %>%
                dplyr::filter(method == "EnvStats_MME") %>%
                dplyr::select(dist_type, mean_log, sd_log) %>%
                purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(ln, "try-error")) {
      tln <- tidy_lognormal(.n = n, .meanlog = round(ln[[2]], rt), .sdlog = round(ln[[3]], rt))
    }

    p <- try(util_pareto_param_estimate(x_term)$parameter_tbl %>%
               dplyr::filter(method == "MLE") %>%
               dplyr::select(dist_type, shape, scale) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(p, "try-error")) {
      tp <- tidy_pareto(.n = n, .shape = round(p[[2]], rt), .scale = round(p[[3]], rt))
    }

    u <- try(util_uniform_param_estimate(x_term)$parameter_tbl %>%
               dplyr::filter(method == "NIST_MME") %>%
               dplyr::select(dist_type, min_est, max_est) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(u, "try-error")) {
      tu <- tidy_uniform(.n = n, .min = round(u[[2]], rt), .max = round(u[[3]], rt))
    }

    w <- try(util_weibull_param_estimate(x_term)$parameter_tbl %>%
               dplyr::select(dist_type, shape, scale) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(w, "try-error")) {
      tw <- tidy_weibull(.n = n, .shape = round(w[[2]], rt), .scale = round(w[[3]], rt))
    }

    nn <- try(util_normal_param_estimate(x_term)$parameter_tbl %>%
                dplyr::filter(method == "EnvStats_MME_MLE") %>%
                dplyr::select(dist_type, mu, stan_dev) %>%
                purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(n, "try-error")) {
      tn <- tidy_normal(.n = n, .mean = round(nn[[2]], rt), .sd = round(nn[[3]], rt))
    }

    comp_tbl <- tidy_combine_distributions(
      tidy_empirical(x_term, .distribution_type = dist_type),
      if (exists("tb") && nrow(tb) > 0) {
        tb
      },
      if (exists("tc") && nrow(tc) > 0) {
        tc
      },
      if (exists("te") && nrow(te) > 0) {
        te
      },
      if (exists("tg") && nrow(tg) > 0) {
        tg
      },
      if (exists("tl") && nrow(tl) > 0) {
        tl
      },
      if (exists("tln") && nrow(tln) > 0) {
        tln
      },
      if (exists("tp") && nrow(tp) > 0) {
        tp
      },
      if (exists("tu") && nrow(tu) > 0) {
        tu
      },
      if (exists("tw") && nrow(tw) > 0) {
        tw
      },
      if (exists("tn") && nrow(tn) > 0) {
        tn
      }
    )
  } else {
    bn <- try(util_binomial_param_estimate(trunc(tidy_scale_zero_one_vec(x_term)))$parameter_tbl %>%
                dplyr::select(dist_type, size, prob) %>%
                purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(bn, "try-error")) {
      tb <- tidy_binomial(.n = n, .size = round(bn[[2]], rt), .prob = round(bn[[3]], rt))
    }

    ge <- try(util_geometric_param_estimate(trunc(x_term))$parameter_tbl %>%
                dplyr::filter(method == "EnvStats_MME") %>%
                dplyr::select(dist_type, shape) %>%
                dplyr::mutate(param_2 = NA) %>%
                purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(ge, "try-error")) {
      tg <- tidy_geometric(.n = n, .prob = round(ge[[2]], rt))
    }

    h <- try(util_hypergeometric_param_estimate(.x = trunc(x_term), .total = n, .k = n)$parameter_tbl %>%
               tidyr::drop_na() %>%
               dplyr::slice(1) %>%
               dplyr::select(dist_type, m, total) %>%
               purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(h, "try-error")) {
      th <- tidy_hypergeometric(
        .n = n,
        .m = trunc(h[[2]]),
        .nn = n - trunc(h[[2]]),
        .k = trunc(h[[2]])
      )
    }

    po <- try(util_poisson_param_estimate(trunc(x_term))$parameter_tbl %>%
                dplyr::select(dist_type, lambda) %>%
                dplyr::mutate(param_2 = NA) %>%
                purrr::set_names("dist_type", "param_1", "param_2"))

    if (!inherits(po, "try-error")) {
      tp <- tidy_poisson(.n = n, .lambda = round(po[[2]], rt))
    }

    comp_tbl <- tidy_combine_distributions(
      tidy_empirical(.x = x_term, .distribution_type = dist_type),
      if (exists("tb") && nrow(tb) > 0) {
        tb
      },
      if (exists("tg") && nrow(tg) > 0) {
        tg
      },
      if (exists("th") && nrow(th) > 0) {
        th
      },
      if (exists("tp") && nrow(tp) > 0) {
        tp
      }
    )
  }

  # Deviance calculations ----
  deviance_tbl <- comp_tbl %>%
    dplyr::select(dist_type, x, y) %>%
    dplyr::group_by(dist_type, x) %>%
    dplyr::mutate(y = stats::median(y)) %>%
    dplyr::ungroup() %>%
    dplyr::group_by(dist_type) %>%
    dplyr::mutate(y = tidy_scale_zero_one_vec(y)) %>%
    dplyr::ungroup() %>%
    tidyr::pivot_wider(
      id_cols = x,
      names_from = dist_type,
      values_from = y
    ) %>%
    dplyr::select(x, Empirical, dplyr::everything()) %>%
    dplyr::mutate(
      dplyr::across(
        .cols = -c(x, Empirical),
        .fns = ~ Empirical - .
      )
    ) %>%
    tidyr::drop_na() %>%
    tidyr::pivot_longer(
      cols = -x
    ) %>%
    dplyr::select(-x)

  total_deviance_tbl <- deviance_tbl %>%
    dplyr::filter(!name == "Empirical") %>%
    dplyr::group_by(name) %>%
    dplyr::summarise(total_deviance = sum(value)) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(total_deviance = abs(total_deviance)) %>%
    dplyr::arrange(abs(total_deviance)) %>%
    dplyr::rename(
      dist_with_params = name,
      abs_tot_deviance = total_deviance
    )

  # AIC Data ----
  emp_data_tbl <- comp_tbl %>%
    dplyr::select(dist_type, x, dy) %>%
    dplyr::filter(dist_type == "Empirical")

  aic_tbl <- comp_tbl %>%
    dplyr::filter(!dist_type == "Empirical") %>%
    dplyr::select(dist_type, dy) %>%
    tidyr::nest(data = dy) %>%
    dplyr::mutate(
      lm_model = purrr::map(
        data,
        function(df) stats::lm(dy ~ emp_data_tbl$dy, data = df)
      )
    ) %>%
    dplyr::mutate(aic_value = purrr::map(lm_model, stats::AIC)) %>%
    dplyr::mutate(aic_value = unlist(aic_value)) %>%
    dplyr::mutate(abs_aic = abs(aic_value)) %>%
    dplyr::arrange(abs_aic) %>%
    dplyr::select(dist_type, aic_value, abs_aic)

  ks_tbl <- comp_tbl %>%
    dplyr::filter(dist_type != "Empirical") %>%
    dplyr::select(dist_type, dy) %>%
    tidyr::nest(data = dy) %>%
    dplyr::mutate(
      ks = purrr::map(
        .x = data,
        .f = ~ ks.test(
          x = .x$dy,
          y = emp_data_tbl$dy,
          alternative = "two.sided",
          simulate.p.value = TRUE
        )
      ),
      tidy_ks = purrr::map(ks, broom::tidy)
    ) %>%
    tidyr::unnest(cols = tidy_ks) %>%
    dplyr::select(-c(data, ks)) %>%
    dplyr::mutate(dist_char = as.character(dist_type)) %>%
    purrr::set_names(
      "dist_type", "ks_statistic", "ks_pvalue", "ks_method", "alternative",
      "dist_char"
    )

  multi_metric_tbl <- total_deviance_tbl %>%
    dplyr::mutate(dist_with_params = as.factor(dist_with_params)) %>%
    dplyr::inner_join(aic_tbl, by = c("dist_with_params" = "dist_type")) %>%
    dplyr::inner_join(ks_tbl, by = c("dist_with_params" = "dist_char")) %>%
    dplyr::select(dist_type, dplyr::everything(), -dist_with_params) %>%
    dplyr::mutate(dist_type = as.factor(dist_type))

  # Return ----
  output <- list(
    comparison_tbl         = comp_tbl,
    deviance_tbl           = deviance_tbl,
    total_deviance_tbl     = total_deviance_tbl,
    aic_tbl                = aic_tbl,
    kolmogorov_smirnov_tbl = ks_tbl,
    multi_metric_tbl       = multi_metric_tbl
  )

  # Attributes ----
  attr(deviance_tbl, ".tibble_type") <- "deviance_comparison_tbl"
  attr(total_deviance_tbl, ".tibble_type") <- "deviance_results_tbl"
  attr(aic_tbl, ".tibble_type") <- "aic_tbl"
  attr(comp_tbl, ".tibble_type") <- "comparison_tbl"
  attr(ks_tbl, ".tibble_type") <- "kolmogorov_smirnov_tbl"
  attr(multi_metric_tbl, ".tibble_type") <- "full_metric_tbl"
  attr(output, ".x") <- x_term
  attr(output, ".n") <- n

  # Return ----
  return(invisible(output))
}
rwhitt221 commented 1 year ago

Amazing stuff. Thanks

spsanderson commented 1 year ago

I will add this to the dev package today and work through the submission process, hope to submit tomorrow.