Invalid number of breaks from hist() call in get_cog_distributions()

ataustin commented 6 years ago

This is surely an edge case, but one of my derived cognostics produced this error when building a trelliscope display: Error in hist.default(x, breaks = brks, plot = FALSE) : invalid number of 'breaks'

I traced it back to get_cog_distributions, where in my case breaks was a single-element vector. The problem seems to occur when breaks is between 0 and 1.

Here is a reproducible example:

library(trelliscopejs)
library(tidyverse)

df <- data.frame(group = letters[1:10],
                 val1  = rep(c(0, 0.9), times = c(9, 1))
                 )

df %>%
  group_by(group) %>%
  nest(.key = "data") %>%
  mutate(panel = map_plot(data, function(x) ggplot(x, aes(val1)) + geom_histogram())) %>%
  trelliscope(name = "test",
              path = "output")

Here is some system information:

> packageVersion("trelliscopejs")
[1] '0.1.11'
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.4 (Maipo)

locale:
 [1] LC_CTYPE=en_US.ISO8859-1   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] bindrcpp_0.2         dplyr_0.7.4          purrr_0.2.4
 [4] readr_1.1.1          tidyr_0.7.2          tibble_1.3.4
 [7] ggplot2_2.2.1        tidyverse_1.0.0      trelliscopejs_0.1.11
[10] colorout_1.1-3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.14            bindr_0.1               knitr_1.17
 [4] magrittr_1.5            hms_0.3                 progress_1.1.2
 [7] tidyselect_0.2.3        munsell_0.4.3           colorspace_1.3-1
[10] R6_2.2.2                rlang_0.1.4             plyr_1.8.4
[13] tools_3.3.2             webshot_0.5.0           grid_3.3.2
[16] gtable_0.2.0            htmltools_0.3.6         lazyeval_0.2.1
[19] assertthat_0.2.0        digest_0.6.12           DistributionUtils_0.5-1
[22] base64enc_0.1-3         glue_1.2.0              labeling_0.3
[25] scales_0.5.0            prettyunits_1.0.2       jsonlite_1.5
[28] pkgconfig_2.0.1

beansrowning commented 6 years ago

I have this same error.

Here is the relevant code block:

else if (type == "numeric") {
      # n0 <- length(which(x == 0))
      # log <- FALSE
      skw <- DistributionUtils::skewness(x, na.rm = TRUE)

      hst <- graphics::hist(x, plot = FALSE)
      res <- list(
        type = type,
        dist = list(
          raw = list(
            breaks = hst$breaks,
            freq = hst$counts
          )
        )
      )
      res$log_default <- FALSE
      if (!is.nan(skw) && skw > 1.5 && all(x >= 0, na.rm = TRUE)) {
        # log <- TRUE
        x <- x[x > 0]
        x2 <- log10(x)
        rng <- range(x2, na.rm = TRUE)
        brks <- 10 ^ seq(rng[1], rng[2], length = grDevices::nclass.Sturges(x))
        lhst <- hist(x, breaks = brks, plot = FALSE)
        res$dist$log <- list(
          breaks = lhst$breaks,
          freq = lhst$counts
        )
        res$log_default <- TRUE
      }
    }

In this case, there is only 1 non-zero entry. Skewedness is therefore > 1.5, but when x <- x[x > 0] is evaluated, it will return a single number with which to produce breaks for the histogram, which will throw an error since you need more than one.

One fix could be to test the length of x[x > 0], and have a separate case if it is 1.

ataustin commented 6 years ago

hist() is a little weird in that it will accept a length-1 vector and interpret it as the number of cells in the histogram, rather than the break locations. For length-1 breaks where the value of breaks is larger than 1, it rounds down to the nearest integer and uses that as the number of cells. But for breaks < 1 it gives the error.

> x <- 1:10
> hist(x, breaks = 1, plot = FALSE)$breaks
## [1]  0 10
>
> hist(x, breaks = 2.2, plot = FALSE)$breaks
## [1]  0  5 10
>
> hist(x, breaks = 0.4, plot = FALSE)$breaks
## Error in hist.default(x, breaks = 0.4, plot = FALSE) :
##   invalid number of 'breaks'
>
> hist(x, breaks = -2, plot = FALSE)$breaks
## Error in hist.default(x, breaks = -2, plot = FALSE) :
##   invalid number of 'breaks'

beansrowning commented 6 years ago

Ah yes, you are quite right. I suppose that behaviour makes it a bit odd to program with.

Still, I think the easiest fix would be to just keep whichever histogram came out at the beginning if there is only one value. Maybe there would be some side effects to that performance-wise or otherwise, but it could be as simple as :

if (!is.nan(skw) && skw > 1.5 && all(x >= 0, na.rm = TRUE) && length(x[x>0] > 1)

Happy to test that or a better fix on my machine later and submit a PR.

ataustin commented 6 years ago

You would be my hero. I haven't had the time to do a PR but the bug is causing us some grief at the office.

hafen commented 6 years ago

Thank you! Yes please do test and submit a PR. Much appreciated. I apologize for being slow on this issue!

ataustin commented 6 years ago

@hafen no worries, trelliscopejs is :1st_place_medal:! I'm presenting on it to my colleagues tomorrow. :smiley:

beansrowning commented 6 years ago

@hafen no worries at all, we are all busy! I went ahead and submitted a PR.

ataustin commented 6 years ago

Closed by https://github.com/hafen/trelliscopejs/pull/56 :fireworks:

hafen / trelliscopejs

Invalid number of breaks from hist() call in get_cog_distributions() #55