hughjonesd / huxtable

An R package to create styled tables in multiple output formats, with a friendly, modern interface.
http://hughjonesd.github.io/huxtable
Other
321 stars 28 forks source link

Merge all repeated cells (not just by row) #156

Closed sbihorel closed 4 years ago

sbihorel commented 4 years ago

Describe what is missing huxtable does not provide any function to automatically merge repeated cells by column and/or row.

Describe your proposed solution See code below for new proposed function merge_repeated_cells

#' Merge a range of cells with identical content
#'
#' @param ht A huxtable.
#' @param row A row specifier. See \code{\link{rowspecs}} for details. 
#' @param col A column specifier.
#' @param merge_first Either 'row' or 'column' (default). Define how cells are merged first
#'
#' @details
#' `merge_repeated_cells` automatically detects and merge rectangular-shape ranges of cells with identical
#' content. For the purpose of the scan for repeated values, the content of the huxtable is coerced to 
#' character but the original class are maintained in the function output. Nevertheless, you must ensure that
#' any data transformation (ie, rounding) is performed prior to calling `merge_repeated_cells`.
#' 
#' By default, cells are first merged by column then by row. If the opposite is needed, set the 
#' \code{merge_first} argument to 'row'.
#' 
#' @return The `ht` object.
#'
#' @export
#' @examples
#' ht <- hux(
#' var1 = c(1, 2, 3, 3, 5, 6),
#' var2 = c('a', 'b', 'c', 'c', 'e', 'e'),
#' var3 = c('A', 'b', 'c', 'c', 'e', 'f'),
#' var4 = c('g', 'b', 'h', 'i', 'j', 'k'),
#' var5 = c('l', 'l', 'o', 'p', 'j', 'q'),
#' stringsAsFactors = FALSE
#' 
#' ht <- set_all_borders(ht, 1)
#' merge_repeated_cells(ht)
#' merge_repeated_cells(ht, c(2,5))
#' merge_repeated_cells(ht, merge_first = 'row')

merge_repeated_cells <- function(ht, rows, cols, merge_first = c('column', 'row'))  {

  if (missing(ht)){
    stop('ht argument is missing')
  }
  mod_ht <- as.matrix(ht)

  if (missing(rows)){
    rows <- 1:nrow(mod_ht)
  }
  if (missing(cols)){
    cols <- 1:ncol(mod_ht)
  }
  if (!is.numeric(rows)){
    rows <- 1:nrow(mod_ht)
  } else {
    rows <- sort( unique(rows) )
    rows[1] <- max( c(1, min(rows)) )
    rows[length(rows)] <- min( c(nrow(mod_ht), max(rows)) )
  }
  if (!is.numeric(cols)){
    cols <- 1:ncol(mod_ht)
  } else {
    cols <- sort(unique(cols))
    cols[1] <- max( c(1, min(cols)) )
    cols[length(cols)] <- min( c(ncol(mod_ht), max(cols)) )
  }
  merge_first <- match.arg(merge_first)

  all_same <- function(x) {
    xx <- as.vector(x)
    all(xx == xx[1])
  }

  # Transpose mod_ht if necessary
  if (merge_first == 'row'){
    mod_ht <- t(mod_ht)
    tmp <- rows
    rows <- cols
    cols <- tmp
    rm(tmp)
  }

  # Initiate mapping objects
  mapping_matrix <- matrix(
    0,
    nrow = nrow(mod_ht),
    ncol = ncol(mod_ht)
  )
  coords <- list()
  block <- 1

  for (irow in rows){

    for (icol in cols){

      if ( mapping_matrix[irow, icol] == 0 ){

        # Initiate tokens
        hstop <- vstop <- FALSE
        hshift <- vshift <- 1

        # Expansion of search rectangles
        while ( !hstop | !vstop ) {

          # Search algorithm
          if ( (icol + hshift) %in% cols & !hstop ){
            if ( all_same(mod_ht[irow:(irow+vshift-1), icol:(icol+hshift)]) ){
              hshift <- hshift + 1
            } else {
              hstop <- TRUE
            }
          } else {
            hstop <- TRUE
          }

          if ( (irow + vshift) %in% rows & !vstop ){
            if ( all_same(mod_ht[irow:(irow+vshift), icol:(icol+hshift-1)]) ){
              vshift <- vshift + 1
            } else {
              vstop <- TRUE
            }
          } else {
            vstop <- TRUE
          }
        }

        # Store information
        if ( hshift > 1 | vshift > 1 ){
          coords[[block]] <- list(
            rows = irow:(irow + vshift - 1),
            cols = icol:(icol + hshift - 1)
          )
          mapping_matrix[coords[[block]]$rows, coords[[block]]$cols] <- block
          block <- block + 1
        }

      }
    }
  }

  # Merge cells
  if ( length(coords) > 0 ){
    for ( iblock in 1:length(coords) ){
      if (merge_first == 'column'){
        ht <- merge_cells(ht, coords[[iblock]]$rows, coords[[iblock]]$cols) 
      } else {
        ht <- merge_cells(ht, coords[[iblock]]$cols, coords[[iblock]]$rows)
      }
    }
  }

  return(ht)

}

Target audience All users

Additional context

sbihorel commented 4 years ago

Sorry @hughjonesd, I don't know how to do pull request, so I posted my suggested code as a feature issue.

hughjonesd commented 4 years ago

Hey, that's cool and high effort. Thank you.

Doing a pull request is going to make it much easier to test, review and (perhaps!) integrate this, and it's a useful skill, so I'll tell you how.

  1. Fork my repository to your own github account. You can do that using the "Fork" button right on this page.
  2. On your command line do "git clone " - you can copy and paste the path from your newly created repository page.
  3. Jump into RStudio (there's an .Rproj file for the project).
  4. Create a new branch and call it something appropriate, with the little purple boxes icon on the "Git" tab.
  5. Make your changes.
  6. Push to github.
  7. Go to github and click "pull request".

Next up, I'll look at this code and give you my thoughts.

hughjonesd commented 4 years ago

So my first question is, what's the use case? When did you need this function? Who else would need it?

The use case for merge_repeated_rows() is that often, a particular column reflects some grouping data that could be represented in a huxtable by a single piece of text. For example, look at the Species column in iris.

I can see something similar being true across a particular row, but it seems rarer. For most data, especially if it's in "tidy" format, different columns represent fundamentally different pieces of data. The exception is data in wide format where you might have e.g. "rank in week 1", "rank in week 2", etc. Though there again, it's not clear you'd often want to merge across those columns – it would be a very specific use case.

This is something more general again. I wonder how often people will have a "square" of data that they want to turn into a single cell.

My second comment is, the function is long and complex. I can see it being hard to maintain... and huxtable has a lot of hard-to-maintain code already, written by yours truly :-) ....

Overall then, I'm not convinced of this. But I do appreciate the effort. A simpler merge_repeated_cols (or columns? Hmm...) might have a better chance of getting in. Another project I'm thinking about is something to take a "grouping" column and turn it into a header row running across the top of the group. (I think the gt package has something similar.)

A few minor points:

sbihorel commented 4 years ago

Hi David,

Thanks for the step-by-step instructions on github pull requests. I will look into it.

Regarding your questions:

Cheers

Sebastien

On Sun, Jun 7, 2020 at 1:48 AM David Hugh-Jones notifications@github.com wrote:

So my first question is, what's the use case? When did you need this function? Who else would need it?

The use case for merge_repeated_rows() is that often, a particular column reflects some grouping data that could be represented in a huxtable by a single piece of text. For example, look at the Species column in iris.

I can see something similar being true across a particular row, but it seems rarer. For most data, especially if it's in "tidy" format, different columns represent fundamentally different pieces of data. The exception is data in wide format where you might have e.g. "rank in week 1", "rank in week 2", etc. Though there again, it's not clear you'd often want to merge across those columns – it would be a very specific use case.

This is something more general again. I wonder how often people will have a "square" of data that they want to turn into a single cell.

My second comment is, the function is long and complex. I can see it being hard to maintain... and huxtable has a lot of hard-to-maintain code already, written by yours truly :-) ....

Overall then, I'm not convinced of this. But I do appreciate the effort. A simpler merge_repeated_cols (or columns? Hmm...) might have a better chance of getting in. Another project I'm thinking about is something to take a "grouping" column and turn it into a header row running across the top of the group. (I think the gt package has something similar.)

A few minor points:

  • use assertthat to check arguments on your way in.
  • rows and columns should accept arguments in huxtable style, see merge_repeated_rows for one way to do this.
  • Why would it matter whether you merge columns or rows first?
  • If rows or cols are outside the range of the huxtable, I'd throw an error rather than trying to guess the user's intention. get_rc_spec will do that for you, again, see merge_repeated_rows.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/hughjonesd/huxtable/issues/156#issuecomment-640161543, or unsubscribe https://github.com/notifications/unsubscribe-auth/AETTNP6CK4O3C4DTGTDGT3DRVMS2LANCNFSM4NV7LPCA .

sbihorel commented 4 years ago

Hi David,

Correct me if I am wrong but I don't think that get_rc_specs errors out when coordinates are outside the dimension of the huxtable:

ht <- hux( a = 1:2, b = 3:4 ) rows <- 4:8 cols <- 3:5

Still returns 4:8

huxtable:::get_rc_spec(ht, rows, 1)

Still returns 3:5

huxtable:::get_rc_spec(ht, cols, 1)

Executed example:

ht <- hux(

  • a = 1:2,
  • b = 3:4
  • ) rows <- 4:8 cols <- 3:5

    Still returns 4:8

    huxtable:::get_rc_spec(ht, rows, 1) [1] 4 5 6 7 8

    Still returns 3:5

    huxtable:::get_rc_spec(ht, cols, 1) [1] 3 4 5 sessionInfo() R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Linux Mint 19.3

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] jsonlite_1.6.1 huxtable_4.7.1

loaded via a namespace (and not attached): [1] zip_2.0.4 Rcpp_1.0.4.6 pillar_1.4.3 compiler_3.6.3 base64enc_0.1-3 tools_3.6.3 digest_0.6.25 uuid_0.1-4 evaluate_0.14 lifecycle_0.2.0 [11] tibble_3.0.1 lattice_0.20-41 pkgconfig_2.0.3 rlang_0.4.6 rstudioapi_0.11 xfun_0.13 officer_0.3.11 dplyr_0.8.5 stringr_1.4.0 knitr_1.28 [21] xml2_1.3.2 generics_0.0.2 gdtools_0.2.2 vctrs_0.2.4 systemfonts_0.2.2 grid_3.6.3 tidyselect_1.0.0 glue_1.4.0 data.table_1.12.8 R6_2.4.1 [31] rmarkdown_2.1 purrr_0.3.4 magrittr_1.5 ellipsis_0.3.0 htmltools_0.4.0 assertthat_0.2.1 flextable_0.5.10 stringi_1.4.6 crayon_1.3.4 zoo_1.8-8

On Sun, Jun 7, 2020 at 1:48 AM David Hugh-Jones notifications@github.com wrote:

So my first question is, what's the use case? When did you need this function? Who else would need it?

The use case for merge_repeated_rows() is that often, a particular column reflects some grouping data that could be represented in a huxtable by a single piece of text. For example, look at the Species column in iris.

I can see something similar being true across a particular row, but it seems rarer. For most data, especially if it's in "tidy" format, different columns represent fundamentally different pieces of data. The exception is data in wide format where you might have e.g. "rank in week 1", "rank in week 2", etc. Though there again, it's not clear you'd often want to merge across those columns – it would be a very specific use case.

This is something more general again. I wonder how often people will have a "square" of data that they want to turn into a single cell.

My second comment is, the function is long and complex. I can see it being hard to maintain... and huxtable has a lot of hard-to-maintain code already, written by yours truly :-) ....

Overall then, I'm not convinced of this. But I do appreciate the effort. A simpler merge_repeated_cols (or columns? Hmm...) might have a better chance of getting in. Another project I'm thinking about is something to take a "grouping" column and turn it into a header row running across the top of the group. (I think the gt package has something similar.)

A few minor points:

  • use assertthat to check arguments on your way in.
  • rows and columns should accept arguments in huxtable style, see merge_repeated_rows for one way to do this.
  • Why would it matter whether you merge columns or rows first?
  • If rows or cols are outside the range of the huxtable, I'd throw an error rather than trying to guess the user's intention. get_rc_spec will do that for you, again, see merge_repeated_rows.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/hughjonesd/huxtable/issues/156#issuecomment-640161543, or unsubscribe https://github.com/notifications/unsubscribe-auth/AETTNP6CK4O3C4DTGTDGT3DRVMS2LANCNFSM4NV7LPCA .

hughjonesd commented 4 years ago

I understand. I think this probably isn't useful for most guys. Having seen your example, I wonder whether running merge_repeated_rows() followed by merge_repeated_columns() (written along the same lines) would do what you need. You'd want to avoid huxtable complaining about overwriting existing multicolumn cells, which might be tricky. Incidentally, if you want to play this, t(merge_repeated_rows(t(hux))) is a quick hack for merge_repeated_columns().

Thank you for that last point. I wonder if I should fix it. Presumably errors are gonna turn up somewhere....