nacnudus / unpivotr

Unpivot complex and irregular data layouts in R
https://nacnudus.github.io/unpivotr/
Other
185 stars 19 forks source link

enhead appears not to work as expected with NNW #20

Closed dmcalli2 closed 5 years ago

dmcalli2 commented 5 years ago

Hello, Great package! Thanks for making this and for tidyxl. I have read the excellent munging strategies document, but for the following example table enhead did not behave as I expected.

row/col 1(A) 2(B) 3(C)
1 NA MMSE 12-26 NA
2 NA Placebo Simvastatin
3 Measures (n = 17) (n = 20)
4 Serum NA NA
5 ALDL cholesterol -5.9 ± 16.5 -51.9 ± 16.2“

In the above table, MMSE 12-26 applies to columns 2 and 3. Accordingly, I used NNW but it didn't work.

The following reproducible example shows the behaviour of enhead, as well as my own crude function to produce the behaviour I desired.

Apologies if I have misunderstood.

Minimal example

library(unpivotr)
library(tidyverse)

## Create data
my_cells <- tibble(
  row = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L),
  col = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  data_type = c("blank", "character", "blank", "blank", "character", "character", 
                "character", "character", "character", "character", "blank", 
                "blank", "character", "character", "character"),
  character = c(NA, "MMSE 12-26", NA, NA, "Placebo", "Simvastatin", "Measures", 
                "(n = 17)", "(n = 20)", "Serum", NA, NA, "ALDL cholesterol", 
                "-5.9 ± 16.5", "-51.9 ± 16.2“")
)
## Display data in excel table format
rectify(my_cells)

## Select data from within table
my_data_cells <- my_cells %>% 
  filter(row>3, col>1) %>% 
  select(row, col, character)
rectify(my_cells)

## Select first row as a header from within tables
my_header_cells <- my_cells %>% 
  filter(row == 1) 

## Use enhead to join the header to the table
add_header <- enhead(my_data_cells, my_header_cells, direction = "NNW")

## Create simple looping function to produce desired behaviour
NNW <- function(data_cells, header_cells) {
  if("character" %in% names(data_cells)) names(data_cells)[names(data_cells) == "character"] <- "value"

  d <- distinct(data_cells, col) %>% 
    mutate(col_h = NA)

  h <- header_cells %>% 
    rename(col_h = col) %>% 
    select(col_h, character) %>% 
    filter(!is.na(character))

  for(col_data in d$col){
    col_test <- col_data
    while(col_test > 0L){
      if(col_test %in% h$col_h) {
        d$col_h[d$col == col_data] <- col_test
        break()
      }
        col_test <- col_test -1

    }

  }

   data_cells %>% 
    inner_join(d) %>% 
    inner_join(h) %>% 
     select(-col_h)
}
## Run simple function
add_header_manual <- NNW(my_data_cells, my_header_cells)

## compare outputs
### enhead version
add_header
### crude version
add_header_manual
nacnudus commented 5 years ago

Hi @dmcalli2,

Thank you for your kind words and the reproducible example. You might enjoy using the reprex package, which makes it easy to include the outputs as well as the code.

Your code was very nearly there -- I have added a couple of steps to rename columns so that you don't get character.x and character.y columns at the end, and one step filter out NA cells from the headers, which is why the "NNW" direction didn't appear to work.

I hope to get time to improve Spreadsheet Munging Strategies, and if you could open an issue for a specific section that would you help you, that would help me too :) In the mean time I am making some videos.

library(unpivotr)
library(tidyverse)

## Create data
my_cells <- tibble(
  row = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L),
  col = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  data_type = c("blank", "character", "blank", "blank", "character", "character",
                "character", "character", "character", "character", "blank",
                "blank", "character", "character", "character"),
  character = c(NA, "MMSE 12-26", NA, NA, "Placebo", "Simvastatin", "Measures",
                "(n = 17)", "(n = 20)", "Serum", NA, NA, "ALDL cholesterol",
                "-5.9 ± 16.5", "-51.9 ± 16.2“")
)
## Display data in excel table format
rectify(my_cells)
#> # A tibble: 5 x 4
#>   `row/col` `1(A)`           `2(B)`      `3(C)`       
#>       <int> <chr>            <chr>       <chr>        
#> 1         1 <NA>             MMSE 12-26  <NA>         
#> 2         2 <NA>             Placebo     Simvastatin  
#> 3         3 Measures         (n = 17)    (n = 20)     
#> 4         4 Serum            <NA>        <NA>         
#> 5         5 ALDL cholesterol -5.9 ± 16.5 -51.9 ± 16.2“

## Select data from within table
my_data_cells <- my_cells %>%
  filter(row>3, col>1) %>%
  select(row, col, value = character) # rename the columns
my_data_cells
#> # A tibble: 4 x 3
#>     row   col value        
#>   <int> <int> <chr>        
#> 1     4     2 <NA>         
#> 2     4     3 <NA>         
#> 3     5     2 -5.9 ± 16.5  
#> 4     5     3 -51.9 ± 16.2“
rectify(my_data_cells, value)
#> # A tibble: 2 x 3
#>   `row/col` `2(B)`      `3(C)`       
#>       <int> <chr>       <chr>        
#> 1         4 <NA>        <NA>         
#> 2         5 -5.9 ± 16.5 -51.9 ± 16.2“

## Select first row as a header from within tables
my_header_cells <- my_cells %>%
  filter(row == 1, !is.na(character)) %>% # filter out NA cells
  select(row, col, header = character) # rename the columns
my_header_cells
#> # A tibble: 1 x 3
#>     row   col header    
#>   <int> <int> <chr>     
#> 1     1     2 MMSE 12-26

## Use enhead to join the header to the table
add_header <- enhead(my_data_cells, my_header_cells, direction = "NNW")
add_header
#> # A tibble: 4 x 4
#>     row   col value         header    
#>   <int> <int> <chr>         <chr>     
#> 1     4     2 <NA>          MMSE 12-26
#> 2     4     3 <NA>          MMSE 12-26
#> 3     5     2 -5.9 ± 16.5   MMSE 12-26
#> 4     5     3 -51.9 ± 16.2“ MMSE 12-26

Created on 2018-12-19 by the reprex package (v0.2.0.9000).