tidyverse / readxl

Read excel files (.xls and .xlsx) into R đź–‡
https://readxl.tidyverse.org
Other
725 stars 192 forks source link

read_xlsx can't detect indentation applied via a format #557

Open WilDoane opened 5 years ago

WilDoane commented 5 years ago

The XLSX format allows cells to be "indented", which is usually used to denote hierarchical data. See for example NSF data tables such as https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab015.xlsx

This indenting data is ignored by read_xlsx, making it difficult to capture the nestings.

library(readxl)
library(dplyr)

download.file("https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab015.xlsx",
              destfile = "local.xlsx", mode = "wb")

read_xlsx("local.xlsx", skip = 3) %>% 
  select(1)

Those column labels don't reflect the category > subcategory > sub-subcategory structure of the data. (system isn't letting me upload graphic)

Desired Behavior Have loaded data reflect indentation levels, possibly by prepending TAB or SPACE characters (which I recognize would conflict, if trim_ws = TRUE)

jennybc commented 5 years ago

It turns out this is not a small task.

For the time being, I suggest that you take a look at the tidyxl/unpivotr pair of packages. Their maintainer @nacnudus recently shared a screencast of doing exactly this.

https://www.youtube.com/watch?v=Q_McYaDV9H0

At least I think that's the one where he does this. If I'm wrong, he can straighten us out.

nacnudus commented 5 years ago

Thanks @jennybc that's the one.

There are a couple of extra gotchas in this set of tables that I haven't got around to screencasting/writing about yet. One is the superscript letters to indicate footnotes -- they only appear twice here ("All doctorate recipientsa", "Otherb") so I wouldn't bother doing anything fancy to avoid them. The other is that while "Life sciences" is indented by 1 under "All doctorate recipientsa", it is indented by 2 under "Male" and "Female". The easiest way to deal with that is to subtract 1 from the indentation of every row from "Male" below.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyxl)
library(unpivotr)

path <- "~/Downloads/sed17-sr-tab015.xlsx"

cells <- xlsx_cells(path)
formats <- xlsx_formats(path)
indent <- formats$local$alignment$indent

tidy <-
  cells %>%
  dplyr::filter(row >= 4L, col <= 11L, !is_blank) %>%
  mutate(indent = indent[local_format_id],
         # Move headers in rows >=50 left one level of indentation
         indent = if_else(row >= 50L, indent - 1L, indent)) %>%
  behead("N", "year") %>%
  behead_if(indent == 0L, direction = "WNW", name = "sex") %>%
  behead_if(indent == 1L, direction = "WNW", name = "field") %>%
  behead("W", "subfield") %>%
  arrange(col, row) %>%
  select(sex, field, subfield, year, value = numeric) %>%
  print()
#> # A tibble: 1,350 x 5
#>    sex            field              subfield                   year  value
#>    <chr>          <chr>              <chr>                      <chr> <dbl>
#>  1 All doctorate… <NA>               <NA>                       2008  48777
#>  2 All doctorate… Life sciences      <NA>                       2008  11086
#>  3 All doctorate… Life sciences      Agricultural sciences and… 2008   1198
#>  4 All doctorate… Life sciences      Biological and biomedical… 2008   7797
#>  5 All doctorate… Life sciences      Health sciences            2008   2091
#>  6 All doctorate… Physical sciences… <NA>                       2008   4946
#>  7 All doctorate… Physical sciences… Chemistry                  2008   2246
#>  8 All doctorate… Physical sciences… Geosciences, atmospheric … 2008    865
#>  9 All doctorate… Physical sciences… Physics and astronomy      2008   1835
#> 10 All doctorate… Mathematics and c… <NA>                       2008   3187
#> # … with 1,340 more rows

Created on 2019-03-01 by the reprex package (v0.2.0.9000).