Open WilDoane opened 5 years ago
It turns out this is not a small task.
For the time being, I suggest that you take a look at the tidyxl/unpivotr pair of packages. Their maintainer @nacnudus recently shared a screencast of doing exactly this.
https://www.youtube.com/watch?v=Q_McYaDV9H0
At least I think that's the one where he does this. If I'm wrong, he can straighten us out.
Thanks @jennybc that's the one.
There are a couple of extra gotchas in this set of tables that I haven't got around to screencasting/writing about yet. One is the superscript letters to indicate footnotes -- they only appear twice here ("All doctorate recipientsa", "Otherb") so I wouldn't bother doing anything fancy to avoid them. The other is that while "Life sciences" is indented by 1 under "All doctorate recipientsa", it is indented by 2 under "Male" and "Female". The easiest way to deal with that is to subtract 1 from the indentation of every row from "Male" below.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyxl)
library(unpivotr)
path <- "~/Downloads/sed17-sr-tab015.xlsx"
cells <- xlsx_cells(path)
formats <- xlsx_formats(path)
indent <- formats$local$alignment$indent
tidy <-
cells %>%
dplyr::filter(row >= 4L, col <= 11L, !is_blank) %>%
mutate(indent = indent[local_format_id],
# Move headers in rows >=50 left one level of indentation
indent = if_else(row >= 50L, indent - 1L, indent)) %>%
behead("N", "year") %>%
behead_if(indent == 0L, direction = "WNW", name = "sex") %>%
behead_if(indent == 1L, direction = "WNW", name = "field") %>%
behead("W", "subfield") %>%
arrange(col, row) %>%
select(sex, field, subfield, year, value = numeric) %>%
print()
#> # A tibble: 1,350 x 5
#> sex field subfield year value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 All doctorate… <NA> <NA> 2008 48777
#> 2 All doctorate… Life sciences <NA> 2008 11086
#> 3 All doctorate… Life sciences Agricultural sciences and… 2008 1198
#> 4 All doctorate… Life sciences Biological and biomedical… 2008 7797
#> 5 All doctorate… Life sciences Health sciences 2008 2091
#> 6 All doctorate… Physical sciences… <NA> 2008 4946
#> 7 All doctorate… Physical sciences… Chemistry 2008 2246
#> 8 All doctorate… Physical sciences… Geosciences, atmospheric … 2008 865
#> 9 All doctorate… Physical sciences… Physics and astronomy 2008 1835
#> 10 All doctorate… Mathematics and c… <NA> 2008 3187
#> # … with 1,340 more rows
Created on 2019-03-01 by the reprex package (v0.2.0.9000).
The XLSX format allows cells to be "indented", which is usually used to denote hierarchical data. See for example NSF data tables such as https://ncses.nsf.gov/pubs/nsf19301/assets/data/tables/sed17-sr-tab015.xlsx
This indenting data is ignored by read_xlsx, making it difficult to capture the nestings.
Those column labels don't reflect the category > subcategory > sub-subcategory structure of the data. (system isn't letting me upload graphic)
Desired Behavior Have loaded data reflect indentation levels, possibly by prepending TAB or SPACE characters (which I recognize would conflict, if trim_ws = TRUE)