nacnudus / tidyxl

Read untidy Excel files in R https://nacnudus.github.io/tidyxl/
https://nacnudus.github.io/tidyxl/
Other
246 stars 20 forks source link

Identify Grouped Rows in Excel #77

Closed AlanFeder closed 2 years ago

AlanFeder commented 2 years ago

Hi,

I am sometimes sent Excel files by contributors who group rows, using the "Outline" Function in Excel ( as explained here). Is there a way to identify the outline levels in a document?
My ultimate goal is to "un-outline" the data - and do primary analysis on the lowest level, but keeping the ability to do my own group_by() %>% summarize() with the higher level groupings.

Thank you! This is a great package!

nacnudus commented 2 years ago

Hi, sorry for a slow reply. I think this will be possible.

Notes to self: ECMA p1600 the col and row elements have an outlineLevel attribute between 1 and 7, recorded in the same way as custom col widths and row heights.

nacnudus commented 2 years ago

A branch is ready to be tried. It can be installed with remotes::install_github("r-lib/pkgconfig#7").

How do you envisage using this? The tree-like structure that Excel displays is unfortunately not represented in the data. Each row and column only records its own depth in the tree, but not its relation to its neighbours.

AlanFeder commented 2 years ago

Thanks! I'll try to take a look over the next few days.

I envision using it by extracting out each "level" into its own data frame, and then trying to dplyr::left_join() them (maybe with fuzzy matching on row numbers or something like that) into one table...alternatively, make a whole list of data frames (one for each 'section' in the lowest level), using a for-loop to attach on the correct "higher-level" names, and then dplyr::bind_rows() to combine them.
It's sorta how I used it in the past, where I came up with some convoluted logic based on the color/indentation of the cell - so this would simplify the logic/code, but probably not fundamentally change the logic (for now)

AlanFeder commented 2 years ago

I'm getting a slightly weird result -- I'm not sure if its fixable or if its inherent to Excel.

  1. What I would have thought would be row_outline_level is actually col_outline_level, and vice-versa. If this is just a naming convention by Excel, that's fine -- but I wanted to point that out
    • note that in your testthat outline.xlsx example for sheet "outline levels", note that each row has the same col_outline_level and each col has the same row_outline_level...I would have thought it the other way
  2. My data has three levels, but xlsx_cells() only identifies two of them -- the middle level is "1" and the top and bottom levels are both "2"
    • note that in your testthat outline.xlsx example for sheet "outline levels", col A and col D each are row_outline_level "3", while rows B and C are row_outline_levels 1 and 2, respectively. By contrast, rows 1-4 each have distinct col_outline_levels (of 0-3) as I'd expect. My data had similar issues, except my rows had the issue shown by the columns here (and I didn't have column groupings)

Thanks again for all your hard work -- I really appreciate all your effort, and the digging into whatever happens under excel

nacnudus commented 2 years ago

Oops, I think I've fixed those two mistakes. Thanks for checking.

AlanFeder commented 2 years ago

Thanks! The new version works well and indeed fixed the issue!

One last suggestion (although I have no idea if it is difficult/feasible):

I noted that the "top level" of the outline is 0, as is the indicator for any row/col that is not in an outline at all. Is there a way to change this, so that we can distinguish when something is not in an outline at all? (either with NA or having the outline levels be 1-4 instead of 0-3 for a 4-level outline)

If not, this is still extremely useful, and will definitely take advantage -- thanks for your great work!

nacnudus commented 2 years ago

Levels are already 1-4 for a 4-level outline. Do you see something different?

I ran the following from the root of this the repo at this branch.

print(tidyxl::xlsx_cells("tests/testthat/outlines.xlsx")[, c("sheet", "row", "col", "numeric", "row_outline_level", "col_outline_level")], n = Inf)
AlanFeder commented 2 years ago

When I ran your code, I saw the numbers of 0-3 -- even though, when I open the Excel file, I see the numbers 1-4 in the "outline" area.

Am I misunderstanding what the results should be?

nacnudus commented 2 years ago

I see, the button numbering from 1 to 4 implies that level 1 includes every row/col. It's recorded as level zero in the specification, but I'll add 1 to every level.

nacnudus commented 2 years ago

Thanks for testing this. Feel free to reopen if the fix didn't work.