Closed AlanFeder closed 2 years ago
Hi, sorry for a slow reply. I think this will be possible.
Notes to self: ECMA p1600 the col
and row
elements have an outlineLevel
attribute between 1
and 7
, recorded in the same way as custom col widths and row heights.
A branch is ready to be tried. It can be installed with remotes::install_github("r-lib/pkgconfig#7")
.
How do you envisage using this? The tree-like structure that Excel displays is unfortunately not represented in the data. Each row and column only records its own depth in the tree, but not its relation to its neighbours.
Thanks! I'll try to take a look over the next few days.
I envision using it by extracting out each "level" into its own data frame, and then trying to dplyr::left_join()
them (maybe with fuzzy matching on row numbers or something like that) into one table...alternatively, make a whole list of data frames (one for each 'section' in the lowest level), using a for-loop to attach on the correct "higher-level" names, and then dplyr::bind_rows()
to combine them.
It's sorta how I used it in the past, where I came up with some convoluted logic based on the color/indentation of the cell - so this would simplify the logic/code, but probably not fundamentally change the logic (for now)
I'm getting a slightly weird result -- I'm not sure if its fixable or if its inherent to Excel.
row_outline_level
is actually col_outline_level
, and vice-versa. If this is just a naming convention by Excel, that's fine -- but I wanted to point that out
col_outline_level
and each col has the same row_outline_level
...I would have thought it the other wayxlsx_cells()
only identifies two of them -- the middle level is "1" and the top and bottom levels are both "2"
row_outline_level
"3", while rows B and C are row_outline_level
s 1 and 2, respectively. By contrast, rows 1-4 each have distinct col_outline_level
s (of 0-3) as I'd expect. My data had similar issues, except my rows had the issue shown by the columns here (and I didn't have column groupings)Thanks again for all your hard work -- I really appreciate all your effort, and the digging into whatever happens under excel
Oops, I think I've fixed those two mistakes. Thanks for checking.
Thanks! The new version works well and indeed fixed the issue!
One last suggestion (although I have no idea if it is difficult/feasible):
I noted that the "top level" of the outline is 0, as is the indicator for any row/col that is not in an outline at all. Is there a way to change this, so that we can distinguish when something is not in an outline at all? (either with NA
or having the outline levels be 1-4 instead of 0-3 for a 4-level outline)
If not, this is still extremely useful, and will definitely take advantage -- thanks for your great work!
Levels are already 1-4 for a 4-level outline. Do you see something different?
I ran the following from the root of this the repo at this branch.
print(tidyxl::xlsx_cells("tests/testthat/outlines.xlsx")[, c("sheet", "row", "col", "numeric", "row_outline_level", "col_outline_level")], n = Inf)
When I ran your code, I saw the numbers of 0-3 -- even though, when I open the Excel file, I see the numbers 1-4 in the "outline" area.
Am I misunderstanding what the results should be?
I see, the button numbering from 1 to 4 implies that level 1 includes every row/col. It's recorded as level zero in the specification, but I'll add 1 to every level.
Thanks for testing this. Feel free to reopen if the fix didn't work.
Hi,
I am sometimes sent Excel files by contributors who group rows, using the "Outline" Function in Excel ( as explained here). Is there a way to identify the outline levels in a document?
My ultimate goal is to "un-outline" the data - and do primary analysis on the lowest level, but keeping the ability to do my own
group_by() %>% summarize()
with the higher level groupings.Thank you! This is a great package!