To enable merged cells - Githubissues

aborruso commented 9 months ago

Proposal

Sometimes I have spreadsheets with this kind of table:

+---------------------+----------+
| Property            | Earth    |
+=============+=======+==========+
|             | min   | -89.2 °C |
| Temperature +-------+----------+
| 1961-1990   | mean  | 14 °C    |
|             +-------+----------+
|             | max   | 56.7 °C  |
+-------------+-------+----------+

When I need to use in quarto, it's difficult to do it, because I do not have a tool to convert in example a xlsx file to this kind of pandoc grid_tables. There is no way to keep the merged cells.

Using python, a tool to do it could be great-tables: I could read the xlsx table using pandas and then use great-tables to render properly an HTML table with rowspan.

Thank you to @cderv for the idea.

machow commented 9 months ago

Thanks for opening this issue. I've added the data from your example as an xlsx file, so we can explore a bit!

example-gt.xlsx

I think that in general, it seems like there are two challenges in this issue:

reading the spreadsheet into a DataFrame and tidying it
outputting a grid table from that DataFrame

Reading and tidying

Here we read the excel spreadsheet using pandas.

import pandas as pd

df = pd.read_excel("https://github.com/posit-dev/great-tables/files/13821956/example-gt.xlsx")

                Property Unnamed: 1     Earth
0  Temperature 1961-1990        min  -89.2 °C
1                    NaN       mean     14 °C
2                    NaN        max   56.7 °C

Note that the merged rows in the Property column ended up with NaNs in them, and there is an unnamed column. To fix this, we can forward fill and do a rename.

tidy = df.assign(Property = df["Property"].ffill()).rename(columns={"Unnamed: 1": "Measure"})
tidy

                Property Measure     Earth
0  Temperature 1961-1990     min  -89.2 °C
1  Temperature 1961-1990    mean     14 °C
2  Temperature 1961-1990     max   56.7 °C

Outputting as a grid

With the data tidied, we can produce something like a grid table, by setting the rowname_col="Measure", and groupname_col="Property".

import pandas as pd
from great_tables import GT

df = pd.read_excel("https://github.com/posit-dev/great-tables/files/13821956/example-gt.xlsx")
tidy = df.assign(Property = df["Property"].ffill()).rename(columns={"Unnamed: 1": "Measure"})

gt = GT(tidy, rowname_col="Measure", groupname_col="Property")
gt

Note that it differs a bit from you example output though. I wonder if there's another way this is supported in the R gt library?

aborruso commented 9 months ago

Thank you very much @machow very useful and didactic. By also doing it directly in excel, I was somehow already able to.

I like this result of yours and it is very readable, but I would like to see if it is useful (also for this great project of yours), to introduce something like groupname_row, something that allows you to group rows. Just a thought.

I wonder if there's another way this is supported in the R gt library?

I do not know if there is a gt way, but I know this other R way created by @Steinthal.

Steinthal commented 9 months ago

Well, @machow, I’d like to caution you that the example given be @aborruso is quite specific.

What if there’s another set of combined cells for another period again with min/mean/max values and those values shall be presented directly above or below those of the earlier period? Then the arrangement presented or requested by @aborruso would make more sense than placing the period above the values.

What I mean is: There are different possible ways to present data depending on the amount of data and requirements – and a solution should be flexible enough to support them all (or at least most of them).

machow commented 9 months ago

@Steinthal if I understand correctly, are you saying that something like this...

import pandas as pd
from great_tables import GT

df = pd.DataFrame({
    "Property": ["Temp 1961"] * 2 + ["Temp 1962"] * 2,
    "Measure": ["min", "max"] * 2,
    "Earth": [1, 2, 3, 4]
})

GT(df, rowname_col="Measure", groupname_col="Period")

Isn't ideal because the property values (e.g. Temp 1961) shouldn't go across the full dataframe? I wonder if the gt for R row_group_as_column argument could make this look a bit better (even if it doesn't provide a general solution for spanning rows)?

Note that above, the groupings (e.g. grp_a) are their own column, and span rows.

If you were thinking of other cases, seeing some examples might help me get a handle on the important ones to support.

posit-dev / great-tables

To enable merged cells #104

Proposal

Reading and tidying

Outputting as a grid