pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.77k stars 17.96k forks source link

ENH: use '=' width in multirow with compatible column types #60281

Open tuetenk0pp opened 2 days ago

tuetenk0pp commented 2 days ago

Feature Type

Problem Description

It would be nice to be able to use the = option with multirow. Maybe this is something that can be decided automatically according to the column type.

Feature Description

Add logic to _parse_latex_header_span() to check for cloumn types.

IF columntype NOT ONE OF l, c, r
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"

Alternative Solutions

One could also find/replace the return value of pandas.DataFrame.to_latex manually:

df = pd.read_excel("input.xlsx")
df = df.set_index(df.columns[:3].tolist())
df = df.sort_index(multirow=True)

latex = df.to_latex()
latex = latex.replace("{*}", "{=}")

with open("output.tex", "w") as f:
    f.write(latex)

Additional Context

https://github.com/pandas-dev/pandas/blob/156e67e1c6f32d2a794b61604a9f89d411e5e915/pandas/io/formats/style_render.py#L2451

See this snippet from the multirow documentation:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

rhshadrach commented 2 days ago

@tuetenk0pp - can you include a reproducible example. We cannot reproduce your example without your input.xlsx file. It would be preferable to have your example not depend on external files at all, e.g.

df = pd.DataFrame(...)
tuetenk0pp commented 2 days ago

@rhshadrach, here is a more in depth example of what is happening:

>>> import pandas as pd
>>> d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(df.columns[:3].tolist())
>>> df = df.sort_index()

The DataFrame now looks like this:

>>> df
                                            count
location       category      item
Drugstore      Medicine      Cough Sirup        1
                             Pain Reliever      1
               Personal Care Shampoo            1
                             Toothpaste         1
Farmers Market Fruits        Oranges            6
                             Strawberrys        1
               Vegetables    Carrots            1
                             Spinach            2
Supermarket    Dry Goods     Pasta              1
                             Rice               2
               Produce       Apples             6
                             Bananas           12

Now I want to export the DataFrame to LaTeX:

>>> latex = r"""
... \documentclass[border=0.5cm]{standalone}
...
... \usepackage{array}
... \usepackage{multirow}
... \usepackage{booktabs}
... 
... \begin{document}
... """
>>> column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'
>>> latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

The LaTeX Code now looks like this:

>>> print(latex)

\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{*}{Drugstore} & \multirow[t]{2}{*}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Farmers Market} & \multirow[t]{2}{*}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Supermarket} & \multirow[t]{2}{*}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

Here I have a screenshot of the compiled document: grafik

As you can see, the multicolumn does not respect the fixed column width. The Farmers Market and Personal Care strings should wrap. This can be fixed with replacing the * with = in \multirow{<nrows>}{<width>}{<text>}:

latex_wrap = latex.replace('{*}', '{=}')

The LaTeX Code now looks like this:

>>> print(latex_wrap)

\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{=}{Drugstore} & \multirow[t]{2}{=}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Farmers Market} & \multirow[t]{2}{=}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Supermarket} & \multirow[t]{2}{=}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

As you can see, the issue no longer exists in the compiled document. The Farmers Market and Personal Care strings now wrap into the next line. grafik

[!NOTE] Finding * and replacing them with = becomes much harder, once not every column containing multirows has a fixed width. Therefore the workaround as presented would no longer work. This is why I think this should acutally become a feature. Manipulating other portions of latex, e.g. replacing the tabular environment with a tabularx environment remains simple.

Here is the summarized code for reproduction:

import pandas as pd
import subprocess

d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}

df = pd.DataFrame(data=d)

df = df.set_index(df.columns[:3].tolist())
df = df.sort_index()
print(df)

latex = r"""
\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}
"""

column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'

latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

latex_flex = latex.replace('{*}', '{=}')

with open('multirow.tex', 'w') as f:
    f.write(latex)

with open('multirow_flex.tex', 'w') as f:
    f.write(latex_flex)

for file in ['multirow.tex', 'multirow_flex.tex']:

    cmd = ['latexmk', '-lualatex', '-interaction=nonstopmode', '-file-line-error', '-silent', '-auxdir=./tmp', file]

    try:
        subprocess.run(cmd)
    except:
        print(f'could not run command: {cmd}')
rhshadrach commented 1 day ago

Thanks for the additional information! From the OP, you have the logic:

IF columntype NOT ONE OF l, c, r

I do not understand branching on this condition, can you elaborate here?

tuetenk0pp commented 1 day ago

The multirow documentation says:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

So this actually means that multirow is able to adjust the width of the cell if the column has a defined width. This is generally the case for the LaTeX column types that are mentioned in the quote; so not for the standard column types l, c and r although the column type S from the siunitx package does not define a column width as well.

Thats why I thought it would be a good idea to check the columntype (from the column_format argument of to_latex()) and return the \multirow ... string accordingly.

With the added S column type the logic would look like this:

IF columntype NOT ONE OF l, c, r, S
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"

Or a more safe approach but with no support for any other column types than mentioned in the quote above:

IF columntype ONE OF p{}, m{}, b{}, X, L, C, R, J
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"