pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.84k stars 18k forks source link

ENH: use '=' width in multirow with compatible column types #60281

Open tuetenk0pp opened 1 week ago

tuetenk0pp commented 1 week ago

Feature Type

Problem Description

It would be nice to be able to use the = option with multirow. Maybe this is something that can be decided automatically according to the column type.

Feature Description

Add logic to _parse_latex_header_span() to check for cloumn types.

IF columntype NOT ONE OF l, c, r
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"

Alternative Solutions

One could also find/replace the return value of pandas.DataFrame.to_latex manually:

df = pd.read_excel("input.xlsx")
df = df.set_index(df.columns[:3].tolist())
df = df.sort_index(multirow=True)

latex = df.to_latex()
latex = latex.replace("{*}", "{=}")

with open("output.tex", "w") as f:
    f.write(latex)

Additional Context

https://github.com/pandas-dev/pandas/blob/156e67e1c6f32d2a794b61604a9f89d411e5e915/pandas/io/formats/style_render.py#L2451

See this snippet from the multirow documentation:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

rhshadrach commented 1 week ago

@tuetenk0pp - can you include a reproducible example. We cannot reproduce your example without your input.xlsx file. It would be preferable to have your example not depend on external files at all, e.g.

df = pd.DataFrame(...)
tuetenk0pp commented 1 week ago

@rhshadrach, here is a more in depth example of what is happening:

>>> import pandas as pd
>>> d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(df.columns[:3].tolist())
>>> df = df.sort_index()

The DataFrame now looks like this:

>>> df
                                            count
location       category      item
Drugstore      Medicine      Cough Sirup        1
                             Pain Reliever      1
               Personal Care Shampoo            1
                             Toothpaste         1
Farmers Market Fruits        Oranges            6
                             Strawberrys        1
               Vegetables    Carrots            1
                             Spinach            2
Supermarket    Dry Goods     Pasta              1
                             Rice               2
               Produce       Apples             6
                             Bananas           12

Now I want to export the DataFrame to LaTeX:

>>> latex = r"""
... \documentclass[border=0.5cm]{standalone}
...
... \usepackage{array}
... \usepackage{multirow}
... \usepackage{booktabs}
... 
... \begin{document}
... """
>>> column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'
>>> latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

The LaTeX Code now looks like this:

>>> print(latex)

\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{*}{Drugstore} & \multirow[t]{2}{*}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Farmers Market} & \multirow[t]{2}{*}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Supermarket} & \multirow[t]{2}{*}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{*}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

Here I have a screenshot of the compiled document: grafik

As you can see, the multicolumn does not respect the fixed column width. The Farmers Market and Personal Care strings should wrap. This can be fixed with replacing the * with = in \multirow{<nrows>}{<width>}{<text>}:

latex_wrap = latex.replace('{*}', '{=}')

The LaTeX Code now looks like this:

>>> print(latex_wrap)

\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}

\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
 &  &  & count \\
location & category & item &  \\
\midrule
\multirow[t]{4}{=}{Drugstore} & \multirow[t]{2}{=}{Medicine} & Cough Sirup & 1 \\
 &  & Pain Reliever & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Personal Care} & Shampoo & 1 \\
 &  & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Farmers Market} & \multirow[t]{2}{=}{Fruits} & Oranges & 6 \\
 &  & Strawberrys & 1 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Vegetables} & Carrots & 1 \\
 &  & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Supermarket} & \multirow[t]{2}{=}{Dry Goods} & Pasta & 1 \\
 &  & Rice & 2 \\
\cline{2-4}
 & \multirow[t]{2}{=}{Produce} & Apples & 6 \\
 &  & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}

\end{document}

As you can see, the issue no longer exists in the compiled document. The Farmers Market and Personal Care strings now wrap into the next line. grafik

[!NOTE] Finding * and replacing them with = becomes much harder, once not every column containing multirows has a fixed width. Therefore the workaround as presented would no longer work. This is why I think this should acutally become a feature. Manipulating other portions of latex, e.g. replacing the tabular environment with a tabularx environment remains simple.

Here is the summarized code for reproduction:

import pandas as pd
import subprocess

d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}

df = pd.DataFrame(data=d)

df = df.set_index(df.columns[:3].tolist())
df = df.sort_index()
print(df)

latex = r"""
\documentclass[border=0.5cm]{standalone}

\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}

\begin{document}
"""

column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'

latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'

latex_flex = latex.replace('{*}', '{=}')

with open('multirow.tex', 'w') as f:
    f.write(latex)

with open('multirow_flex.tex', 'w') as f:
    f.write(latex_flex)

for file in ['multirow.tex', 'multirow_flex.tex']:

    cmd = ['latexmk', '-lualatex', '-interaction=nonstopmode', '-file-line-error', '-silent', '-auxdir=./tmp', file]

    try:
        subprocess.run(cmd)
    except:
        print(f'could not run command: {cmd}')
rhshadrach commented 1 week ago

Thanks for the additional information! From the OP, you have the logic:

IF columntype NOT ONE OF l, c, r

I do not understand branching on this condition, can you elaborate here?

tuetenk0pp commented 1 week ago

The multirow documentation says:

The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).

So this actually means that multirow is able to adjust the width of the cell if the column has a defined width. This is generally the case for the LaTeX column types that are mentioned in the quote; so not for the standard column types l, c and r although the column type S from the siunitx package does not define a column width as well.

Thats why I thought it would be a good idea to check the columntype (from the column_format argument of to_latex()) and return the \multirow ... string accordingly.

With the added S column type the logic would look like this:

IF columntype NOT ONE OF l, c, r, S
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"

Or a more safe approach but with no support for any other column types than mentioned in the quote above:

IF columntype ONE OF p{}, m{}, b{}, X, L, C, R, J
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
ELSE
    RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
rhshadrach commented 5 days ago

Thanks, PRs to fix are welcome!

cc @attack68

attack68 commented 5 days ago

What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?

tuetenk0pp commented 5 days ago

What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?

The documentation states:

If you give “=” in other situations, you will get strange results (usually a too wide column).

attack68 commented 3 days ago

This is quite difficult to code I think. There are no natural attributes that contain any information as to the type of column being parsed, which means they would be have to be computed and dynamically passed to the existing rendering functions.

We have established that using * causes some layout errors in these cases. If we switched to using = permanently would the other 'strange' behaviour be worse for the standard cases or would it be less noticeble overall. Making this change would be an adoption of the least bad approach.

Alternatively can we just document a suggested approach for the minimal number of users who might encounter this. I would suggest something along the lines of pattern search with re and then replacing. Some pattern like..

import re
pattern = re.compile("\\\\multirow\\[.*?\\]\\{.*?\\}\\{\\*\\}")
pattern.search(s)
# <re.Match object; span=(173, 191), match='\\multirow[t]{4}{*}'>
rhshadrach commented 3 days ago

We have established that using * causes some layout errors in these cases. If we switched to using = permanently would the other 'strange' behaviour be worse for the standard cases or would it be less noticeble overall. Making this change would be an adoption of the least bad approach.

What about conditionally branching on column_format as @tuetenk0pp has suggested?