Open tuetenk0pp opened 1 week ago
@tuetenk0pp - can you include a reproducible example. We cannot reproduce your example without your input.xlsx
file. It would be preferable to have your example not depend on external files at all, e.g.
df = pd.DataFrame(...)
@rhshadrach, here is a more in depth example of what is happening:
>>> import pandas as pd
>>> d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}
>>> df = pd.DataFrame(data=d)
>>> df = df.set_index(df.columns[:3].tolist())
>>> df = df.sort_index()
The DataFrame now looks like this:
>>> df
count
location category item
Drugstore Medicine Cough Sirup 1
Pain Reliever 1
Personal Care Shampoo 1
Toothpaste 1
Farmers Market Fruits Oranges 6
Strawberrys 1
Vegetables Carrots 1
Spinach 2
Supermarket Dry Goods Pasta 1
Rice 2
Produce Apples 6
Bananas 12
Now I want to export the DataFrame to LaTeX:
>>> latex = r"""
... \documentclass[border=0.5cm]{standalone}
...
... \usepackage{array}
... \usepackage{multirow}
... \usepackage{booktabs}
...
... \begin{document}
... """
>>> column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'
>>> latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'
The LaTeX Code now looks like this:
>>> print(latex)
\documentclass[border=0.5cm]{standalone}
\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}
\begin{document}
\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
& & & count \\
location & category & item & \\
\midrule
\multirow[t]{4}{*}{Drugstore} & \multirow[t]{2}{*}{Medicine} & Cough Sirup & 1 \\
& & Pain Reliever & 1 \\
\cline{2-4}
& \multirow[t]{2}{*}{Personal Care} & Shampoo & 1 \\
& & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Farmers Market} & \multirow[t]{2}{*}{Fruits} & Oranges & 6 \\
& & Strawberrys & 1 \\
\cline{2-4}
& \multirow[t]{2}{*}{Vegetables} & Carrots & 1 \\
& & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{*}{Supermarket} & \multirow[t]{2}{*}{Dry Goods} & Pasta & 1 \\
& & Rice & 2 \\
\cline{2-4}
& \multirow[t]{2}{*}{Produce} & Apples & 6 \\
& & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}
\end{document}
Here I have a screenshot of the compiled document:
As you can see, the multicolumn
does not respect the fixed column width. The Farmers Market
and Personal Care
strings should wrap. This can be fixed with replacing the *
with =
in \multirow{<nrows>}{<width>}{<text>}
:
latex_wrap = latex.replace('{*}', '{=}')
The LaTeX Code now looks like this:
>>> print(latex_wrap)
\documentclass[border=0.5cm]{standalone}
\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}
\begin{document}
\begin{tabular}{p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r}
\toprule
& & & count \\
location & category & item & \\
\midrule
\multirow[t]{4}{=}{Drugstore} & \multirow[t]{2}{=}{Medicine} & Cough Sirup & 1 \\
& & Pain Reliever & 1 \\
\cline{2-4}
& \multirow[t]{2}{=}{Personal Care} & Shampoo & 1 \\
& & Toothpaste & 1 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Farmers Market} & \multirow[t]{2}{=}{Fruits} & Oranges & 6 \\
& & Strawberrys & 1 \\
\cline{2-4}
& \multirow[t]{2}{=}{Vegetables} & Carrots & 1 \\
& & Spinach & 2 \\
\cline{1-4} \cline{2-4}
\multirow[t]{4}{=}{Supermarket} & \multirow[t]{2}{=}{Dry Goods} & Pasta & 1 \\
& & Rice & 2 \\
\cline{2-4}
& \multirow[t]{2}{=}{Produce} & Apples & 6 \\
& & Bananas & 12 \\
\cline{1-4} \cline{2-4}
\bottomrule
\end{tabular}
\end{document}
As you can see, the issue no longer exists in the compiled document. The Farmers Market
and Personal Care
strings now wrap into the next line.
[!NOTE] Finding
*
and replacing them with=
becomes much harder, once not every column containing multirows has a fixed width. Therefore the workaround as presented would no longer work. This is why I think this should acutally become a feature. Manipulating other portions oflatex
, e.g. replacing thetabular
environment with atabularx
environment remains simple.
Here is the summarized code for reproduction:
import pandas as pd
import subprocess
d = {'location': ['Supermarket', 'Supermarket', 'Supermarket', 'Supermarket', 'Drugstore', 'Drugstore', 'Drugstore', 'Drugstore', 'Farmers Market', 'Farmers Market', 'Farmers Market', 'Farmers Market'], 'category': ['Produce', 'Produce', 'Dry Goods', 'Dry Goods', 'Personal Care', 'Personal Care', 'Medicine', 'Medicine', 'Vegetables', 'Vegetables', 'Fruits', 'Fruits'], 'item': ['Apples', 'Bananas', 'Rice', 'Pasta', 'Shampoo', 'Toothpaste', 'Pain Reliever', 'Cough Sirup', 'Carrots', 'Spinach', 'Strawberrys', 'Oranges'], 'count': [6, 12, 2, 1, 1, 1, 1, 1, 1, 2, 1, 6]}
df = pd.DataFrame(data=d)
df = df.set_index(df.columns[:3].tolist())
df = df.sort_index()
print(df)
latex = r"""
\documentclass[border=0.5cm]{standalone}
\usepackage{array}
\usepackage{multirow}
\usepackage{booktabs}
\begin{document}
"""
column_format: str = r'p{2cm}<{\raggedright} p{2cm}<{\raggedright} p{2cm}<{\raggedright} r'
latex += '\n' + df.to_latex(column_format=column_format, multirow=True) + '\n' + r'\end{document}'
latex_flex = latex.replace('{*}', '{=}')
with open('multirow.tex', 'w') as f:
f.write(latex)
with open('multirow_flex.tex', 'w') as f:
f.write(latex_flex)
for file in ['multirow.tex', 'multirow_flex.tex']:
cmd = ['latexmk', '-lualatex', '-interaction=nonstopmode', '-file-line-error', '-silent', '-auxdir=./tmp', file]
try:
subprocess.run(cmd)
except:
print(f'could not run command: {cmd}')
Thanks for the additional information! From the OP, you have the logic:
IF columntype NOT ONE OF l, c, r
I do not understand branching on this condition, can you elaborate here?
The multirow
documentation says:
The width can also be given as = when the \multirow entry is given in a column that has a defined width, for example in a p{} column, an X column in tabularx or a L, C, R or J column in a tabulary environment. The text will be set in a \parbox of that width. If you give “=” in other situations, you will get strange results (usually a too wide column).
So this actually means that multirow
is able to adjust the width of the cell if the column has a defined width. This is generally the case for the LaTeX column types that are mentioned in the quote; so not for the standard column types l
, c
and r
although the column type S
from the siunitx
package does not define a column width as well.
Thats why I thought it would be a good idea to check the columntype (from the column_format
argument of to_latex()
) and return the \multirow ...
string accordingly.
With the added S
column type the logic would look like this:
IF columntype NOT ONE OF l, c, r, S
RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
ELSE
RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
Or a more safe approach but with no support for any other column types than mentioned in the quote above:
IF columntype ONE OF p{}, m{}, b{}, X, L, C, R, J
RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{=}}{{{display_val}}}"
ELSE
RETURN f"\\multirow[{multirow_align}]{{{rowspan}}}{{*}}{{{display_val}}}"
Thanks, PRs to fix are welcome!
cc @attack68
What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?
What happens if "=" is given in all cases? Does it still compile even for 'standard' columns?
The documentation states:
If you give “=” in other situations, you will get strange results (usually a too wide column).
This is quite difficult to code I think. There are no natural attributes that contain any information as to the type of column being parsed, which means they would be have to be computed and dynamically passed to the existing rendering functions.
We have established that using *
causes some layout errors in these cases. If we switched to using =
permanently would the other 'strange' behaviour be worse for the standard cases or would it be less noticeble overall. Making this change would be an adoption of the least bad approach.
Alternatively can we just document a suggested approach for the minimal number of users who might encounter this. I would suggest something along the lines of pattern search with re and then replacing. Some pattern like..
import re
pattern = re.compile("\\\\multirow\\[.*?\\]\\{.*?\\}\\{\\*\\}")
pattern.search(s)
# <re.Match object; span=(173, 191), match='\\multirow[t]{4}{*}'>
We have established that using
*
causes some layout errors in these cases. If we switched to using=
permanently would the other 'strange' behaviour be worse for the standard cases or would it be less noticeble overall. Making this change would be an adoption of the least bad approach.
What about conditionally branching on column_format
as @tuetenk0pp has suggested?
Feature Type
[X] Adding new functionality to pandas
[X] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
It would be nice to be able to use the
=
option with multirow. Maybe this is something that can be decided automatically according to the column type.Feature Description
Add logic to
_parse_latex_header_span()
to check for cloumn types.Alternative Solutions
One could also find/replace the return value of
pandas.DataFrame.to_latex
manually:Additional Context
https://github.com/pandas-dev/pandas/blob/156e67e1c6f32d2a794b61604a9f89d411e5e915/pandas/io/formats/style_render.py#L2451
See this snippet from the multirow documentation: