pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Styler.to_latex() #21673

Closed toobaz closed 2 years ago

toobaz commented 6 years ago

I have created a branch adding a to_latex method to Styler.

It is nowhere near readiness, and in particular:

This said, if anyone feels like experimenting, it might be slightly better than starting from scratch.

One aspect which I think is useful, even aside from this branch, and might benefit from some discussion, is the pair of methods _latex_preserve and _latex_restore, which basically replace LaTeX commands so that they are not disturbed by escaping, and then restore them. There might be a better way to code this, but I really think this is something we need to implement, and to offer to users who happen to nest LaTeX code in their cells.
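To make the preserve/restore idea concrete, here is a minimal sketch. It is not the code from the branch: the helper names (`latex_preserve`, `latex_restore`, `latex_escape`, `escape_cell`), the NUL-delimited placeholder format, and the small escape table are all assumptions chosen for illustration, and the regex only recognises simple commands without nested braces.

```python
import re

# Assumed, deliberately small escape table for LaTeX text mode.
_ESCAPES = {
    "\\": r"\textbackslash{}",
    "{": r"\{",
    "}": r"\}",
    "&": r"\&",
    "%": r"\%",
    "#": r"\#",
    "_": r"\_",
}

# Matches simple commands such as \alpha or \textbf{...} (no nested braces).
_COMMAND = re.compile(r"\\[A-Za-z]+(?:\{[^{}]*\})*")


def latex_preserve(text):
    """Swap each LaTeX command for a placeholder; return the text and the stash."""
    stash = []

    def _stash(match):
        stash.append(match.group(0))
        return f"\x00{len(stash) - 1}\x00"

    return _COMMAND.sub(_stash, text), stash


def latex_escape(text):
    """Escape special characters one at a time, so nothing is escaped twice."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def latex_restore(text, stash):
    """Put the stashed commands back in place of their placeholders."""
    return re.sub(r"\x00(\d+)\x00", lambda m: stash[int(m.group(1))], text)


def escape_cell(text):
    """Escape a cell value while leaving embedded LaTeX commands untouched."""
    protected, stash = latex_preserve(text)
    return latex_restore(latex_escape(protected), stash)


print(escape_cell(r"100% \textbf{bold} & {x}"))
```

Without the preserve step, the backslash and braces of `\textbf{bold}` would be escaped along with everything else, which is the problem the two methods are meant to avoid.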

moi90 commented 3 years ago

I'm really not happy with the weird syntax to squeeze LaTeX commands into CSS... It gets worse if a user wants to use siunitx's S columns. (I believe this is a pretty standard setup.)

Here, I made a demonstration of which steps are required to format tables correctly when using siunitx. Basically, a cell has to be formatted the following way:

markup = protected + unprotected
protected = ["{\cellcolor{...}}"]
unprotected = ["\color{...}"] + ["\bfseries"] + ["\textit"]

It would be really annoying to get this using the proposed CSS syntax.
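The cell layout described above can be sketched as a small helper. This is only an illustration under stated assumptions: `siunitx_cell` is a hypothetical function, it reads the "protected" part as a brace-wrapped `\cellcolor` command, and it leaves out the italic case.

```python
def siunitx_cell(value, background=None, color=None, bold=False):
    """Build the markup for one cell of an S column (hypothetical helper).

    The brace-wrapped part is hidden from siunitx's number parser; the
    remaining commands are left unbraced in front of the value.
    """
    protected = f"{{\\cellcolor{{{background}}}}}" if background else ""
    unprotected = ""
    if color:
        unprotected += f"\\color{{{color}}}"
    if bold:
        unprotected += "\\bfseries"
    # A space separates the last command name from the number that follows.
    separator = " " if unprotected else ""
    return f"{protected}{unprotected}{separator}{value}"


print(siunitx_cell("1.5", background="yellow", color="red", bold=True))
```

Keeping the two groups in separate variables mirrors the `markup = protected + unprotected` rule, so further commands can be appended to either group without touching the other.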

ExcelFormatter has a CSSToExcelConverter that uses a CSSResolver to parse the markup and generate the markup required by Excel.

Styler.to_latex should do the same.

While flexibility is great, I think it is more important to cover the common cases with ease (background-color, color, font-weight: bold, font-style: italic). (Also, it will not be very hard to extend the existing protected and unprotected formatting commands.)
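A rough sketch of what such a translation could look like for the four common cases listed above. The function `css_to_latex` is hypothetical and is not the pandas CSSResolver; it also uses `\itshape` for italics, a substitution made here because `\textit` takes an argument, whereas the other commands are declarations.

```python
def css_to_latex(css):
    """Translate a CSS declaration string into a list of LaTeX commands.

    Only background-color, color, font-weight: bold and font-style: italic
    are handled; every other declaration is silently ignored.
    """
    commands = []
    for declaration in css.split(";"):
        if ":" not in declaration:
            continue
        prop, value = (part.strip().lower() for part in declaration.split(":", 1))
        if prop == "background-color":
            commands.append(f"\\cellcolor{{{value}}}")
        elif prop == "color":
            commands.append(f"\\color{{{value}}}")
        elif (prop, value) == ("font-weight", "bold"):
            commands.append("\\bfseries")
        elif (prop, value) == ("font-style", "italic"):
            commands.append("\\itshape")
    return commands


print(css_to_latex("color: red; font-weight: bold;"))
```

Colour values are passed through as names; hex codes or `rgb()` values would need extra handling that this sketch does not attempt.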

attack68 commented 3 years ago

@moi90 allowing Styler to operate as CSS ('attribute', 'value') pairs or as LaTeX ('command', 'options') creates a non-duplicated and maintainable codebase that is flexible in both formats. Another advantage is that the unit tests and formatting methods are available in both versions.

You should not eliminate all of this flexibility for the sake of covering the common cases, which I have said are already very easy to translate into the LaTeX format. See this unpublished PR (not included in my original PR because it is an extension that would just complicate things for the PR reviewers).

So far the parameters that you have raised that have been solved are:

This discussion has very much followed the line of you challenging my development's capabilities. This has been valuable since it has driven the development of these options. However, without a suitable challenger model it is difficult for me to question how you plan to deal with some of these items you have raised, and others, if you still believe that working with a LaTeXFormatter is preferable (I don't).

Here is an example of the extension module from the above PR: [screenshot, 2021-03-28]

moi90 commented 3 years ago

@attack68 I greatly appreciate your skills as a developer and am impressed that you have always skillfully addressed my "challenges". :+1:

I'm sorry I haven't come up with a challenger model. I tried to dig into the existing code from multiple angles but I currently lack the time to get to a solution that would be worth discussing. Also, you convinced me that using Jinja is not so bad after all.

> @moi90 allowing Styler to operate as CSS ('attribute', 'value') pairs or as LaTeX ('command', 'options') creates a non-duplicated and maintainable codebase that is flexible in both formats. Another advantage is that the unit tests and formatting methods are available in both versions.

I don't see how a translation between CSS properties and LaTeX commands would lead to more code duplication or less maintainability.

> automatic alignment of columns r for numeric and l for non-numeric (in .to_latex())

Maybe we can have an additional option for siunitx that uses the S column type for numeric columns?

> automatic alignment of columns r for numeric and l for non-numeric (in .render(latex=True) - this is harder to implement without code duplication)

Maybe we can drop render(latex=True) altogether? What is the advantage over to_latex?

> How do you differentiate between bfseries and textbf, or textit and emph i.e. the cases where there is not a 1-1 direct CSS translation?

> How do you plan to deal with the positioning of braces which may be different for siunitx or other packages?

I answered here in detail. Basically: Always be compatible with siunitx; this does not harm other setups.

> How do you plan on translating font-size for which Large and Huge are not valid CSS values?

I would not make this a problem. On the CSS side, there is large, x-large, xx-large, and xxx-large. On the LaTeX side, there is \large, \Large, \LARGE, \huge, \Huge. These can be matched nicely (when excluding \Huge). (Same for the small sizes.)
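The matching described above could be written down as a lookup table. This is a sketch only: the names `CSS_TO_LATEX_FONT_SIZE` and `latex_font_size` are hypothetical, the large sizes follow the pairing in the comment, and the small sizes (dropping `\tiny` by analogy with dropping `\Huge`) and the `medium` entry are assumptions.

```python
# CSS absolute-size keywords mapped to LaTeX size commands.
CSS_TO_LATEX_FONT_SIZE = {
    "xx-small": r"\scriptsize",  # assumed pairing; \tiny is left unused
    "x-small": r"\footnotesize",  # assumed pairing
    "small": r"\small",
    "medium": r"\normalsize",  # assumed pairing
    "large": r"\large",
    "x-large": r"\Large",
    "xx-large": r"\LARGE",
    "xxx-large": r"\huge",  # \Huge is left unused
}


def latex_font_size(css_value):
    """Return the LaTeX size command for a CSS keyword, or None if unknown."""
    return CSS_TO_LATEX_FONT_SIZE.get(css_value.strip().lower())


print(latex_font_size("x-large"))
```

Length values such as `12px` have no entry and return `None`, so a caller would have to decide whether to skip them or fall back to another rule.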

> Here is an example of the extension module from the above PR

Cool! (Albeit not yet siunitx-compatible.)

jreback commented 3 years ago

If anyone would like to review, try out, or comment on the PR https://github.com/pandas-dev/pandas/pull/40422, it would be appreciated.

alevinetx commented 3 years ago

As an end user, when I have a Styler object that renders fine in a notebook but, when exported to PDF, becomes a class display (....Styler), do I need to explicitly call .to_latex(), or will that be done behind the scenes?

Thank you for working on this issue!

asapsmc commented 2 years ago

Could you please provide some more advanced usage examples for to_latex() in the documentation?

attack68 commented 2 years ago

What, specifically, would you like to see?

asapsmc commented 2 years ago

Building more complex tables, with multicols and multirows, other LaTeX elements such as \cmidrule and, if possible (I can't understand from the documentation if that's possible or not), if there's a way to include logic in the table building (e.g. include \midrule after a specific row, etc.). I have to publish lots of tables in LaTeX, but I'm unable to explore all possibilities with the current examples.

attack68 commented 2 years ago

> Building more complex tables, with multicols and multirows,

You need a MultiIndex and to use the multirow_align, multicol_align, sparse_index, sparse_columns arguments. The rest is handled by default. Data values will never be multi-columned or multi-rowed, only indexes.

> other LaTeX elements such as \cmidrule and, if possible (I can't understand from the documentation if that's possible or not), if there's a way to include logic in the table building (e.g. include \midrule after a specific row, etc.).

No, you can't add conditional out-of-cell logic. The only custom commands you can add are as described in the docs page, akin to the example for \rowcolors{1}{pink}{red}.

asapsmc commented 2 years ago

Thanks for your feedback. But for a novice user like me, it's been impossible to make it work from the examples (both in Styler.to_latex() and DataFrame.to_latex()).

moi90 commented 2 years ago

> include \midrule after a specific row

For that, I usually split the generated output into individual lines, insert the extra rules at pre-defined locations and concatenate the result. This breaks easily but it is better than nothing.

asapsmc commented 2 years ago

> include \midrule after a specific row

> For that, I usually split the generated output into individual lines, insert the extra rules at pre-defined locations and concatenate the result. This breaks easily but it is better than nothing.

@moi90: How do you split the generated output into individual lines?
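One way the split, insert, and concatenate approach could look, as a sketch rather than the workflow actually used above. The function name `insert_rule_before_groups` and the choice of `\multirow` as the marker for the start of a group are assumptions; the sample lines are shaped like the Styler output shown in this thread.

```python
def insert_rule_before_groups(latex, rule=r"\midrule", marker=r"\multirow"):
    """Insert `rule` before every line starting with `marker`, except the first."""
    result = []
    seen_first_group = False
    for line in latex.splitlines():
        if line.lstrip().startswith(marker):
            if seen_first_group:
                result.append(rule)
            seen_first_group = True
        result.append(line)
    # splitlines() drops the line endings, so join with "\n" again.
    return "\n".join(result)


sample = "\n".join(
    [
        r"\midrule",
        r"\multirow[c]{2}{*}{G} & Baseline & 5.825 & 5.804 \\",
        r" & Version2 & 5.825 & 5.804 \\",
        r"\multirow[c]{2}{*}{H} & Baseline & 4.677 & 4.571 \\",
        r" & Version2 & 4.802 & 4.660 \\",
        r"\bottomrule",
    ]
)
print(insert_rule_before_groups(sample))
```

As noted above, this kind of post-processing is fragile: it depends on each table row sitting on its own line and on the marker appearing only where a new group begins.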

asapsmc commented 2 years ago

(Sorry to put this here, but I'm finding several questions on StackOverflow (e.g. this or this) unanswered, and my time is running out, so I bring this question into where I know there is knowledge to solve this. Please excuse me)

Why does Styler.to_latex() not produce the same output (namely the \cline) as DataFrame.to_latex()?

My original dataframe after aggregating results (dfg) is this:

                   F-1   F-2
dataset Model               
G       Baseline 5.825 5.804
        Version2 5.825 5.804
H       Baseline 4.677 4.571
        Version2 4.802 4.660
S       Baseline 2.406 1.921
        Version2 2.719 2.189
T       Baseline 5.284 4.949
        Version2 5.931 5.909 

Then I use the following code:

pd.options.display.float_format = '{:,.3f}'.format    
styler_latex = dfg.style.to_latex(position="H", hrules=True, multirow_align="c", multicol_align="r", sparse_index=True)
dfg_latex = dfg.to_latex(position='H', escape=False, sparsify=True, multirow=True, multicolumn=True)
print('styler:', styler_latex)
print('dfg_latex', dfg_latex)

Output from Styler.to_latex() (LaTeX code and image)

\begin{table}[H]
\centering
\begin{tabular}{llrr}
\toprule
{} & {} & {F-1} & {F-2} \\
{dataset} & {Model} & {} & {} \\
\midrule
\multirow[c]{2}{*}{G} & Baseline & 5.824811 & 5.804303 \\
 & Version2 & 5.824811 & 5.804303 \\
\multirow[c]{2}{*}{H} & Baseline & 4.677066 & 4.570626 \\
 & Version2 & 4.801857 & 4.660115 \\
\multirow[c]{2}{*}{S} & Baseline & 2.406244 & 1.921260 \\
 & Version2 & 2.719123 & 2.189293 \\
\multirow[c]{2}{*}{T} & Baseline & 5.284241 & 4.949087 \\
 & Version2 & 5.931376 & 5.909215 \\
\bottomrule
\end{tabular}
\end{table}

[image: rendered table]

Output from DataFrame.to_latex() (LaTeX code and image)

\begin{table}[H]
    \centering
    \begin{tabular}{llrr}
    \toprule
      &          &   F-1 &   F-2 \\
    dataset & Model &       &       \\
    \midrule
    \multirow{2}{*}{G} & Baseline & 5.825 & 5.804 \\
      & Version2 & 5.825 & 5.804 \\
    \cline{1-4}
    \multirow{2}{*}{H} & Baseline & 4.677 & 4.571 \\
      & Version2 & 4.802 & 4.660 \\
    \cline{1-4}
    \multirow{2}{*}{S} & Baseline & 2.406 & 1.921 \\
      & Version2 & 2.719 & 2.189 \\
    \cline{1-4}
    \multirow{2}{*}{T} & Baseline & 5.284 & 4.949 \\
      & Version2 & 5.931 & 5.909 \\
    \bottomrule
    \end{tabular}
    \end{table}

[image: rendered table]

Questions:

  1. Why does Styler.to_latex() not include \cline, unlike DataFrame.to_latex()? Is there any way to "force" this behaviour in Styler.to_latex()?

I tried to do

my_dfstyle = my_dfstyle.set_table_styles([
        {'selector': 'toprule', 'props': ':toprule;'},
        {'selector': 'midrule', 'props': ':midrule;'},
        {'selector': 'bottomrule', 'props': ':bottomrule;'},
    ], overwrite=False)

but I was unsuccessful. Is there any way to accomplish this type of control (e.g. force \midrule between multirows)?

  2. In a report I wouldn't like to see table headers with 2 lines (as in the above tables, where one line would suffice). But to achieve that, I have to reset the index, and then I lose the ability to use multirows (e.g. in the dataset column). Is there any way to circumvent this? Is it possible to "merge" data cells?

attack68 commented 2 years ago

pandas is a volunteer library. cline is not (yet) implemented in Styler.to_latex, no one has volunteered the time to develop it. toprule, midrule, and bottomrule will not help you here.

no, datacell merging is not possible. not sure what you mean by "two lines" but irrespective i am confident the features you are looking for have not been developed.

asapsmc commented 2 years ago

@attack68: Just to be clear I know that pandas is a volunteer library and I'm extremely grateful for pandas!! I was just checking if there was any option to accomplish what I wanted. About the 2 lines, I meant these 2 lines in the header: [image]

attack68 commented 2 years ago

those lines are the toprule and midrule. they are visible in both DF and Styler version.

In the Styler version you can take them both away with hrules=False. You can also take just the midrule away by adding table styles for toprule and bottomrule and not including the midrule (and keeping hrules=False).

asapsmc commented 2 years ago

@attack68: sorry, I misled you. I understand what you explained about the lines, but what I meant was: is there any way I can put the header in one row (instead of 2 rows) without losing the MultiIndex grouping under the "dataset" column?