pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Styler.to_latex() #21673

Closed toobaz closed 2 years ago

toobaz commented 6 years ago

I have created a branch adding a to_latex method to Styler.

It is nowhere near readiness, and in particular:

This said, if anyone feels like experimenting, it might be slightly better than starting from scratch.

One aspect which I think is useful, even aside from this branch, and might benefit from some discussion, is the pair of methods _latex_preserve and _latex_restore, which basically replace LaTeX commands so that they are not disturbed by escaping, and then restore them. There might be a better way to code this, but I really think this is something we need to implement, and to offer to users who happen to nest LaTeX code in their cells.
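
To make the preserve/restore idea concrete, here is a minimal sketch. It is not the code from the branch: the function names, the regular expression for what counts as a "LaTeX command", and the small escape table are all assumptions chosen for illustration.

```python
import re
from typing import Dict, Tuple

# Assumption: a command is a backslash, letters, then any number of
# non-nested {...} groups. Real LaTeX is richer than this.
_COMMAND = re.compile(r"\\[A-Za-z]+(?:\{[^{}]*\})*")
# Assumption: a deliberately small set of characters to escape.
_SPECIAL = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def latex_preserve(text: str) -> Tuple[str, Dict[str, str]]:
    """Swap each LaTeX command for a placeholder that escaping will not touch."""
    saved: Dict[str, str] = {}

    def stash(match: "re.Match[str]") -> str:
        key = f"\x00{len(saved)}\x00"
        saved[key] = match.group(0)
        return key

    return _COMMAND.sub(stash, text), saved


def latex_escape(text: str) -> str:
    """Escape special characters one by one."""
    return "".join(_SPECIAL.get(ch, ch) for ch in text)


def latex_restore(text: str, saved: Dict[str, str]) -> str:
    """Put the stashed commands back in place of their placeholders."""
    for key, command in saved.items():
        text = text.replace(key, command)
    return text


def escape_keeping_commands(text: str) -> str:
    protected, saved = latex_preserve(text)
    return latex_restore(latex_escape(protected), saved)
```

With these assumptions, `escape_keeping_commands(r"\emph{x} & y_1")` leaves `\emph{x}` alone and escapes only the `&` and the `_` outside it.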

MingweiSamuel commented 5 years ago

Would really appreciate this feature, thanks for making this issue

srossi93 commented 5 years ago

Any updates?

soumitrakp commented 4 years ago

+1 for this feature

KaleabTessera commented 4 years ago

+1

dorukhansergin commented 4 years ago

+1

th0ger commented 4 years ago

Does this enhancement aim to solve issues like this: <pandas.io.formats.style.Styler at 0x.......>? E.g. df.style.hide_index() combined with jupyter nbconvert --to pdf

toobaz commented 4 years ago

Does this enhancement aim to solve issues like this:

Not sure what happens when you convert a notebook to pdf... but yes, it might be that to_latex() gets somehow called on (non-styled) DataFrames, and in that case, yes, fixing this issue would solve that too.

harakiricode commented 4 years ago

+1

yannpequignot commented 3 years ago

+1

teyden commented 3 years ago

+1

MarcoGorelli commented 3 years ago

Please, this isn't helpful - see here for how to contribute, else wait for someone else to do it, but carrying on commenting +1 is disruptive

toobaz commented 3 years ago

but carrying on commenting +1 is disruptive

(and no more effective than adding a simple "thumbs up" on the first comment)

cauebs commented 3 years ago

That branch is currently 8352 commits behind. Maybe you should try to rebase and open a PR, to see if it catches the attention of the maintainers.

toobaz commented 3 years ago

Maybe you should try to rebase and open a PR, to see if it catches the attention of the maintainers.

Who's "you"? If it's me, well, I am a maintainer, and that branch already caught my attention, but as of now it didn't result in me doing anything more :-)

Jokes apart, @cauebs feel free to try to rebase - not sure if it will help merging this eventually, but at least it will help making sure no changes in the last one and a half years broke my approach. In any case, as I wrote, that branch wasn't and wouldn't be ready for a PR.

moi90 commented 3 years ago

I would really like to see this added to Pandas!

What are the minimum requirements to get this merged? What would a sensible test case look like?

I don't think that it needs to support all of the current functionality before being merged, colored background and bold cells would be enough, imho.

toobaz commented 3 years ago

What are the minimum requirements to get this merged?

I think that more than any set of features, it is just doing it "right", which means

I think a PR satisfying these two points could be perfectly mergeable even with a really minimal set of supported formatting features.

moi90 commented 3 years ago

What is the problem with conversion to str? What happens in DataFrame.to_latex that needs to be duplicated?

(Sorry if this is obvious from the code, I didn't have a look yet. I'm just trying to get a feeling for the complexity of the problem.)

toobaz commented 3 years ago

What is the problem with conversion to str?

It's explained in my first comment ;-)

In addition, it's an obvious code duplication, since the process of converting cells content to strings is already implemented for DataFrame.to_latex().

moi90 commented 3 years ago

@toobaz Your branch also adds a Styler.to_html. Is this safe to remove for now?

moi90 commented 3 years ago

If I understand correctly, this is what we currently have:

"Styler.to_latex" -> "NDFrame.to_latex" -> "DataFrameRenderer.to_latex" -> "LatexFormatter.to_string" -> "TableBuilderAbstract.get_result";

toobaz commented 3 years ago

@toobaz Your branch also adds a Styler.to_html. Is this safe to remove for now?

Yes, I think the techniques to be used for to_html are very similar to those for to_latex, but there is definitely no need to implement them together.

toobaz commented 3 years ago

If I understand correctly, this is what we currently have:

Yes, looks sound to me, with a .astype(str) along the first arrow. That first arrow is the problem, as NDFrame.to_latex does operations that we can (conceptually) split in two steps: prepare the data in each cell and convert it to string, finalize the entire table to a string. We need to rely on existing code for the first step, then apply our transformations (e.g. as implemented in my branch), and only then rely on existing code for the second step.
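
The two-step split can be sketched in a few lines. This is a toy model under stated assumptions, not pandas internals: `format_cells`, `assemble_table` and the `style_hook` parameter are hypothetical names, and the table is a plain list of rows.

```python
from typing import Callable, List, Optional

Hook = Callable[[str, int, int], str]  # (cell text, row, col) -> styled text


def format_cells(rows: List[List[object]]) -> List[List[str]]:
    """Step 1: turn every cell into its display string."""
    return [[f"{v:.2f}" if isinstance(v, float) else str(v) for v in row] for row in rows]


def assemble_table(str_rows: List[List[str]], numeric_cols: List[bool]) -> str:
    """Step 2: build the final tabular string from already-formatted cells."""
    spec = "".join("r" if is_num else "l" for is_num in numeric_cols)
    body = [" & ".join(row) + r" \\" for row in str_rows]
    return "\n".join([r"\begin{tabular}{" + spec + "}", *body, r"\end{tabular}"])


def to_latex(rows: List[List[object]], style_hook: Optional[Hook] = None) -> str:
    # Column types are detected on the raw data, before anything becomes a string,
    # so styling cannot turn a numeric column into a left-aligned text column.
    numeric_cols = [all(isinstance(v, (int, float)) for v in col) for col in zip(*rows)]
    str_rows = format_cells(rows)
    if style_hook is not None:
        # Styles are applied between the two steps.
        str_rows = [[style_hook(s, i, j) for j, s in enumerate(row)]
                    for i, row in enumerate(str_rows)]
    return assemble_table(str_rows, numeric_cols)
```

The point of the ordering is that the column specification is computed from the data, while the styling only ever sees strings.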

moi90 commented 3 years ago

Would it be viable to use DataFrameFormatter.format_col instead of astype(str) in Styler.to_latex? Then, the styles are applied to the otherwise readily formatted data. This would be a minimal change to the existing code and the development branch.

toobaz commented 3 years ago

Would it be viable to use DataFrameFormatter.format_col instead of astype(str) in Styler.to_latex?

Might be... I don't remember the internals well enough to judge, but it's relatively easy to check: if you use Styler.to_latex on a Styler without any formatting applied, then the output should be 100% identical to DataFrame.to_latex called on the original DataFrame.

moi90 commented 3 years ago

Ok, there we would have the first test case 👍 Any other invariants that should be checked?

toobaz commented 3 years ago

Ok, there we would have the first test case +1 Any other invariants that should be checked?

That's one test case per DataFrame you run it on ;-) Every test in the test suite that involves DataFrame.to_latex() is a possible test of Styler.to_latex().
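
As a rough illustration of that invariant, the sketch below uses two hypothetical stand-ins (`plain_to_latex` for DataFrame.to_latex and `styled_to_latex` for Styler.to_latex) working on plain lists rather than real pandas objects. The only thing it is meant to show is the shape of the check: with no styles, both paths must agree exactly.

```python
from typing import Dict, List, Tuple

Table = List[List[object]]
Styles = Dict[Tuple[int, int], str]  # (row, col) -> LaTeX command name, e.g. "textbf"


def plain_to_latex(table: Table) -> str:
    """Hypothetical stand-in for the unstyled path: one line per row."""
    return "\n".join(" & ".join(str(v) for v in row) + r" \\" for row in table)


def styled_to_latex(table: Table, styles: Styles) -> str:
    """Hypothetical stand-in for the styled path: wraps styled cells only."""
    lines = []
    for i, row in enumerate(table):
        cells = []
        for j, v in enumerate(row):
            text = str(v)
            if (i, j) in styles:
                text = "\\" + styles[(i, j)] + "{" + text + "}"
            cells.append(text)
        lines.append(" & ".join(cells) + r" \\")
    return "\n".join(lines)


def check_unstyled_invariant(tables: List[Table]) -> None:
    """With no styles applied, both code paths must produce identical output."""
    for table in tables:
        assert styled_to_latex(table, {}) == plain_to_latex(table)
```

In the real test suite the list of tables would be every DataFrame already used in the DataFrame.to_latex tests.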

moi90 commented 3 years ago

Ok, I have a first success (please see https://github.com/moi90/pandas/pull/1). I changed test_to_latex so that each test runs with to_latex and style.to_latex. The next problem that I stumbled upon is that DataFrame.to_latex now thinks that all columns are string columns and aligns them left.

moi90 commented 3 years ago

(If you want, I can create a WIP pull request if you think that this is a better place to discuss this.)

moi90 commented 3 years ago

I fixed further tests by instantiating DataFrameFormatter and LatexFormatter inside Styler.to_latex.

It seems that it might make sense that DataFrame.to_latex calls DataFrame.style.to_latex internally. What do you think? (Although Styler would need some refactoring to do the HTML templating stuff only in the HTML case.)

attack68 commented 3 years ago

@moi90 I would take out the to_html addition. It complicates the PR and, in any case, I think we might want to replace the HTML formatter since Styler has its own (see https://github.com/pandas-dev/pandas/issues/11700). I will propose an alternate PR for that feature and see if it gains traction.

moi90 commented 3 years ago

@attack68 Yes, absolutely. (This work was started by @toobaz , not by me.)

I think Styler should become more general purpose and separate generic from HTML stuff.

moi90 commented 3 years ago

Also, HTML and LaTex are similar enough that they should be handled jointly (up to some point). Therefore, I propose not to proceed further (with #21673 AND #11700) until we have found and agreed upon a common solution for both.

I propose a common intermediate format that already has the abstract structure of a LaTex or HTML table but not yet the concrete styling (apart from number formatting). It would be based on what is currently produced by Styler._translate with one important difference: It includes both style and content of every cell. Also, it includes information about the columns (numeric|str and index|regular) (because in LaTex, these are aligned differently).

This intermediate format is then consumed by LaTex and HTML generators that translate it into actual LaTex and HTML code (with all their specialties like captions, labels, different styles and whatnot).
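
One possible shape for such an intermediate format is sketched below. Every name here (`Cell`, `Column`, `Table`, `render_latex`, `render_html`) is hypothetical, and the single style rule handled (bold) is only there to show that both generators read the same structure.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List


@dataclass
class Cell:
    text: str  # already number-formatted display value
    style: Dict[str, str] = field(default_factory=dict)  # e.g. {"font-weight": "bold"}


@dataclass
class Column:
    name: str
    numeric: bool = False  # numeric columns are right-aligned in LaTeX
    index: bool = False    # index column vs regular column


@dataclass
class Table:
    columns: List[Column]
    rows: List[List[Cell]]


def render_latex(table: Table) -> str:
    """LaTeX generator: alignment comes from column info, markup from cell style."""
    spec = "".join("r" if c.numeric else "l" for c in table.columns)
    lines = [r"\begin{tabular}{" + spec + "}"]
    for row in table.rows:
        cells = [r"\textbf{" + c.text + "}" if c.style.get("font-weight") == "bold" else c.text
                 for c in row]
        lines.append(" & ".join(cells) + r" \\")
    lines.append(r"\end{tabular}")
    return "\n".join(lines)


def render_html(table: Table) -> str:
    """HTML generator: the same cell style is emitted as inline CSS."""
    out = ["<table>"]
    for row in table.rows:
        tds = []
        for c in row:
            css = ";".join(f"{k}:{v}" for k, v in c.style.items())
            attr = f' style="{escape(css)}"' if css else ""
            tds.append(f"<td{attr}>{escape(c.text)}</td>")
        out.append("<tr>" + "".join(tds) + "</tr>")
    out.append("</table>")
    return "\n".join(out)
```

Captions, labels and header rows are left out on purpose; they would be extra fields on the hypothetical `Table`.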

What do you think? Who should be involved to make a decision here?

toobaz commented 3 years ago

Who should be involved to make a decision here?

I'm afraid nobody will just say "sure, just go ahead" on a general refactoring of a lot of existing code (which currently works) before actually seeing what it looks like. Your plan might look very clear to you but it's not to a general reader. Feel free to try to put it in practice.

In particular

a common intermediate format that already has the abstract structure of a LaTex or HTML table but not yet the concrete styling (apart from number formatting). It would be based on what is currently produced by Styler._translate with one important difference: It includes both style and content of every cell.

If I understand correctly, your "intermediate format" knows that a cell has, say, {"background-color" : '#000000'}. Isn't this format more or less the definition of a Styler?!

Let me reiterate: the problem of my approach is not that it duplicates code between LaTeX and HTML: it is that it badly duplicates/tortures the LaTeX code. I'm not aware of a lot of code duplication between LaTeX and HTML (neither in Styler, nor in formatters for ordinary DataFrames).

By the way, I'd be happy to take a look at your attempt but it's difficult because of a bunch of formatting changes. Anyway, the problem you describe,

DataFrame.to_latex now thinks that all columns are string columns and aligns them left.

is precisely the problem I described since the beginning: my approach converts everything to str and cannot work.

toobaz commented 3 years ago

It seems that it might make sense that DataFrame.to_latex calls DataFrame.style.to_latex internally. What do you think?

This is an interesting idea in principle, but I think not making DataFrame.to_latex significantly slower as a result is a priority. In other words, all actual styling code should be avoided when a plain DataFrame is formatted. Ideally, I would rather think of a class that formats the DataFrame with hooks to insert formatting, and a subclass that formats the Styler by exploiting the hooks (see https://github.com/pandas-dev/pandas/issues/21673#issuecomment-792714009 ).
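
A toy version of the "class with hooks, subclass that uses them" idea might look like this. The class names and the `style_cell` hook are hypothetical and have nothing to do with the existing LatexFormatter hierarchy; the sketch only shows that the plain path pays for nothing beyond a no-op method call.

```python
from typing import Dict, List, Optional, Tuple


class TableFormatter:
    """Hypothetical plain formatter: it never touches styling code."""

    def __init__(self, rows: List[List[object]]) -> None:
        self.rows = rows

    def style_cell(self, text: str, row: int, col: int) -> str:
        # Hook: the base class returns the text unchanged.
        return text

    def to_latex(self) -> str:
        lines = []
        for i, row in enumerate(self.rows):
            cells = [self.style_cell(str(v), i, j) for j, v in enumerate(row)]
            lines.append(" & ".join(cells) + r" \\")
        return "\n".join(lines)


class StyledTableFormatter(TableFormatter):
    """Hypothetical styled formatter: overrides only the hook."""

    def __init__(self, rows: List[List[object]], styles: Dict[Tuple[int, int], str]) -> None:
        super().__init__(rows)
        self.styles = styles  # (row, col) -> LaTeX command name

    def style_cell(self, text: str, row: int, col: int) -> str:
        command: Optional[str] = self.styles.get((row, col))
        return text if command is None else "\\" + command + "{" + text + "}"
```

For example, `StyledTableFormatter([[1, 2]], {(0, 1): "textbf"}).to_latex()` wraps only the second cell, while `TableFormatter([[1, 2]]).to_latex()` goes through the same loop with the identity hook.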

moi90 commented 3 years ago

If I understand correctly, your "intermediate format" knows that a cell has, say, {"background-color" : '#000000'}. Isn't this format more or less the definition of a Styler?!

Yes, exactly. Could you reiterate, what Styler._translate does, exactly? I find it hard to follow and there is no documentation of the output. It could be that it already is what I had in mind. If it is, we only need something that generates LaTex from it.

the problem of my approach is not that it duplicates code between LaTeX and HTML

My guess: There is no duplicated code because the same problem was solved in two totally different ways.

By the way, I'd be happy to take a look at your attempt but it's difficult because of a bunch of formatting changes.

I'm sorry. I formatted with an older version of black and the newer version (that is requested in the Contributing Guide) does not undo the changes in formatting. I will try to undo them manually.

Ideally, I would rather think of a class that formats the DataFrame with hooks to insert formatting, and a subclass that formats the Styler by exploiting the hooks.

Simply subclassing LatexFormatter won't work, because a concrete builder is selected according to the requested properties (longtable, caption, label, position). Somehow, the cell styles calculated by Styler._translate have to be merged with the cell values formatted by DataFrameFormatter.get_strcols, right?

Currently, DataFrameFormatter.get_strcols is consumed in io.formats.latex.RowStringConverter, so this might be the appropriate place to also inject the cell styles. One could add a new optional styler: Styler argument to RowStringConverter (and to the concrete implementations of GenericTableBuilder, namely LongTableBuilder, RegularTableBuilder, and TabularBuilder, and therefore to LatexFormatter). Then, RowStringConverter.get_strrow can call a to-be-implemented Styler.latex_style_strrow that applies the calculated cell styles:

# pandas/io/formats/latex.py:82
def get_strrow(self, row_num: int) -> str:
  ...
  crow = self._preprocess_row(row)
  if self.styler:
    crow = self.styler.latex_style_strrow(crow, row_num)
  ...

What do you think of that? Maybe I can start a new approach in this direction.

toobaz commented 3 years ago

Could you reiterate, what Styler._translate does, exactly?

Again, if I knew well the internals of LaTeX formatting, maybe my own PR wouldn't have sucked - unfortunately, it does. Plus, I didn't look at those internals in many months.

But I just recalled that a Styler is able to export itself to excel, and that it does by relying on the standard ExcelFormatter. Maybe there's a good model to follow there, also in terms of code organization.

attack68 commented 3 years ago

I know only the bare minimum of latex tables, especially when it comes to formatting, but Styler works by building a dict object, d, in _translate containing info on what goes in which cell, and then what are the CSS styles that need to be added. This dict is passed to a jinja2 template html.tpl which constructs the HTML from its values.

I see no reason why jinja2 could not be used to build a latex format, rather than an html format, although the _translate method currently contains code to optimise efficiency by leveraging the permissible structure of CSS language. I doubt this could be replicated for latex.

In terms of latex formatting 'color: red; border: 1px solid black;' in CSS language is mapped to [('color', 'red'), ('border', '1px solid black')]. I suspect that the same code-syntax can be used to generate latex attribute-value pairs that can be parsed in the template.
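
The mapping from a CSS string to attribute-value pairs described above can be sketched as follows. `parse_css` is a hypothetical helper written for this illustration, not the function Styler uses internally.

```python
from typing import List, Tuple


def parse_css(css: str) -> List[Tuple[str, str]]:
    """Split 'attr: value; attr: value;' into a list of (attr, value) tuples."""
    pairs = []
    for declaration in css.split(";"):
        if not declaration.strip():
            continue  # tolerate a trailing ';' and empty declarations
        attr, sep, value = declaration.partition(":")
        if not sep:
            raise ValueError(f"expected 'attr: value', got {declaration!r}")
        pairs.append((attr.strip(), value.strip()))
    return pairs
```

Because only the first colon splits a declaration, a LaTeX-flavoured string such as 'cellcolor:[rgb]{1,0,1};emph:;' parses into pairs in the same way as ordinary CSS.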

If you can provide the latex knowledge I can probably help guide with an integration to Styler.

moi90 commented 3 years ago

Styler is able to export itself to excel, and that it does by relying on the standard ExcelFormatter. Maybe there's a good model to follow there, also in terms of code organization.

That would be a nice thing, but the Styler.to_excel is an exact copy of NDFrame.to_excel.

@attack68 Yes, I too think, that both HTML and LaTex can (and probably should) be handled in a similar way. LaTex has no concept of applying CSS styles directly, but this is no problem. In fact @toobaz already wrote something that should be able to generate a nested sequence of appropriate formatting commands for each cell. The "only" question left is how to properly wire all this together without duplicating code.

I am not enthusiastic about requiring Jinja to render the template (for both HTML and LaTex, or anything else). Apart from that, there are other formats (like Excel) that can not (realistically) be built using a templating engine.

Therefore, I think that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output (like I described above).

toobaz commented 3 years ago

That would be a nice thing, but the Styler.to_excel is an exact copy of NDFrame.to_excel.

... in the sense that both are just wrappers for ExcelFormatter.write()... which seems just an ideal model to follow?

But I'm not necessarily against the alternative of relying on jinja, I just don't know enough about it.

attack68 commented 3 years ago

The HTMLFormatter produces deprecated HTML and is slower than Styler with its jinja2 template; it also has far less functionality and can be pretty much replicated with Styler. The jinja2 language allows users full flexibility to subclass and easily edit the core structure, which is a nice feature. I have a WIP PR published that suggests replacing HTMLFormatter with Styler and shows this.

I agree fully on Excel. I am not looking to replace the ExcelFormatter with jinja2 (I have no knowledge of excel file construction so don't know whether this is even doable or preferable)

I think latex falls somewhere in between and I don't quite know where yet. :)

Are you not enthusiastic from a perspective of having limited experience with jinja2 or the reason of thinking it is not flexible/clear enough?

Perhaps I can convince you... here is a c. 10-line MVP latex solution for templating within Styler: https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp

>>> pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"]).style.render(latex=True)
\begin{table}
\centering
\begin{tabular}{llll}
 & A & B & C \\
\hline
0 & 1 & 2 & 3 \\
1 & 4 & 5 & 6 \\
\end{tabular}
\end{table}

attack68 commented 3 years ago

Actually I added a commit to my version slightly to show a possible re-use of Styler.set_table_styles and set_caption:

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"])
df = df.style
df = df.set_table_styles([{'selector': 'position', 'props': ':h'},
                          {'selector': 'float', 'props': ':raggedleft'},
                          {'selector': 'label', 'props': 'fig:mylabel'}])
df = df.set_caption("My caption")
df = df.render(latex=True)

\begin{table}[h]
\raggedleft
\begin{tabular}{llll}
 & A & B & C \\
\hline
0 & 1 & 2 & 3 \\
1 & 4 & 5 & 6 \\
\end{tabular}
\caption{My caption}
\label{fig:mylabel}
\end{table}

moi90 commented 3 years ago

Are you not enthusiastic from a perspective of having limited experience with jinja2 or the reason of thinking it is not flexible/clear enough?

I have used jinja2 and I'm convinced that it is able to produce latex code. However, I would only use it for LaTeX if the LatexFormatter will completely go away in turn, in order to avoid two implementations of (mostly) the same functionality. Moreover, there are Formatters that just can not be replaced by Jinja2 (like Excel) and therefore need their own mechanism of applying styles. So there has to be a way of injecting styles into Formatters anyway. It just does not make sense for me to maintain two entirely different approaches to apply styles to things.

Regarding your example, @attack68: The current DataFrame.to_latex has a whole bunch of parameters that affect the output (e.g. if caption, label or position are supplied, the tabular is wrapped in a table, like in your example, if not then not). How do you take these cases into account? Can you make sure that the columns are aligned as expected (text left, numbers right)? How does a user supply their own column_format?

attack68 commented 3 years ago

So there has to be a way of injecting styles into Formatters anyway.

I think when Styler was conceived the HTMLFormatter existed. Therefore there must have been some reason/decision to implement conditional formatting standalone outside of that Formatter, and I suspect it is because the Formatter is much more rigid and difficult to maintain, or not well suited to this task. There may be a way but history suggests it was not obvious to previous developers.

in order to avoid two implementations of (mostly) the same functionality

We already have two implementations to generate HTML. I do not believe that a suggestion to implement conditional styling into HTMLFormatter would gain any traction at all, since Styler has now been around for years. So, assuming that it is not going away, tacking latex onto the Styler functionality does not require much maintenance, whereas building the entire conditional cell formatting into a Formatter does, I think.

The current DataFrame.to_latex has a whole bunch of parameters that affect the output (e.g. if caption, label or position are supplied, the tabular is wrapped in a table,

These cases were taken into account via this commit in my branch.

How does a user supply their own column_format?

These two codes are now equivalent:

df.style.to_latex(caption='cap', label='fig:label', position='h!', column_format='lrc')
df.style.set_caption('cap').set_table_styles([
    {'selector': 'position', 'props': ':h!'},
    {'selector': 'label', 'props': 'fig:label'},
    {'selector': 'column_format', 'props': ':lrc'}
]).to_latex()

Internally the first is just translated to the second, which leverages the core Styler functionality.

Styler has its own format() function, so the other options (formatters, decimal, float_format and na_rep) would all be substituted by that. Incidentally, calling df.style.format() first lets you visualise the impact of the formatting function before needing to render latex, which I think is useful.

I made a modification to my template and it now has a syntax for parsing the usual CSS and converting it to latex format, so:

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["A", "B", "C"])
s = df.style
s.highlight_max(axis=1, props='cellcolor:[rgb]{1,0,1};emph:;Huge:-wrap-;')
s.highlight_max(axis=None, props='textcolor:{yellow};')
s.highlight_min(axis=0, props='textbf:;Large:-wrap-;')
s.to_latex(hrules=True, column_format='lrcr')

\begin{tabular}{lrcr}
\toprule
 & A & B & C \\
\midrule
0 & \textbf{{\Large 1}} & \textbf{{\Large 2}} & \cellcolor[rgb]{1,0,1}{\emph{{\Huge \textbf{{\Large 3}}}}} \\
1 & 4 & 5 & \cellcolor[rgb]{1,0,1}{\emph{{\Huge \textcolor{yellow}{6}}}} \\
\bottomrule
\end{tabular}


moi90 commented 3 years ago

@attack68 OK, I understand your points.

I made a modification to my template and it now has a syntax for parsing the usual CSS and converting it to latex format

This looks really great! However, I'm not certain that your approach of using custom LaTex props is the best one. I would rather translate the CSS properties that are valid for HTML to proper LaTex markup commands (like @toobaz already did), e.g. font-size: Huge => {\Huge <text>}. This way you wouldn't need different props for HTML and LaTex. I'd be happy to contribute something like that. But I also acknowledge the flexibility in your approach to use any LaTex command. Maybe we could give the user the opportunity to (re)define styles?

attack68 commented 3 years ago

However, I'm not certain that your approach of using custom LaTex props is the best one. I would rather translate the CSS properties that are valid for HTML to proper LaTex markup commands (like @toobaz already did), e.g. font-size: Huge => {\Huge <text>}. This way you wouldn't need different props for HTML and LaTex. I'd be happy to contribute something like that. But I also acknowledge the flexibility in your approach to use any LaTex command. Maybe we could give the user the opportunity to (re)define styles?

So this is what I did at first actually just following the outline, but I quickly realised that:

1) Depending upon which latex package you use there might be different commands, i.e. multiple translations from CSS to latex. E.g. font-style: italic or oblique has at least 3 variants in latex: \textit, \textsl, \emph, and I think this gets worse for colors.

2) It was more restrictive since to get anything to work in latex it first had to have a defined CSS translation rule for it.

3) It could be possible there is something in latex that does not translate directly from CSS, e.g. custom properties, or even if you define your own latex commands and want to insert the commands into cell formatting. (the equivalent of adding external CSS classes)

If you adopt the format I have provided it is easy to add in a patch that does what you want, i.e.:

def _parse_latex_cell_styles(styles: CSSList, display_value: str) -> str:
    styles = _parse_css_to_latex(styles)  #  <-- here is your patch function
    for style in styles[::-1]:  # in reverse for most recently applied style

where your function needs to convert say [('background-color', 'red')] to [('cellcolor', '{red}')], before the program continues.

Edit: I would rather leave this out for a basic PR, and maybe add into the functionality afterwards as a separate component,

moi90 commented 3 years ago

For the defaults (bold, italics, colored cell) it is pretty straightforward (italics is textit; textsl is slanted; emph is semantic markup that's commonly rendered as italic, so no problem here). Moreover, some commands always have to be applied in a certain order. I'm all for the possibility of using custom LaTex commands, but I don't think this is the right way.

Maybe @toobaz can give his opinion about this?

moi90 commented 3 years ago

Edit: I would rather leave this out for a basic PR, and maybe add into the functionality afterwards as a separate component

This could be a good idea. However, if it is released this way, some solutions are infeasible in the future, if we don't want to break the API.

Would it be sensible to prefix latex-specific CSS attributes with a "vendor prefix", like -latex-Huge? This would also prevent name collisions between "real" CSS attributes and LaTex ones.
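
The vendor-prefix idea could work roughly like this. `split_props` and the `-latex-` constant are hypothetical; the sketch only separates the two kinds of attributes so that each renderer sees its own.

```python
from typing import List, Tuple

Props = List[Tuple[str, str]]
LATEX_PREFIX = "-latex-"


def split_props(props: Props) -> Tuple[Props, Props]:
    """Return (css_props, latex_props); the prefix is stripped from the latex ones."""
    css: Props = []
    latex: Props = []
    for attr, value in props:
        if attr.startswith(LATEX_PREFIX):
            latex.append((attr[len(LATEX_PREFIX):], value))
        else:
            css.append((attr, value))
    return css, latex
```

An HTML renderer would then use only the first list, and a LaTeX renderer the second (possibly together with a translation of the first).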

Also, there has to be a better way to define how the markup is applied. Different from your example, cellcolor is often used like this: \cellcolor[rgb]{1,0,1} value or {\cellcolor[rgb]{1,0,1} value} or even {\cellcolor[rgb]{1,0,1}} value, which is required for siunitx: https://tex.stackexchange.com/a/436148.

toobaz commented 3 years ago

Maybe @toobaz can give his opinion about this?

I would love to! But I'm confused. My understanding is that you are discussing whether the API should accept formatting in the format of css or of (arbitrary) LaTeX command (and I think I favor the former, and consider the latter a welcome extension). But then

I made a modification to my template and it now has a syntax for parsing the usual CSS and converting it to latex format, so:

... suggests that @attack68's current code already accepts both?

attack68 commented 3 years ago

My code currently uses latex structured in Styler's CSS format, i.e. an ('attr', 'value') tuple, basically forming a ('command', 'options') latex pattern. It currently expects that all input is in latex format. However, if it were given in CSS, an optional converter could then convert this to the latex format for post-processing.

This could be a good idea. However, if it is released this way, some solutions are infeasible in the future, if we don't want to break the API.

This is a long way from release, but what solutions are infeasible? I believe that using a CSS map to Latex commands is far more restrictive and prohibitive, than what I have offered.

For example,

Different from your example, cellcolor is often used like this: \cellcolor[rgb]{1,0,1} value or {\cellcolor[rgb]{1,0,1} value} or even {\cellcolor[rgb]{1,0,1}} value, which is required for siunitx:

I have provided two variants:

('cellcolor', '[rgb]{1,0,1}') maps to '\cellcolor[rgb]{1,0,1}{display_value}'
('cellcolor', '[rgb]{1,0,1}-wrap-') maps to '{\cellcolor[rgb]{1,0,1} display_value}'
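
These two variants can be reproduced with a short function. This is a reconstruction from the examples in this thread, not the code in the branch; `apply_latex_styles` is a hypothetical name, and the reverse iteration is chosen so that the first style in the list ends up outermost.

```python
from typing import List, Tuple

WRAP = "-wrap-"


def apply_latex_styles(styles: List[Tuple[str, str]], display_value: str) -> str:
    """Nest ('command', 'options') pairs around a cell's display value."""
    for command, options in reversed(styles):
        if WRAP in options:
            # Wrapped form: {\command<options> value}
            options = options.replace(WRAP, "")
            display_value = "{\\" + command + options + " " + display_value + "}"
        else:
            # Default form: \command<options>{value}
            display_value = "\\" + command + options + "{" + display_value + "}"
    return display_value
```

Under these rules, the style list built up for the cell containing 3 in the earlier highlight_max / highlight_min example, [('cellcolor', '[rgb]{1,0,1}'), ('emph', ''), ('Huge', '-wrap-'), ('textbf', ''), ('Large', '-wrap-')], yields the same nested markup as in that output.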

More variants are easy to add in but at the risk of creating a parsing language.

@moi90 how would you be accounting for these variants without introducing your own parsing language in a pure CSS transformation solution?

attack68 commented 3 years ago

Previous discussion about this considered making two separate classes: html-Styler and a latex-Styler. In the case of the latter, you would not expect a latex-Styler to process CSS language, more so I would think you would expect the user to be inputting latex styling language.

This PR is essentially creating a latex-Styler from the pre-existing mechanics and unit tests for html-Styler.

To provide a translator from an html-Styler (in CSS) to a latex-Styler is a fairly easy extension, once the above is approved.

To allow an html-Styler (in CSS and latex) to feed a latex-Styler probably requires tagging those latex styles with -latex-, as highlighted by @moi90, so that the above translator can add more functionality.

But this last step might be considered too esoteric and not worth inclusion by developers? maybe not.