pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.25k stars 17.79k forks

CLN/API: implemented to_html in terms of .style #11700

Open jreback opened 8 years ago

jreback commented 8 years ago

Implement to_html / notebook repr based on .style.

Probably need to expand this to take a use argument to select the style; it needs to be 'classic' for a while, to replicate the current .to_html one.

jorisvandenbossche commented 7 years ago

Some discussion related to this was going on in https://github.com/pandas-dev/pandas/pull/14975#issuecomment-269956133. Summarizing some elements here:

Barriers: some missing features are needed before such a replacement is possible (see also some elements in https://github.com/pandas-dev/pandas/issues/11610)

Advantages:

Disadvantages:

cc @TomAugspurger For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods? For example, I can imagine that leaving out all the id=.. (which are not needed for basic display I think?) can improve perf / simplify things.

TomAugspurger commented 7 years ago

For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods?

100% agree with your comments here. This wouldn't really be implementing df.to_html using .style. Instead we'd have a common Jinja2 template that would handle the logic of iterating over rows, inserting tags. Then .to_html() and .style would extend that base template. .to_html probably wouldn't change much from the base really.
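A rough sketch of what such a shared base template could look like, using Jinja2 template inheritance. The template names, block names and the uuid variable are all hypothetical and are not taken from the pandas code base; this only illustrates the "base template that both extend" idea.

```python
from jinja2 import DictLoader, Environment

# Hypothetical templates: "base.tpl" only knows how to iterate over rows
# and emit tags; "styled.tpl" extends it by filling in the empty blocks.
templates = {
    "base.tpl": (
        "<table{% block table_attrs %}{% endblock %}>\n"
        "{% for row in rows %}"
        "<tr>{% for cell in row %}"
        "<td{% block cell_attrs %}{% endblock %}>{{ cell }}</td>"
        "{% endfor %}</tr>\n"
        "{% endfor %}"
        "</table>"
    ),
    "styled.tpl": (
        '{% extends "base.tpl" %}'
        '{% block table_attrs %} id="T_{{ uuid }}"{% endblock %}'
        '{% block cell_attrs %} class="data"{% endblock %}'
    ),
}

env = Environment(loader=DictLoader(templates))
rows = [[1, 2], [3, 4]]

# A plain to_html-like rendering would use the base template almost as-is ...
basic = env.get_template("base.tpl").render(rows=rows)
# ... while a Styler-like rendering adds ids / classes through the blocks.
styled = env.get_template("styled.tpl").render(rows=rows, uuid="abc")

print(basic)
print(styled)
```

The base output carries no attributes at all, which is the "simpler template" case; everything Styler-specific lives in the child template.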

Also, Jinja depends on MarkupSafe, so that becomes another dependency.

attack68 commented 3 years ago

Was there ever any progression on these ideas?

FYI the performance disadvantage above has improved a lot since 2017: compared with the 19.6s vs 2.7s reported then, I now get about 3.9s versus 1.9s.

Also note #39951

moi90 commented 3 years ago

I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything: As I said in #21673, there are other formats (like Excel) that can not (realistically) be built using a templating engine.

Also, I am not enthusiastic about making Jinja a hard dependency to render templates (for both HTML and LaTeX, or anything else).

EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output (like I described in #21673).

toobaz commented 3 years ago

EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output

Isn't ExcelFormatter already used to do precisely this?

attack68 commented 3 years ago

I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything:

I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.

jinja2 is a go-to for generating HTML in Python, thanks to packages like Flask and Django, so if you are rendering HTML tables from pandas it is a logical combination. It also gives users additional flexibility through template extension, which HTMLFormatter cannot.

Since jinja2 is a dependency of Styler, and assuming that is not going away, any Styler.to_latex method would have jinja2 available to it. Some initial work suggests this is quite easy to incorporate, or at least that it can replicate the existing DataFrame.to_latex() functionality, without the (imo) horrible subclassing of Formatters. https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp

toobaz commented 3 years ago

I'm conflicted. On one hand, it's nice to remove code. On the other, I'm not sure how much code we would really save in exchange for a "stronger" dependency on jinja2. In #40344, you say that some of the arguments of to_html() (e.g. min_rows) are pointless because they are "related to console display"... but if the idea is that DataFrame.to_html() and Styler.to_html() are formatted with templates but not DataFrame._repr_html_(), then we are not really gaining much: we still need internal code to produce html for console display, right? And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.

The possibility to export to other formats via jinja2 is also potentially interesting, but needs further investigation. While your attempt in https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.

I would be happy to be proven wrong though. How difficult would it be, in https://github.com/pandas-dev/pandas/pull/40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?

moi90 commented 3 years ago

I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.

You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.

attack68 commented 3 years ago

You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.

@moi90 If the goal is to replicate all of the functionality from DataFrame.to_html() then yes, it can be done, and a lot has already been done in my WIP PR. Not all though, because I wanted to raise the issue of blindly replicating a function which in some cases produces deprecated HTML, and instead consider the merits of making some changes, perhaps with a view to pandas 2.0.

While your attempt in master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.

@toobaz I progressed the MVP to a state where it now has a lot of general conditional styling capability for latex tables. See my response here. I still want to be able to add some table level styles like column colouring or odd/even colouring, but these are quite easy extensions.

I would be happy to be proven wrong though. How difficult would it be, in #40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?

Quite easy, I just need to redirect the method. When I push it I will ping you to take a look at the test results.

attack68 commented 3 years ago

And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.

Actually I think the opposite. The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto scrolling feature. I find it a real nuisance when pandas truncates my dataframes, so I always revert to the default df.style display because it shows everything. If you want to view a dataframe in a console don't use an html representation, no?

toobaz commented 3 years ago

The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto scrolling feature.

Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.

If you want to view a dataframe in a console don't use an html representation, no?

Sure, the point is indeed about notebooks.

attack68 commented 3 years ago

Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.

Does pandas set a limit on the size of a DataFrame you can construct, or is its limit just naturally determined by system constraints? The same logic could be argued here, albeit one is inside native Python and the other is rendering in an external application like Jupyter in a browser (so the error might not be as obvious).

I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows. To be honest that's the largest I've seen, so even if I'm not convinced a limit is necessary, I think having one above that would not have affected any use case I have seen so far. From memory that only took seconds to render, so I would be happy with that.

toobaz commented 3 years ago

I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows.

I regularly use tables with a couple of million rows inside Jupyter and it's great to see them easily. I would hate to crash my notebook every time I view them without thinking about truncating them. I'm sure many people use pandas with much larger databases. Again, I think deprecating the truncated visualization is not an option. I might be wrong on the need to truncate Styler too, however, so we can leave that option out of this discussion.

jorisvandenbossche commented 3 years ago

Indeed, removing truncation from the default html repr is currently not an option I think (unless we would use a more advanced widget that eg does that automatically, but that's another discussion). There are already settings to change the number of rows to show, if you want to change this as a user.

So if we want to replace the to_html/_repr_html_ with Styler, the truncation functionality will need to be added to Styler (although I don't think that Styler needs to do that by default).
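For reference, the existing row-count settings mentioned above can be exercised like this. A minimal sketch with a made-up frame: display.max_rows and display.min_rows are real pandas options, while the specific values are arbitrary.

```python
import pandas as pd

# Made-up frame with more rows than the limit set below.
df = pd.DataFrame({"x": range(100)})

# The notebook repr honours the display options and truncates ...
with pd.option_context("display.max_rows", 10, "display.min_rows", 4):
    truncated = df._repr_html_()

# ... whereas to_html() has max_rows=None by default and emits every row.
full = df.to_html()

# Compare the number of <tr> elements in each output.
print(truncated.count("<tr"), full.count("<tr"))
```

Any Styler-based replacement of the default repr would need to reproduce the first behaviour, which is the gap discussed above.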

attack68 commented 3 years ago

OK seems well supported, adding this to the list of things needed.

jorisvandenbossche commented 3 years ago

This wasn't really closed by #40312, which only added a Styler.to_html, and didn't implement the main to_html in terms of Styler

attack68 commented 2 years ago

In #45382 I'm proposing changing the signature of DataFrame.to_latex to:

DataFrame.to_latex(hide, format, format_index, render_kwargs)

and this will perform the following:

DataFrame.style.hide(**hide).format(**format).format_index(**format_index).to_latex(**render_kwargs)

This has the advantage of:

Is this reasonable and would it be appropriate to aim for something similar with to_html for v2.0?