Open jreback opened 8 years ago
Some discussion related to this was going on in https://github.com/pandas-dev/pandas/pull/14975#issuecomment-269956133. Summarizing some elements here:
Barriers: some missing features are needed before such a replacement is possible (see also some elements in https://github.com/pandas-dev/pandas/issues/11610)
Advantages:
HTMLFormatter, possibly other formatters) -> converging to one formatting system
Disadvantages:
df.style.render(): 19.6 s vs df.to_html(): 2.7 s
cc @TomAugspurger
For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods? For example, I can imagine that leaving out all the id=.. (which are not needed for basic display I think?) can improve perf / simplify things.
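For anyone wanting to re-check the timing gap mentioned above, a minimal sketch of a comparison helper follows. The function name time_html_renderers and the repeat parameter are my own, not part of pandas; it assumes the object passed in exposes to_html() and style.to_html() (the original numbers were measured with df.style.render()).

```python
import time


def time_html_renderers(df, repeat=3):
    """Return best-of-`repeat` wall-clock seconds for each HTML rendering path.

    `df` is assumed to be a pandas DataFrame, or anything else that exposes
    `to_html()` and `style.to_html()`.
    """

    def best_of(render):
        timings = []
        for _ in range(repeat):
            start = time.perf_counter()
            render()
            timings.append(time.perf_counter() - start)
        return min(timings)

    return {
        "DataFrame.to_html": best_of(df.to_html),
        # Accessing df.style inside the lambda means Styler construction
        # is included in the measured time.
        "Styler.to_html": best_of(lambda: df.style.to_html()),
    }
```

Calling time_html_renderers(df) on a reasonably large frame should give a rough idea of the ratio; absolute numbers will depend on the machine and the pandas version.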
For basic html output / notebook repr, it would maybe be useful to have a base class that has a simpler template and does not support all the different customization methods?
100% agree with your comments here. This wouldn't really be implementing df.to_html using .style.
Instead we'd have a common Jinja2 template that would handle the logic of iterating over rows, inserting tags. Then .to_html() and .style would extend that base template. .to_html probably wouldn't change much from the base really.
Also, Jinja depends on MarkupSafe, so that becomes another dependency.
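To make the "common base that both extend" idea a bit more concrete, here is a rough sketch in plain Python. It is not how pandas is implemented: the class names and the cell_attrs hook are hypothetical, and overridable methods stand in for what would be blocks in a Jinja2 base template.

```python
from html import escape


class BaseTableRenderer:
    """Shared row/cell iteration; subclasses only override the hooks."""

    def cell_attrs(self, row_idx, col_idx, value):
        # Hook: extra attributes for a <td>. The base emits none.
        return ""

    def render(self, header, rows):
        parts = ["<table>", "<thead><tr>"]
        parts += [f"<th>{escape(str(h))}</th>" for h in header]
        parts.append("</tr></thead>")
        parts.append("<tbody>")
        for i, row in enumerate(rows):
            parts.append("<tr>")
            for j, value in enumerate(row):
                attrs = self.cell_attrs(i, j, value)
                parts.append(f"<td{attrs}>{escape(str(value))}</td>")
            parts.append("</tr>")
        parts += ["</tbody>", "</table>"]
        return "".join(parts)


class PlainRenderer(BaseTableRenderer):
    """Stand-in for a basic to_html / notebook repr: no ids, no styles."""


class StyledRenderer(BaseTableRenderer):
    """Stand-in for a Styler-like renderer that adds per-cell ids and CSS."""

    def __init__(self, styles=None):
        # styles maps (row_idx, col_idx) -> CSS declaration string
        self.styles = styles or {}

    def cell_attrs(self, row_idx, col_idx, value):
        attrs = f' id="T_row{row_idx}_col{col_idx}"'
        css = self.styles.get((row_idx, col_idx))
        if css:
            attrs += f' style="{escape(css, quote=True)}"'
        return attrs
```

The plain renderer never pays for ids or style lookups, which is the perf / simplicity point raised earlier, while the styled one reuses the same iteration logic.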
Was there ever any progression on these ideas?
FYI the performance disadvantage above is much improved since 2017: instead of 19.6s vs 2.7s, I now get about 3.9s versus 1.9s.
Also note #39951
I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything: As I said in #21673, there are other formats (like Excel) that can not (realistically) be built using a templating engine.
Also, I am not enthusiastic about making Jinja a hard dependency to render templates (for both HTML and LaTeX, or anything else).
EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output (like I described in #21673).
EDIT: My idea is that the various (styleable) *Formatters (HTMLFormatter, NotebookFormatter, ExcelFormatter, ...) should be extended to get the ability to optionally apply styles to their output
Isn't ExcelFormatter already used to do precisely this?
I don't agree with the advantage mentioned by @jorisvandenbossche: While I'm all for one convergent formatting system, a templating engine is not the solution. It just does not work for everything:
I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.
jinja2 is a go-to for generating HTML in Python thanks to packages like flask and Django, so if you are rendering HTML tables from pandas it is a logical combination. It also gives users additional template extension flexibility that HTMLFormatter cannot.
Since jinja2 is a dependency of Styler, and if we assume that is not going away, then any Styler.to_latex method would have jinja2 available to it. Some initial work suggests this is quite easy to incorporate, or at least that it can replicate the existing DataFrame.to_latex() functionality, without having, imo, the horrible subclassing of Formatters. https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp
I'm conflicted. On one hand, it's nice to remove code. On the other, I'm not sure of how much code we would really save in exchange for a "stronger" dependency on jinja2. In #40344, you say that some of the arguments of to_html() (e.g. min_rows) are pointless because they are "related to console display"... but if the idea is that DataFrame.to_html() and Styler.to_html() are formatted with templates but not DataFrame._repr_html_(), then we are not really gaining much - we still need internal code to produce html for console display, right? And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.
The possibility to export to other formats via jinja2 is also something potentially interesting but to be better investigated. While your attempt in https://github.com/pandas-dev/pandas/compare/master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.
I would be happy to be proven wrong though. How difficult would it be, in https://github.com/pandas-dev/pandas/pull/40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?
I don't believe the objective here is to have one convergent system for everything, rather this post is about having one convergent formatting system for to_html, as opposed to Styler with jinja2 and DataFrame.to_html with HTMLFormatter.
You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.
You're right if it is certain that HTMLFormatter can be completely removed. Is that the case? It seems not, guessing from @toobaz' comment.
@moi90 If the goal is to replicate all of the functionality from DataFrame.to_html() then yes, it can be done, and a lot has already been done in my wip pr. Not all though, because I wanted to raise the issue of blindly replicating a function which in some cases produces deprecated HTML, and instead consider the merits of making some changes, perhaps with a view to pandas 2.0.
While your attempt in master...attack68:latex_styler_mvp is cool, I suspect the complexity will increase quite a bit once we start supporting formatting (which won't use stuff like css), to the point that what jinja2 actually delivers is only a small part of the task of formatting to LaTeX.
@toobaz I progressed the MVP to a state where it now has a lot of general conditional styling capability for latex tables. See my response here. I still want to be able to add some table-level styles like column colouring or odd/even colouring, but these are quite easy extensions.
I would be happy to be proven wrong though. How difficult would it be, in #40312, to run the test suite with DataFrame.to_html() replaced with the jinja2 implementation, just to see what breaks?
Quite easy, just need to redirect the method, when I push it I will ping you to take a look at test results.
And by the way, the fact that Styler._repr_html_() does not truncate data like DataFrame._repr_html_() does should probably be considered a bug.
Actually I think the opposite. The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto scrolling feature. I find it a real nuisance when pandas truncates my dataframes, so I always revert to the default df.style display because it shows everything. If you want to view a dataframe in a console don't use an html representation, no?
The docstring for _repr_html_ states it is mainly for IPython / Jupyter, which has its own auto scrolling feature.
Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.
If you want to view a dataframe in a console don't use an html representation, no?
Sure, the point is indeed about notebooks.
Sure, but passing the notebook a table with millions of rows will just make it crash, whether or not you scroll. We can discuss the optimal number of rows to show (notice that you can easily customize it), but I'm afraid "no limit" is not an option.
Does pandas set a limit on the size of a DataFrame you can construct, or is its limit just naturally determined by system constraints? The same logic could be argued here, albeit one is inside native python and the other is rendering in an external application like Jupyter in a browser (so the error might not be as obvious).
I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows. To be honest that's the largest I've seen, so even if I'm not convinced a limit is necessary, I think having one above that would not have affected any use case I have seen so far - and from memory that only took seconds to render, so I would be happy with that.
I have seen multiple use cases of wanting to visualise large tables: one is here, with the other up to 20,000 rows.
I regularly use tables with a couple of million rows inside Jupyter and it's great to see them easily. I would hate to crash my notebook every time I view them without thinking about truncating them. I'm sure many people use pandas with much larger databases. Again, I think deprecating the truncated visualization is not an option. I might be wrong on the need to truncate Styler too, however, so we can leave that option out of this discussion.
Indeed, removing truncation from the default html repr is currently not an option I think (unless we would use a more advanced widget that e.g. does that automatically, but that's another discussion). There are already settings to change the number of rows to show, if you want to change this as a user.
So if we want to replace the to_html / _repr_html_ with Styler, the truncation functionality will need to be added to Styler (although I don't think that Styler needs to do that by default).
OK seems well supported, adding this to the list of things needed.
This wasn't really closed by #40312, which only added a Styler.to_html, and didn't implement the main to_html in terms of Styler.
In #45382 I'm proposing changing the signature of DataFrame.to_latex to:
DataFrame.to_latex(hide, format, format_index, render_kwargs)
and this will perform the following:
DataFrame.style.hide(**hide).format(**format).format_index(**format_index).to_latex(**render_kwargs)
This has the advantage of:
DataFrame.to_latex since it passes the kwargs through
Is this reasonable and would it be appropriate to aim for something similar with to_html for v2.0?
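The pass-through described above could be sketched as a free function like the one below. The name to_latex_via_styler is hypothetical, and skipping a step when its argument is None is my own choice rather than part of the proposal (as far as I know, calling Styler.hide() with no arguments hides the whole index, which is probably not wanted as a default).

```python
def to_latex_via_styler(df, hide=None, format=None, format_index=None,
                        render_kwargs=None):
    """Sketch of the proposed DataFrame.to_latex signature as a free function.

    Each argument is a dict of keyword arguments forwarded to the Styler
    method of the same name; None means that step is skipped.
    (`format` shadows the builtin here only to mirror the proposed name.)
    """
    styler = df.style
    # Only call a step when the caller asked for it, so defaults are untouched.
    if hide is not None:
        styler = styler.hide(**hide)
    if format is not None:
        styler = styler.format(**format)
    if format_index is not None:
        styler = styler.format_index(**format_index)
    return styler.to_latex(**(render_kwargs or {}))
```

With a pandas version whose Styler provides hide, format, format_index and to_latex, a call such as to_latex_via_styler(df, format={"precision": 2}) would then be equivalent to df.style.format(precision=2).to_latex().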
Implement to_html / notebook repr based on .style. prob need to expand this to take a use argument (to select the style, needs to be 'classic' for a while, to replicate the current .to_html one).