rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0

changing pandas dataframe display style in Rmarkdown #783

Open ofajardo opened 4 years ago

ofajardo commented 4 years ago

I would like to be able to change the display style of a pandas data frame. This code works in Jupyter, and it would be awesome to get it to work in R Markdown. Currently it displays an incomplete version of the HTML string instead of the nicely formatted HTML table. R Markdown file attached.

dframe.Rmd.zip

---
title: "rawtest"
output: html_document
---

```{r setup, include=FALSE}
library(reticulate)
knitr::opts_chunk$set(echo = TRUE, error=TRUE)
use_python('/opt/conda/bin/python')
```

Displaying a pandas data frame nicely

OK we have a complicated pandas data frame and we want to show it nicely. Passing it to R and using kable or something like that is not an option, because when passing a pandas dataframe with a multi-index to R those indexes will disappear. Let's start by displaying the dataframe:

import pandas as pd
ncols = 3
nrows = 3
row = list(range(1,ncols+1))
table = [row for x in range(nrows) ]
columns = [["","Overall"],["Transplant","false"],["Transplant","true"]]
rows = [["n", ""],["age","mean (SE)"],["age","median (IQR)"]]
custom_df = pd.DataFrame(table)
custom_df.columns = pd.MultiIndex.from_tuples(columns)
custom_df.index = pd.MultiIndex.from_tuples(rows)
custom_df.to_html()

OK not bad (what are those commas before and after the table btw?), but looks boring. Let's try to beautify with some CSS. OOPS, but the resulting html is not rendered, why?

# Let's apply some nice formatting to the dataframe

table_props = [('font-family', '"Arial", Arial, sans-serif;'),
                             ('font-size', '12pt;'),
                             ('border-collapse', 'collapse;'),
                             ('padding', '0px;'),
                             ('margin', '0px;'),
                             ('margin-bottom', '10pt;')]

tbody_props = [('background', '#fff')]

th_props = [
    ('border', '0;'),
    ('text-align', 'center;'),
    ('padding', '0.5ex 1.5ex;'),
    ('margin', '0px;')
    ]

# Set CSS properties for td elements in dataframe
td_props = [
    ('white-space', 'nowrap;'),
    ('border', '0;'),
    ('text-align', 'center;'),
    ('padding', '0.5ex 1.5ex;'),
    ('margin', '0px;')
    ]

tr_nthchild_props = [
    ('background', '#fff')
    ]

thead_first = [
    ('border-top', '2pt solid black;')
    ]

thead_last = [
    ('border-bottom', '1pt solid black;')
    ]

tr_last = [
    ('border-bottom', '2pt solid black;')
    ]

# Set table styles
styles = [
    dict(selector="table", props=table_props),
    dict(selector="tbody", props=tbody_props),
    dict(selector="th", props=th_props),
    dict(selector="td", props=td_props),
    dict(selector="tbody tr:nth-child(odd)", props=tr_nthchild_props),
    dict(selector="thead>tr:first-child>th", props=thead_first),
    dict(selector="thead>tr:last-child>th", props=thead_last),
    dict(selector="tbody>tr:last-child>td", props=tr_last),

    ]

styled = custom_df.style.set_table_styles(styles)
styled.render()
#styled_html = styled.render()
#styled_html.replace("</style><table", "</style>\n\n<!-- -->\n\n<table")
#styled_html

m-legrand commented 4 years ago

Even leaving aside the styling, there are two things I find interesting with this issue:

  1. (bug) results='asis' preserves the quotes of any Python output. This makes it unnecessarily complicated to create HTML or Markdown directly from Python. These are the "commas" @ofajardo is seeing.

  2. (feature request (involving knitr?)) For a lot of pandas.DataFrame output, either of the following would often be better than the raw printing:

    • {python, results='asis'} df = ...; df.to_html() (assuming 1. is corrected)
    • {python, results='asis'} df = ...; df.to_markdown() (assuming 1. is corrected)
    • {python} df = ... + {r} py$df (cleanest result when no multi-index)

    I could see this getting much cleaner and customizable through an option somewhere, e.g. pandas.df.output being something like "repr" (default), "html", "markdown" or "r".
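
The quoting described in point 1 can be reproduced without pandas. This is a minimal sketch, assuming the engine auto-prints the last expression of a chunk the way the Python REPL does (through `repr()`); the `html` string below is only a stand-in for what `df.to_html()` returns:

```python
import contextlib
import io

# Stand-in for the string returned by df.to_html() (assumption: any str behaves the same).
html = "<table><tr><td>1</td></tr></table>"

# Auto-printing a bare expression shows repr(value), which wraps a str in quotes.
auto_printed = repr(html)

# print() writes the characters themselves, with no surrounding quotes.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print(html)
printed = buffer.getvalue().rstrip("\n")

print(auto_printed)  # quoted form
print(printed)       # unquoted form
```

Under that assumption, ending a `results='asis'` chunk with `print(df.to_html())` rather than a bare `df.to_html()` should avoid the stray quotes.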

hathawayj commented 3 years ago

Did something change to align with this request? I can't get my pandas dataframes to just print output anymore in my markdown files. It always converts it to an HTML table unless I wrap a print() around it.

rleyvasal commented 3 years ago

It would be great if pandas data frames were shown nicely in R Markdown (R notebooks), the same as they appear in Jupyter notebooks (or better, with an indicator of the data type of each column). The only reason I don't use RStudio for Python is that I am not able to see the full data frames: they are not scrollable left and right. This simple feature is very important for data exploration.

linogaliana commented 3 years ago

Would it be possible to change the class of the pandas DataFrame returned from Python and have some printing methods adapted to it?

When we do

```{python, echo = FALSE}
df = pd.DataFrame(
    {'size': [1.,1.5,1],
    'weight' : [3, 5, 2.5]
    },
    index = ['cat', 'dog', 'koala']
)
```

We end up with an object of class data.frame

```{r}
class(py$df)
# [1] "data.frame"
```

With an additional class, let's say dataframe.pandas, this would probably be easier to add some printing methods (e.g. print.dataframe.pandas.default, print.dataframe.pandas.html, print.dataframe.pandas.markdown) that would mimic, at the R level, the behavior of df.to_html or df.to_markdown, which would give R Markdown users more control over the output.

kevinushey commented 1 year ago

If I understand correctly, this is an MRE:

---
title: "Pandas Printing"
author: "Kevin Ushey"
date: "`r Sys.Date()`"
output: html_document
---

```{r}
library(reticulate)
use_virtualenv("r-reticulate", required = TRUE)
py_install("pandas")
```

```{python}
import pandas as pd

data = {
  'size': [1., 1.5, 1],
  'weight': [3, 5, 2.5]
}

pd.DataFrame(data, index = ['cat', 'dog', 'koala'])
```

When this document is rendered via `rmarkdown::render()`, you see:

<img width="577" alt="Screen Shot 2022-12-07 at 9 46 07 AM" src="https://user-images.githubusercontent.com/1976582/206067356-a7bc028e-d482-401e-9188-554a1ef5d128.png">

and so you don't get the nice HTML rendering for the Pandas DataFrame you might've hoped for.

kevinushey commented 1 year ago

This is where Pandas DataFrames get handled by the reticulate Python engine:

https://github.com/rstudio/reticulate/blob/a1d7f7f573f652212bc2c72c39317340e6d8b511/R/knitr-engine.R#L578

Note that we don't do anything here; we just use the captured (default) print style for the DataFrame. We considered using the to_html() method in the past, but the rendered HTML is pretty bare-bones and ugly.

[Screen Shot 2022-12-07 at 9 52 31 AM]

I'm not exactly sure what Jupyter is doing here when rendering DataFrames; presumably they're using their own tooling for rendering to HTML? Or maybe they're using to_markdown() and letting the Markdown renderer produce a nice table?

linogaliana commented 1 year ago

Thanks @kevinushey for your detailed answer. In my case, moving to Quarto solved the problem since, behind the scenes, this means moving to the Jupyter engine. I guess Quarto now solves most of the cases expressed in this issue. The issue only remains for people mixing R and Python in Quarto or R Markdown.

linogaliana commented 1 year ago

If it helps: in the past, Jupyter used this CSS to style the table. However, I have not been able to locate this styling in the current Jupyter version.

cscheid commented 1 year ago

I don't know how exactly Jupyter does it, but their output is equivalent to Display(Markdown(df.to_markdown())) (or whatever the IPython classes are). So I think that if reticulate could know that it's running inside knitr and output markdown in that case, then the style would match that of Jupyter.

That would mean, in turn, that quarto gets df printing behavior that is consistent across engines (which is the cause of our upstream issue)
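
To make the Markdown route concrete, here is a rough sketch of the kind of pipe table that a `to_markdown()`-style method hands to the Markdown renderer. `to_pipe_table` is a hypothetical helper written for illustration; it is not part of pandas or reticulate, and the exact spacing that pandas emits will differ:

```python
def to_pipe_table(index, columns, rows):
    """Build a Markdown pipe table from row labels, column names and row values."""
    lines = []
    # Header row: an empty first cell sits above the index column.
    lines.append("| " + " | ".join(["", *columns]) + " |")
    # Separator row: one cell per column, plus one for the index.
    lines.append("|" + "|".join(["---"] * (len(columns) + 1)) + "|")
    # One line per data row, starting with its index label.
    for label, row in zip(index, rows):
        cells = [str(label), *(str(value) for value in row)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


table = to_pipe_table(
    index=["cat", "dog", "koala"],
    columns=["size", "weight"],
    rows=[[1.0, 3.0], [1.5, 5.0], [1.0, 2.5]],
)
print(table)
```

Because the result is plain Markdown, whatever renders the document (Pandoc, in the knitr case) decides how the table is styled.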

cderv commented 1 year ago

As this came up again on the Quarto side, I looked into this a bit. Here are some thoughts and insights.

Quarto and R Markdown will apply different styling, but in the end this is a matter of which printing method is used at the knitr step. Currently it is the default printing, but it could be improved. AFAIU Jupyter (or nbclient, or something else in the stack) registers several representations, such as text/html, text/markdown or text/latex, and chooses the one to use depending on the output format. At least, Quarto leverages that from Jupyter output.

reticulate could do something similar: either send this information to knitr, or make the choice itself based on what knitr::pandoc_to() returns. This is easier with Quarto, where outputting Markdown tables is the simplest option because Quarto will do its own processing and styling.
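
A rough sketch of that selection step, written in Python for illustration only. The names `FakeFrame`, `PREFERRED` and `pick_representation` are hypothetical, and the `target` argument stands in for whatever `knitr::pandoc_to()` reports; `_repr_html_`, `_repr_markdown_` and `_repr_latex_` are the IPython rich-display method names:

```python
class FakeFrame:
    """Tiny stand-in for a DataFrame that offers several representations."""

    def __repr__(self):
        return "     size\ncat   1.0"

    def _repr_html_(self):
        return "<table><tr><th>cat</th><td>1.0</td></tr></table>"

    def _repr_markdown_(self):
        return "|  | size |\n|---|---|\n| cat | 1.0 |"


# Representation methods to try per output format, most preferred first.
# This ordering is an assumption, not what reticulate or Jupyter actually use.
PREFERRED = {
    "html": ["_repr_html_", "_repr_markdown_"],
    "markdown": ["_repr_markdown_"],
    "latex": ["_repr_latex_", "_repr_markdown_"],
}


def pick_representation(obj, target):
    """Return (method_name, text) for the first rich representation available."""
    for name in PREFERRED.get(target, []):
        method = getattr(obj, name, None)
        if callable(method):
            text = method()
            if text is not None:
                return name, text
    # No rich representation for this target: fall back to the plain repr.
    return "__repr__", repr(obj)


frame = FakeFrame()
print(pick_representation(frame, "html")[0])
print(pick_representation(frame, "latex")[0])
print(pick_representation(frame, "docx")[0])
```

The fallback to `repr()` keeps the current behaviour for objects and output formats that have no richer representation.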

How to explicitly style a Pandas table using HTML(df.to_html()) could also be documented, as this would be the way to do it explicitly with knitr (with results: asis).

> this would probably be easier to add some printing methods (e.g. print.dataframe.pandas.default, print.dataframe.pandas.html, print.dataframe.pandas.markdown)

Going with this idea is also a good option for R Markdown.

@kevinushey @t-kalinowski hope this helps. Happy to help make this better. We would love to have Jupyter and knitr output for Python be equivalent in Quarto! (part of https://github.com/quarto-dev/quarto-cli/issues/3457)

Examples showing the different points mentioned above

Here are some tests I did with the rendering and different options with R Markdown https://rpubs.com/cderv/reticulate-rmarkdown-pandas-table-outputs

Rmd Source

````markdown
---
title: "Pandas Printing"
author: "Kevin Ushey"
date: "`r Sys.Date()`"
output: html_document
---

```{r}
library(reticulate)
use_virtualenv("r-reticulate", required = TRUE)
py_install(c("pandas", "IPython", "tabulate"))
```

```{python, echo=FALSE}
import pandas as pd

data = {
  'size': [1., 1.5, 1],
  'weight': [3, 5, 2.5]
}

df = pd.DataFrame(data, index = ['cat', 'dog', 'koala'])
```

# Default render

```{python}
df
```

# Try HTML

Some quote are still there preventing correct printing

```{python}
df.to_html()
```

```{python, results = "asis"}
df.to_html()
```

So it requires some special processing

```{python}
df_html = df.to_html()
```

```{r, results='asis'}
cat(py$df_html)
```

# Using IPython Display helps

```{python, results = "asis"}
from IPython.display import HTML
HTML(df.to_html())
```

# Improve stylings using Bootstrap class

```{python}
df_html = df.to_html(classes = ["table", "table_condensed"])
```

```{r, results='asis'}
cat(py$df_html)
```

```{python, results = "asis"}
HTML(df.to_html(classes = ["table", "table_condensed"]))
```

# Try Markdown

Still quoting, so it requires some special printing

```{python}
df.to_markdown()
```

```{python, results = "asis"}
df.to_markdown()
```

```{python}
df_markdown = df.to_markdown()
```

```{r, results = "asis"}
cat(py$df_markdown)
```

# Using IPython Display helps

```{python, results = "asis"}
from IPython.display import Markdown
Markdown(df.to_markdown())
```
````

And same document in Quarto https://rpubs.com/cderv/reticulate-quarto-pandas-table-outputs

Qmd Source

````markdown
---
title: "Pandas Printing"
author: "Christophe Dervieux"
date: today
engine: knitr
format:
  html:
    code-tools:
      source: true
---

```{r}
library(reticulate)
use_virtualenv("r-reticulate", required = TRUE)
py_install(c("pandas", "IPython", "tabulate"))
```

```{python}
import pandas as pd

data = {
  'size': [1., 1.5, 1],
  'weight': [3, 5, 2.5]
}

df = pd.DataFrame(data, index = ['cat', 'dog', 'koala'])
```

# Default render

```{python}
df
```

# Try HTML

Some quote are still there preventing correct printing

```{python}
df.to_html()
```

```{python}
#| output: asis
df.to_html()
```

So it requires some special processing

```{python}
df_html = df.to_html()
```

```{r}
#| output: asis
cat(py$df_html)
```

# Using IPython Display helps

```{python}
#| output: asis
from IPython.display import HTML
HTML(df.to_html())
```

# Improve stylings using Bootstrap class

```{python}
df_html = df.to_html(classes = ["table", "table_condensed"])
```

```{r}
#| output: asis
cat(py$df_html)
```

```{python}
#| output: asis
HTML(df.to_html(classes = ["table", "table_condensed"]))
```

# Try Markdown

Still quoting, so it requires some special printing

```{python}
df.to_markdown()
```

```{python}
#| output: asis
df.to_markdown()
```

```{python}
df_markdown = df.to_markdown()
```

```{r}
#| output: asis
cat(py$df_markdown)
```

# Using IPython Display helps

```{python}
#| output: asis
from IPython.display import Markdown
Markdown(df.to_markdown())
```
````

cderv commented 1 year ago

Note that I now understand that reticulate catches Pandas DataFrames before any _repr_html_ or to_html can be used. https://github.com/rstudio/reticulate/blob/a1d7f7f573f652212bc2c72c39317340e6d8b511/R/knitr-engine.R#L576-L580

Regarding https://github.com/quarto-dev/quarto-cli/issues/3457, if the _repr_html_ method were called, we would get the same output as in Jupyter, with the raw HTML table produced, and Quarto would handle them the same.

I confirm that removing else if (inherits(value, "pandas.core.frame.DataFrame")) does give us the same output in Quarto as with Jupyter. Though, as discussed before, R Markdown would require some additional CSS or processing to add the Bootstrap class to the table, as is done in R Markdown for Pandoc's tables (and as Quarto also does).

dfalbel commented 1 year ago

I'm leaning towards changing reticulate to produce the Markdown representation when running through knitr. With this change, tables would be displayed like this in R Markdown:

[Screenshot 2023-09-01 at 11 32 37]

and

[image]

It would still not look exactly the same as in Quarto with the Jupyter engine, which is displayed like this:

[image]

The pro of this approach is that it only requires changing reticulate and no need for special handling from RMarkdown which I think can be tricky to coordinate. Do you think this is a reasonable approach, @cderv?

cderv commented 1 year ago

> The pro of this approach is that it only requires changing reticulate and no need for special handling from RMarkdown which I think can be tricky to coordinate.

About this, I don't think anything is needed in rmarkdown or knitr in general for what reticulate is doing. knitr is a toolbox for custom engines to use, and everything that reticulate does in a knitting context is defined inside reticulate. knitr only calls eng_python() when a python chunk is seen and reticulate is available.

So this printing issue only depends on how reticulate decides to print content, possibly in eng_python_autoprint(). This function decides when to output an HTML or Markdown representation for tables (and also makes other choices for other types of output).

Usually, any issue reported as a knitr issue but relevant to the reticulate python engine is to be fixed in reticulate itself.

However, I may be missing something...

> I'm leaning towards changing reticulate to produce the Markdown representation when running through knitr

I guess it would be fine to output a Markdown table only. Quarto parses Markdown tables through Pandoc and does a lot with them, but Quarto also parses HTML tables, so that would be fine too (https://quarto.org/docs/authoring/tables.html)

I believe for Jupyter engine, Quarto will select the HTML output as I explained above: https://github.com/rstudio/reticulate/issues/783#issuecomment-1642358265 so possibly the output would be the same.

But regarding styling, this is only a matter of CSS. We can definitely fix that in Quarto to get the same styling.

Hope it helps.

Happy to discuss, help and test as needed.

joelostblom commented 3 months ago

I have come across this issue lately and think it would be really helpful if reticulate supported rich display of pandas data frames. In addition to what has already been mentioned above and the helpful documents shared by @cderv, I wanted to note that _repr_html_() respects pandas options such as the max number of rows to display and it also includes the following styling info that is not in to_html():

<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
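
If one wanted `to_html()` output to carry the same rules, a minimal sketch is to prepend that block by hand. `with_dataframe_style` is a hypothetical helper, `table_html` is assumed to be a string such as the one `df.to_html()` returns (whose table carries `class="dataframe"` by default, which is what the selectors match), and the wrapping `<div>` is an assumption about how the pieces are grouped:

```python
# The style block quoted above, copied verbatim.
DATAFRAME_STYLE = """<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>"""


def with_dataframe_style(table_html):
    """Wrap an HTML table string in a <div> preceded by the style block."""
    return "<div>\n" + DATAFRAME_STYLE + "\n" + table_html + "\n</div>"


# Stand-in for df.to_html(); a real table would have rows inside.
styled = with_dataframe_style('<table border="1" class="dataframe"></table>')
print(styled)
```

This does not reproduce the row and column truncation that `_repr_html_()` applies from the pandas display options; it only adds the CSS.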