rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0
1.68k stars 328 forks source link

Matplotlib bar plots need `plt.show()`, line plots do not #1391

Closed matthew-brett closed 1 year ago

matthew-brett commented 1 year ago

I have the following file test.Rmd:

# A notebook file

```{python}
import matplotlib.pyplot as plt
```

```{python}
plt.bar([1, 2, 3], [2, 3, 1])
```

```{python}
plt.plot(range(10))
```

```{python}
plt.plot(range(0, -10, -1))
```

When I render this file with:

rmarkdown::render("test.Rmd", output_format = "pdf_document")

I get the following file: test.pdf

Oddly - the plt.plot commands each generate their own figure, without the need for plt.show(), as I was expecting. But the plt.bar() command does not generate a figure without plt.show(), and instead gets written to an as-yet hidden figure, then appended to by the subsequent plt.plot command.

This difference in behavior between plt.bar (also plt.hist) and plt.plot was surprising to me - and doesn't correspond to any behavior I know if in Matplotlib itself. Is it expected? Is there a way to work around it?

I'm using:

$ R --version
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin22.4.0 (64-bit)

with just-built rmarkdown and reticulate.

t-kalinowski commented 1 year ago

Thanks for reporting! This should be fixed on main now. You can install the development version with:

remotes::install_github("rstudio/reticulate")
matthew-brett commented 1 year ago

Great - thanks! Sorry to be cheeky - but do you have any idea when the fix will come out in a release, ball-park?

matthew-brett commented 1 year ago

Ah - sorry - I see that the same problem still applies to plt.hist():

# A notebook file                                                                               

```{python}                                                                                     
import numpy as np                                                                              
import matplotlib.pyplot as plt                                                                 
```                                                                                             

```{python}                                                                                     
plt.hist(np.random.normal(size=1000))                                                           
```                                                                                             

```{python}                                                                                     
plt.plot(range(10))                                                                             
```                                                                                             

```{python}                                                                                     
plt.plot(range(0, -10, -1))                                                                     
``` 

I see this is because you are checking the return types from the plot commands - plt.hist returns an array and a BarContainer object.

There are quite a few other plot return objects. Ones I personally use often are plt.imshow (returns matplotlib.image.AxesImage) and 3D plots - returning e.g. [<mpl_toolkits.mplot3d.art3d.Line3D at 0x1043d6dd0>] - see the mpl 3D plot page.

I guess it will be relatively hard to track down all possible plot return types.

t-kalinowski commented 1 year ago

@matthew-brett Thanks again for reporting. plt.hist() and plt.imshow() should both render correctly now. As you encounter other matplotlib plot types, please reach out and let us know!

To answer your earlier question, we just submitted a release to CRAN so at the minimum it'll have to be at least 30 days before the next one.

matthew-brett commented 1 year ago

Will do (report) and thanks for the fix, and the information.

matthew-brett commented 1 year ago

Sorry to persist - and I guess this is a different issue - but I notice that a semi-colon at the end of the plt. line drops the output, even if the semi-colon is at the end of a line that is just modifying the plot. For example:

# A notebook file

```{python}
import numpy as np
import matplotlib.pyplot as plt
```

```{python}
plt.hist(np.random.normal(size=1000))
plt.title('A title');
```

```{python}
plt.hist(np.random.normal(size=1000));
```

This file does not generate either plot. That is inconvenient, because putting the semi-colon at the end of a plt. line is a common pattern in Jupyter notebooks, to suppress uninteresting output from plot commands at the end of the cell.

Is there any way to change that behavior?

Thanks again.

matthew-brett commented 1 year ago

Should I open a new issue for this? Just as an index of how common the semi-colon-at-end-of-plt-line is, my online Jupyter data science textbook has 370 lines matching "^plt\..*\);?$" (plt with or without semi-colon) and 110 matching "^plt\..*\);$" (plt with semi-colon). That's going be a large underestimate of the problem, because in many cases, there will be multiple plt lines in a cell, of which only the last will have a semi-colon - but it looks like that's enough to cause the problem here. So I suspect quite a lot more than half of the code cells with plt statements in them have this problem. I can do a little coding to work out the exact numbers if it would help.

t-kalinowski commented 1 year ago

The trailing semicolon is by design meant to suppress output. Following R/knitr conventions, reticulate interprets "autoprinting" of a matplotlib plot to mean "display the plot", (not the repr of the matplotlib artist class instance).

https://github.com/rstudio/reticulate/blob/558ec757b0893b4fa78b86ad9a124b0c7905bc94/R/knitr-engine.R#L192-L193

With the reticulate knitr engine, the ; is not needed to suppress output like Text(0.5, 1.0, 'A title'), that output is suppressed regardless. What ; suppresses is displaying the plot.

matthew-brett commented 1 year ago

Aha - so am I right in thinking that Reticulate itself has adopted the convention that, in Python code, the ; should suppress the output of a plot?

Testing now in RStudio - that doesn't seem to suppress the output of an R plot. But maybe it's a knitr convention in R?

But - in any case - as you know - the ; in Python code always and only means suppress the repr of the value. And - for exactly that reason - it's a very common pattern in Jupyter notebooks with plots. I have never come across the convention - in Jupyter - for ; to mean "suppress plots" - I'm sure partly because it would be so disruptive to that common pattern.

So I guess the question is - how important it is to y'all to make it possible to use Reticulate to render code in a way that is compatible with Jupyter? And conversely - how important is the convention of using ; to suppress plots in Python code? Because this convention will - for example - make it impractical to use Reticulate for our case, where we want to build the chapters nicely (with Reticulate) but also make the same code available as Jupyter notebooks for interaction.

t-kalinowski commented 1 year ago

I appreciate your suggestion to change reticulate to more-closely mimic jupyter's behavior here. However, I don't think it's a good idea to introduce a (breaking) change to reticulate that would, in practice, require users to end every python chunk that generates a plot with a trailing semicolon, just to match jupyter's design decision to do so. Overall I think it would lead to a worse experience for most users.

Two ideas come to mind for how you might maintain knitr/jupyter equivalency for the book:

a) Update all the chunks where matplotlib plot repr() output is suppressed to end with plt.show() instead of a semicolon. Example:

plt.hist(np.random.normal(size=1000))
plt.title('A title')
plt.show()

This might even have more didactic value for readers, at the expense of a little bit more code.

b) Direct quarto to use the jupyter rendering engine. Although you'd lose some knitr and reticulate features, you'd gain fidelity to jupyter's design choices.

stefanv commented 1 year ago

I'm trying to track down the origin of this decision, but all I can find is https://github.com/rstudio/reticulate/pull/722 which closed https://github.com/rstudio/reticulate/issues/717 without any discussion. Ironically, in 717 the reporter asked for behavior more similar to Jupyter notebooks.

t-kalinowski commented 1 year ago

I think the closest might be expressed in this comment: https://github.com/rstudio/reticulate/issues/717#issuecomment-586011449

Is calling plt.show() instead of ; a viable alternative for you?

stefanv commented 1 year ago

That will work, although it is not the convention to call plt.show() in notebooks. We could potentially post-process all our notebooks to remove those statements, I suppose.

Is there any way the reticulate behavior could be configurable? I.e., switch off / on / suppress-repr-only.

t-kalinowski commented 1 year ago

The most straightforward way to add this to your book might be to provide knitr a custom hook that wraps the reticulate knitr engine, like so:

```{r setup, echo = FALSE, output = FALSE}
knitr::knit_engines$set(python = function(options) {
  code <- options$code
  last_line <- code[[length(code)]]
  if(endsWith(last_line, ";")) {
    options$code[[length(code)]] <- sub(";$", "", last_line)
  }
  reticulate::eng_python(options)
})
stefanv commented 1 year ago

Thanks for the recommendation. That would get rid of the semi-colon, but would it give us the Jupyter-like behavior where the repr is suppressed but not the plot?

t-kalinowski commented 1 year ago

Yes, that is the default behavior.

matthew-brett commented 1 year ago

in practice, require users to end every python chunk that generates a plot with a trailing semicolon, just to match jupyter's design decision to do so. Overall I think it would lead to a worse experience for most users.

I'm sorry - I may have misinterpreted your meaning here - but Jupyter itself attaches no meaning the semi-colon, other than it's ordinary meaning from the dawn of time in interactive Python, which it to suppress the repr.

I actually don't know of any mechanism in Jupyter to suppress the plot - I haven't personally had cause to use that over the 10 years or so I have been using it.

So there's no requirement for a semi-colon - it just a common way to make the notebook neater, by suppressing the rather verbose reprs of the Matplotlib return objects.

But - and this just my lack of understanding of Reticulate's user-base - I guess there must be a set of users who are using Reticulate, who don't use Jupyter very much? And they have come to expect this different convention you have, that is alien to (and incompatible with) ordinary Jupyter usage? It's hard to imagine this default will not tend to put off experienced Jupyter users - but perhaps they are not your primary target audience?

As @stefanv said - having plot.show() at the end of every cell would be an anti-pattern, and one we really don't want our readers to see - experienced users effectively never use plt.show() in a notebook, because they are aware it is always unnecessary.

matthew-brett commented 1 year ago

And - sorry - just to clarify further - the plot only displays if the last line in the cell is a plotting command (without a semi-colon)? I notice for example that this cell does not generate a plot:

```{python}
plt.plot(range(10))
a = 1
```

So the full rule is - in order to get an automatic output of a plot, without plt.show, the cell needs to have:

?

Repeating Stefan's request - and also - I think the request of the original poster of https://github.com/rstudio/reticulate/issues/717#issuecomment-586011449 - I think you will get a lot of thanks for a configuration setting for Reticulate that makes it behave like Jupyter . So setting some option to reticulate.opt.jupyter_compat=TRUE or similar, would mean that Reticulate blocks would:

The use-case is just the one we have - and I suspect the other poster had - which is that we want to be able to use e.g. Quarto to write high-quality documents in Jupyter input format - meaning a format that would read correctly as idiomatic Jupyter code, and with the same outputs - and be readily exportable and usable as a Jupyter notebook. Of course we could switch to some other not-Reticulate, not-Knitr method, but as you say - then we lose the other, and major, benefits of Knitr, such as inline variables.

My guess is that you'd have a fairly large proportion of users in that situation - where they didn't care very much about compatibilty of behaviour between R chunks and Reticulate chunks, but they did care about compatibility with Jupyter.

matthew-brett commented 1 year ago

On reflection - maybe the best approach for now is to for us to make a Github fork to install from that does implement the Jupyter pattern. Of course we'll make sure, with the README, that no-one installs our fork by mistake, thinking it's the definitive Reticulate. But at least that way, we'll get and idea of how many people want that behavior, and therefore, whether there really is a need for this option.

t-kalinowski commented 1 year ago

I think this feature is something we can add to reticulate, I understand the need for it.

Making this work would require some refactoring of the main python knitr engine loop. Especially, detecting when it's appropriate to call plt.show(). A PR would be welcome!

Conceptually, it might be helpful to be aware of the mechanics with ggplot2, a plotting library in R. A ggplot2 object has a default print() method that displays the plot on the default graphics device. knitr auto-prints the return value from individual expressions by default, and for ggplot2 objects, that means, displays the plot. This is the behavior that the knitr python engine is modeled after--the final return value from a chunk is "auto-printed", which, for a matplotlib plot object, means, displayed. There is a slight deviation from standard R-knitr evaluation behavior because the auto printing only happens for the final expression in a python chunk, for reasons of practicality (to avoid displaying a plot-in-progress).

matthew-brett commented 1 year ago

Thanks - that's helpful.

I guess the key function is the one you pointed to before: https://github.com/rstudio/reticulate/blob/558ec757b0893b4fa78b86ad9a124b0c7905bc94/R/knitr-engine.R#L192-L193 . Would you mind pointing me to part of the code that stops Matplotlib output except as output from the last line of the cell?

matthew-brett commented 1 year ago

Nevermind - I worked out the code, to some degree - after a while. See PR #1394 - currently work-in-progress, but ready for review.