quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

qmd -> ipynb retains code lines only up to the first cell output #11558

Open coatless opened 1 day ago

coatless commented 1 day ago

Bug description

Under engine: knitr, the "source" field of a code cell only captures the code up to the first line that emits output.

For instance, if we have:

```{r}
x <- 1 + 1
x
print("Hello R world!")
sapply(1:10, function(x) x^2)
```

In the `output.ipynb` JSON we have:

```json
"source": [
  "x <- 1 + 1\n",
  "x"
],
```

with

"outputs": [
  {
    "output_type": "stream",
    "name": "stdout",
    "text": [
      "[1] 2"
    ]
  },
  {
    "output_type": "stream",
    "name": "stdout",
    "text": [
      "[1] \"Hello R world!\""
    ]
  },
  {
    "output_type": "stream",
    "name": "stdout",
    "text": [
      " [1]   1   4   9  16  25  36  49  64  81 100"
    ]
  }
],

Looking at the native format, the results have been evaluated:

  , Div
      ( "" , [ "cell" ] , [] )
      [ CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] ) "x <- 1 + 1\nx"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "[1] 2" ]
      , CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] )
          "print(\"Hello R world!\")"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "[1] \"Hello R world!\"" ]
      , CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] )
          "sapply(1:10, function(x) x^2)"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock
              ( "" , [] , [] )
              " [1]   1   4   9  16  25  36  49  64  81 100"
          ]
      ]

So, I think what's happening is that only the first CodeBlock is being sent to "source", while the output blocks are being correctly collected and then sent to "outputs". The same happens for Python code.

On a side note, is there a way to have engine: jupyter evaluate code from a Qmd? The evaluate: false doesn't seem to be able to be overridden and the authoring within qmd doesn't seem to retain cell results in the output.

https://quarto.org/docs/computations/r.html#disabling-execution

Steps to reproduce

qmd -> ipynb

---
title: "Quarto Conversion"
format: jupyter
engine: knitr
---

# Overview

Testing a Quarto an R + Python notebook

## Active cells

```{r}
x <- 1 + 1
x
print("Hello R world!")
sapply(1:10, function(x) x^2)
```

```{python}
y = 1 + 6
print(y)
print("Hello Python world!")
[x**2 for x in range(10)]
```

## Inactive cells

```r
print("Goodbye R world!")
2 + 2

sapply(1:10, function(x) x^4)
```

```python
print("Hello Python world!")
print([x**2 for x in range(10)])
8 + 5
```

<details>
<summary> ipynb format full JSON output </summary>

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Quarto Conversion\n",
        "\n",
        "# Overview\n",
        "\n",
        "Testing a Quarto an R + Python notebook\n",
        "\n",
        "## Active cells"
      ],
      "id": "4f5a8b04-1046-41f8-9837-604ebaa0db0a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[1] 2"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[1] \"Hello R world!\""
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            " [1]   1   4   9  16  25  36  49  64  81 100"
          ]
        }
      ],
      "source": [
        "x <- 1 + 1\n",
        "x"
      ],
      "id": "da39a8ca-a365-49e1-8e18-e465c57b33d7"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "7"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Hello Python world!"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]"
          ]
        }
      ],
      "source": [
        "y = 1 + 6\n",
        "print(y)"
      ],
      "id": "f2d24e33-fbe5-48e1-a4b6-18d181937c61"
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Inactive cells\n",
        "\n",
        "``` r\n",
        "print(\"Goodbye R world!\")\n",
        "2 + 2\n",
        "\n",
        "sapply(1:10, function(x) x^4)\n",
        "```\n",
        "\n",
        "``` python\n",
        "print(\"Hello Python world!\")\n",
        "print([x**2 for x in range(10)])\n",
        "8 + 5\n",
        "```"
      ],
      "id": "d346e134-b945-4ad8-beea-cbf09a0b8340"
    }
  ],
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  }
}
```

</details>

qmd -> native

---
title: "Quarto Conversion"
format: native
engine: knitr
---

# Overview

Testing a Quarto an R + Python notebook

## Active cells

```{r}
x <- 1 + 1
x
print("Hello R world!")
sapply(1:10, function(x) x^2)
```

```{python}
y = 1 + 6
print(y)
print("Hello Python world!")
[x**2 for x in range(10)]
```

## Inactive cells

```r
print("Goodbye R world!")
2 + 2

sapply(1:10, function(x) x^4)
```

```python
print("Hello Python world!")
print([x**2 for x in range(10)])
8 + 5
```

<details>
<summary> Native output </summary>

```lua
Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "title"
            , MetaInlines [ Str "Quarto" , Space , Str "Conversion" ]
            )
          ]
    }
  [ Header 1 ( "overview" , [] , [] ) [ Str "Overview" ]
  , Para
      [ Str "Testing"
      , Space
      , Str "a"
      , Space
      , Str "Quarto"
      , Space
      , Str "an"
      , Space
      , Str "R"
      , Space
      , Str "+"
      , Space
      , Str "Python"
      , Space
      , Str "notebook"
      ]
  , Header
      2
      ( "active-cells" , [] , [] )
      [ Str "Active" , Space , Str "cells" ]
  , Div
      ( "" , [ "cell" ] , [] )
      [ CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] ) "x <- 1 + 1\nx"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "[1] 2" ]
      , CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] )
          "print(\"Hello R world!\")"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "[1] \"Hello R world!\"" ]
      , CodeBlock
          ( "" , [ "r" , "cell-code" ] , [] )
          "sapply(1:10, function(x) x^2)"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock
              ( "" , [] , [] )
              " [1]   1   4   9  16  25  36  49  64  81 100"
          ]
      ]
  , Div
      ( "" , [ "cell" ] , [] )
      [ CodeBlock
          ( "" , [ "python" , "cell-code" ] , [] )
          "y = 1 + 6\nprint(y)"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "7" ]
      , CodeBlock
          ( "" , [ "python" , "cell-code" ] , [] )
          "print(\"Hello Python world!\")"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "Hello Python world!" ]
      , CodeBlock
          ( "" , [ "python" , "cell-code" ] , [] )
          "[x**2 for x in range(10)]"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock
              ( "" , [] , [] ) "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]"
          ]
      ]
  , Header
      2
      ( "inactive-cells" , [] , [] )
      [ Str "Inactive" , Space , Str "cells" ]
  , CodeBlock
      ( "" , [ "r" ] , [] )
      "print(\"Goodbye R world!\")\n2 + 2\n\nsapply(1:10, function(x) x^4)"
  , CodeBlock
      ( "" , [ "python" ] , [] )
      "print(\"Hello Python world!\")\nprint([x**2 for x in range(10)])\n8 + 5"
  ]
```

</details>

Expected behavior

The "source" field of the ipynb code cell should list all of the cell's code, not just the first code block of the evaluated cell.

Actual behavior

All output was retained, but "source" contained only the lines of the cell up to the first line of code that produced output.
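
The mismatch can be checked mechanically. The following is a minimal Python sketch, assuming the rendered notebook has been loaded into a dictionary; the helper names `cell_source` and `missing_source_lines` are hypothetical and are not part of Quarto or nbformat. It compares the code written in the `.qmd` cell with the "source" field of the rendered cell, using the R cell from this report as data.

```python
# The R cell exactly as it appears in the rendered output.ipynb above.
cell = {
    "cell_type": "code",
    "execution_count": None,
    "metadata": {},
    "outputs": [
        {"output_type": "stream", "name": "stdout", "text": ["[1] 2"]},
        {"output_type": "stream", "name": "stdout", "text": ['[1] "Hello R world!"']},
        {"output_type": "stream", "name": "stdout",
         "text": [" [1]   1   4   9  16  25  36  49  64  81 100"]},
    ],
    "source": ["x <- 1 + 1\n", "x"],
}

# The code of the same cell as written in the .qmd file.
expected_code = (
    "x <- 1 + 1\n"
    "x\n"
    'print("Hello R world!")\n'
    "sapply(1:10, function(x) x^2)"
)


def cell_source(cell):
    """Return the cell source as one string (it may be stored as a list of lines)."""
    src = cell["source"]
    return src if isinstance(src, str) else "".join(src)


def missing_source_lines(cell, expected_code):
    """List the expected code lines that do not appear in the cell source.

    This is a simple line-membership check; it ignores order and duplicates.
    """
    present = set(cell_source(cell).splitlines())
    return [line for line in expected_code.splitlines() if line not in present]


missing = missing_source_lines(cell, expected_code)
print(len(cell["outputs"]), "outputs;", len(missing), "code lines missing from source")
for line in missing:
    print("  missing:", line)
```

For this cell, all three outputs are present while the two statements after the first output are reported as missing from "source".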

Your environment

Quarto check output

Quarto 1.6.37
[✓] Checking environment information...
      Quarto cache location: /Users/ronin/Library/Caches/quarto
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.4.0: OK
      Dart Sass version 1.70.0: OK
      Deno version 1.46.3: OK
      Typst version 0.11.0: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.6.37
      Path: /Applications/quarto/bin

[✓] Checking tools....................OK
      TinyTeX: (not installed)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Using: Installation From Path
      Path: /opt/homebrew/bin
      Version: undefined

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.12.7
      Path: /Users/ronin/Documents/GitHub/quarto/colab/build/bin/python3
      Jupyter: 5.7.2
      Kernels: ir, python3

[✓] Checking Jupyter engine render....OK

[✓] Checking R installation...........OK
      Version: 4.4.2
      Path: /Library/Frameworks/R.framework/Resources
      LibPaths:
        - /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
      knitr: 1.49
      rmarkdown: 2.29

[✓] Checking Knitr engine render......OK

cderv commented 20 hours ago

I am surprised that your YAML header works for you 🤔

---
title: "Quarto Conversion"
format: jupyter
engine: knitr
---

I get this error because format: jupyter is not among the supported formats:

ERROR: Unknown format jupyter

Do you mean format: ipynb ?

coatless commented 20 hours ago

Nope, I still end up with a cell missing under format: ipynb.


cderv commented 18 hours ago

> still end up with a cell missing under format: ipynb.

I did not say it was not the case. I was surprised by format: jupyter and wanted to clarify. Let me check what happens. It could also be Pandoc conversion missing something... Pandoc is the one doing the ipynb writing when --to ipynb is used.

cderv commented 18 hours ago

> It could also be Pandoc conversion missing something... Pandoc is the one doing the ipynb writing when --to ipynb is used.

It is not. It is something we do to handle the specific content from the intermediate markdown 🤔 I'll try to find what.

> On a side note, is there a way to have engine: jupyter evaluate code from a Qmd? The evaluate: false doesn't seem to be able to be overridden and the authoring within qmd doesn't seem to retain cell results in the output.

What do you mean exactly? By default, python code cells will be executed when rendering a .qmd. Do I understand correctly that you want to disable execution by default, and enable it manually?

We had a past discussion on eval: false and also the execute.enabled option. Maybe it has elements that could help avoid confusion. It is still open, BTW.

cderv commented 16 hours ago

> It is something we do to handle the specific content from the intermediate markdown 🤔 I'll try to find what.

We do some processing in https://github.com/quarto-dev/quarto-cli/blob/5645bad012374ebae87c8e6db9aade0e088a5d24/src/resources/filters/quarto-post/ipynb.lua

and also some post processing https://github.com/quarto-dev/quarto-cli/blob/37bc223282d9e239d1245575100c7dbf3de74a52/src/format/ipynb/format-ipynb.ts#L80-L145

From debugging the post-processing, it does not seem to come from there, as the ipynb is already missing the source part.

So it comes from Lua. From looking at the trace JSON, I think our Lua processing ends up producing something similar to this:

---
title: "Quarto Conversion"
format: ipynb
keep-md: true
engine: knitr
---

# Overview

Testing a Quarto an R + Python notebook

## Active cells

::: {.cell .code}

```{.r}
x <- 1 + 1
x
```

::: {.output .stream .stdout}

```
[1] 2
```

:::

```{.r}
print("Hello R world!")
```

::: {.output .stream .stdout}

```
[1] "Hello R world!"
```

:::

```{.r}
sapply(1:10, function(x) x^2)
```

::: {.output .stream .stdout}

```
 [1]   1   4   9  16  25  36  49  64  81 100
```

:::

:::


<details>
<summary>Last from last Doc AST in trace.json</summary>

blocks:

</details>

And if we render this with Pandoc directly, we can reproduce the problem:

❯ quarto pandoc --to ipynb index.md
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Overview\n",
    "\n",
    "Testing a Quarto an R + Python notebook\n",
    "\n",
    "## Active cells"
   ],
   "id": "f24ca080-5328-4da4-b395-e74868228781"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[1] 2"
     ]
    },
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "[1] \"Hello R world!\""
     ]
    },
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      " [1]   1   4   9  16  25  36  49  64  81 100"
     ]
    }
   ],
   "source": [
    "x <- 1 + 1\n",
    "x"
   ],
   "id": "4d93688e-66f0-4437-ba4e-285e98bc7062"
  }
 ],
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {}
}

So this is related to how Pandoc's Jupyter conversion handles code blocks.

From their doc: https://pandoc.org/MANUAL.html#jupyter-notebooks

> When creating a Jupyter notebook, pandoc will try to infer the notebook structure. Code blocks with the class code will be taken as code cells, and intervening content will be taken as Markdown cells.

I think this could be because we put the code class only on the outer cell div, so only the first code block is taken into account and not the following ones.

We do that here https://github.com/quarto-dev/quarto-cli/blob/37bc223282d9e239d1245575100c7dbf3de74a52/src/resources/filters/quarto-post/ipynb.lua#L66-L73

So we may need to revisit how we tweak the AST for the ipynb output.
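
To make the idea of tweaking the AST concrete, here is a minimal Python sketch that works on Pandoc's JSON AST (the format produced by `pandoc -t json`). It is only an illustration under assumptions: the function names `is_cell_code` and `merge_cell_code` are hypothetical, and this is neither what Quarto's `ipynb.lua` filter does nor a description of the workaround linked in the next comment. It merges every `cell-code` CodeBlock inside a cell Div into a single CodeBlock, so that a writer that only picks up the first code block would still see all of the cell's code.

```python
def is_cell_code(block):
    """True for a Pandoc JSON CodeBlock carrying the "cell-code" class."""
    if block.get("t") != "CodeBlock":
        return False
    (_ident, classes, _attrs), _text = block["c"]
    return "cell-code" in classes


def merge_cell_code(div):
    """Return a cell Div whose cell-code blocks are merged into one CodeBlock.

    The merged block keeps the attributes of the first code block and is
    placed first, followed by the remaining (output) blocks in their order.
    """
    attr, blocks = div["c"]
    code_blocks = [b for b in blocks if is_cell_code(b)]
    if len(code_blocks) < 2:
        return div  # nothing to merge
    merged = {
        "t": "CodeBlock",
        "c": [code_blocks[0]["c"][0], "\n".join(b["c"][1] for b in code_blocks)],
    }
    others = [b for b in blocks if not is_cell_code(b)]
    return {"t": "Div", "c": [attr, [merged] + others]}


def output_div(text):
    """Build a cell-output Div like the ones in the native output above."""
    return {
        "t": "Div",
        "c": [["", ["cell-output", "cell-output-stdout"], []],
              [{"t": "CodeBlock", "c": [["", [], []], text]}]],
    }


def code_block(text):
    """Build an R cell-code CodeBlock like the ones in the native output above."""
    return {"t": "CodeBlock", "c": [["", ["r", "cell-code"], []], text]}


# The R cell from the native output, expressed as Pandoc JSON.
cell = {
    "t": "Div",
    "c": [
        ["", ["cell"], []],
        [
            code_block("x <- 1 + 1\nx"),
            output_div("[1] 2"),
            code_block('print("Hello R world!")'),
            output_div('[1] "Hello R world!"'),
            code_block("sapply(1:10, function(x) x^2)"),
            output_div(" [1]   1   4   9  16  25  36  49  64  81 100"),
        ],
    ],
}

merged_cell = merge_cell_code(cell)
print(merged_cell["c"][1][0]["c"][1])
```

Whether merging is the right fix, as opposed to changing which classes the filter assigns, is an open design question; the sketch only shows that the information needed to rebuild the full source is still present in the AST.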

coatless commented 13 hours ago

@cderv If it's helpful, here's what I developed to get around the issue:

https://github.com/coatless-quarto/colab/blob/b913049e278c485dc239b00b63d2859b8eab2e4f/_extensions/colab/patch-blocks.lua

> > On a side note, is there a way to have engine: jupyter evaluate code from a Qmd? The evaluate: false doesn't seem to be able to be overridden and the authoring within qmd doesn't seem to retain cell results in the output.
>
> What do you mean exactly? By default, python code cells will be executed when rendering a .qmd. Do I understand correctly that you want to disable execution by default, and enable it manually?
>
> We had a past discussion on eval: false and also the execute.enabled option. Maybe it has elements that could help avoid confusion. It is still open, BTW.

Thanks! Though, I'm not sure this fully addresses it. This side question came up as I was running into an odd issue when only using R actively, e.g. {r}. Maybe this should be split into a different issue?

To illustrate with a few modifications to the above document (removing active Python, e.g. {python}, code cells), I have:

---
title: "Quarto Conversion"
format: 
  ipynb: default
engine: jupyter
---

# Overview

Testing a Quarto an R notebook backed by Jupyter

## Active cells

```{r}
x <- 1 + 1
x
print("Hello R world!")
sapply(1:10, function(x) x^2)
```

```{r}
mm <- 2 + 2
```

But, my native output gives:

```lua
  , CodeBlock
      ( "" , [ "{r}" ] , [] )
      "x <- 1 + 1\nx\nprint(\"Hello R world!\")\nsapply(1:10, function(x) x^2)"
  , CodeBlock ( "" , [ "{r}" ] , [] ) "mm <- 2 + 2"
```

whereas with Python it would be:

Div
      ( "" , [ "cell" ] , [ ( "execution_count" , "1" ) ] )
      [ CodeBlock
          ( "" , [ "python" , "cell-code" ] , [] )
          "y = 1 + 6\nprint(y)\nprint(\"Hello Python world!\")\n[x**2 for x in range(10)]"
      , Div
          ( "" , [ "cell-output" , "cell-output-stdout" ] , [] )
          [ CodeBlock ( "" , [] , [] ) "7\nHello Python world!" ]
      , Div
          ( ""
          , [ "cell-output" , "cell-output-display" ]
          , [ ( "execution_count" , "2" ) ]
          )
          [ CodeBlock
              ( "" , [] , [] ) "[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]"
          ]
      ]

P.S. How'd you generate the doc tree as YAML in Doc AST in trace.json? That's really convenient.

cderv commented 13 minutes ago

> This side question came up as I was running into an odd issue when only using R actively, e.g. {r}. Maybe this should be split into a different issue?

Yes, it should be in another discussion thread. Let me answer here one last time, and we'll move to another issue for your new question.

> To illustrate with a few modifications to the above document (removing active Python, e.g. {python}, code cells), I have:

You kept engine: jupyter but have only r code cells. This means Jupyter will try to use a kernel where the language r is available, or fall back to the default one, which is python. (I see now that we don't emit any message about that before rendering.) In the latter case, cells whose language is not python won't be evaluated. This is probably why you observe native output like this.

If you use python cells, they will be executed by the kernel, which explains the difference in the native output.

Jupyter is mono-language by default, and meant for python code cells, unless you specify a kernel that supports multiple languages or the specific language.

I hope this clarifies.
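
As a toy model of the explanation above (not Quarto's or Jupyter's actual logic), the following Python sketch splits a document's cells by whether their language matches the kernel language. The function name `split_by_kernel_language` and the tuple representation of cells are assumptions made for illustration only.

```python
def split_by_kernel_language(cells, kernel_language):
    """Partition (language, code) cells into executed and passed-through lists.

    Toy rule: a cell is executed only if its language matches the kernel's.
    """
    executed, passed_through = [], []
    for language, code in cells:
        target = executed if language == kernel_language else passed_through
        target.append((language, code))
    return executed, passed_through


# The R-only document from the comment above, rendered with a python kernel.
r_only_cells = [
    ("r", 'x <- 1 + 1\nx\nprint("Hello R world!")\nsapply(1:10, function(x) x^2)'),
    ("r", "mm <- 2 + 2"),
]
executed, passed_through = split_by_kernel_language(r_only_cells, "python")
print(len(executed), "executed,", len(passed_through), "left as plain code blocks")

# The same cells with a kernel whose language is r.
executed_r, passed_through_r = split_by_kernel_language(r_only_cells, "r")
print(len(executed_r), "executed,", len(passed_through_r), "left as plain code blocks")
```

Under this model, the R-only document with a python kernel has no executed cells, which matches the plain `{r}` CodeBlocks seen in the native output.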

> How'd you generate the doc tree as YAML in Doc AST in trace.json? That's really convenient.

This is a debug technique we mention in https://quarto.org/docs/troubleshooting/#debugging-lua-filters. You can set an env var to a JSON file, and we have a viewer tool in the source repo that lets you investigate Lua filters. We don't yet have a blog post about this in detail.