quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

Quarto rendering breaks on nested jupyter cell metadata #9089

Open JanPalasek opened 3 months ago

JanPalasek commented 3 months ago

Bug description

When I render a Jupyter Notebook exported from a DataBricks notebook, the output shows cell metadata for some reason.

Note that the problem is fixed by clearing metadata in the cells.

Steps to reproduce

Content of the ipynb file

```json
{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 },
          "inputWidgets": {},
          "nuid": "493a7a31-3e86-4d2b-813c-951fc7593ea3",
          "showTitle": false,
          "title": ""
        }
      },
      "source": [
        "# DataBricks Notebooks\n",
        "\n",
        "## Introduction\n",
        "\n",
        "In this notebook, we try Quarto with DataBricks.\n",
        "\n",
        "## Chapter\n",
        "\n",
        "In the first chapter, we try multiple commands and observe their results."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 },
          "inputWidgets": {},
          "nuid": "7039bc23-d898-4506-b24d-8f1002a66d18",
          "showTitle": false,
          "title": ""
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "+--------------------+---------------------+-------------+-----------+----------+-----------+\n|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|\n+--------------------+---------------------+-------------+-----------+----------+-----------+\n| 2016-02-14 16:52:13| 2016-02-14 17:16:04| 4.94| 19.0| 10282| 10171|\n| 2016-02-04 18:44:19| 2016-02-04 18:46:00| 0.28| 3.5| 10110| 10110|\n| 2016-02-17 17:13:57| 2016-02-17 17:17:55| 0.7| 5.0| 10103| 10023|\n| 2016-02-18 10:36:07| 2016-02-18 10:41:45| 0.8| 6.0| 10022| 10017|\n| 2016-02-22 14:14:41| 2016-02-22 14:31:52| 4.51| 17.0| 10110| 10282|\n+--------------------+---------------------+-------------+-----------+----------+-----------+\nonly showing top 5 rows\n\n"
          ]
        }
      ],
      "source": [
        "df = spark.read.table(\"samples.nyctaxi.trips\")\n",
        "df.show(5)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 },
          "inputWidgets": {},
          "nuid": "3754cc12-1f1f-405e-8525-b923f1b42a1e",
          "showTitle": false,
          "title": ""
        }
      },
      "source": [
        "This is text in-between the commands."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 },
          "inputWidgets": {},
          "nuid": "21c1cb83-83cc-40c8-9a8b-f5378d3f29be",
          "showTitle": false,
          "title": ""
        }
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": { "text/plain": [ "" ] },
          "execution_count": 2,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from databricks.sdk.runtime import dbutils\n",
        "dbutils.fs.ls(\"dbfs:/Workspace/Users/\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "application/vnd.databricks.v1+cell": {
          "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 },
          "inputWidgets": {},
          "nuid": "d5627e1f-aecf-486b-b6f5-0eb6e3144e6d",
          "showTitle": false,
          "title": ""
        }
      },
      "source": [
        "## Conclusion\n",
        "\n",
        "Currently, Quarto does not fully work, at least not rendering."
      ]
    }
  ],
  "metadata": {
    "application/vnd.databricks.v1+notebook": {
      "dashboards": [],
      "language": "python",
      "notebookMetadata": { "pythonIndentUnit": 4 },
      "notebookName": "dbx_exploration",
      "widgets": {}
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
```

Expected behavior

No metadata present in the html output.

Actual behavior

image

Metadata are visible in the output report. I believe quarto should ignore cell metadata that it does not know but maybe I'm wrong.

Your environment

Quarto check output

Quarto 1.4.551
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.11: OK
      Dart Sass version 1.69.5: OK
      Deno version 1.37.2: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.4.551
      Path: /opt/quarto/1.4.551/bin

[✓] Checking tools....................OK
      TinyTeX: (not installed)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Tex: (not detected)

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.11.8
      Path: /usr/local/bin/python3
      Jupyter: 5.7.2
      Kernels: python3

(-) Checking Jupyter engine render....[IPKernelApp] WARNING | Unknown error in handling startup files:
[✓] Checking Jupyter engine render....OK

[✓] Checking R installation...........(None)

cscheid commented 3 months ago

Can you please share the notebook? It's hard for us to go from a screenshot alone.

mcanouil commented 3 months ago

It's shared as JSON in a details HTML tag under the "Steps to reproduce" section ;)

edit: I can reproduce with the file, but I cannot execute the cells, as the document itself is neither reproducible nor self-contained.

image

@JanPalasek Would you mind sharing a Notebook that is self-contained and does not require external files? Right now, we can't execute the Notebook.

cscheid commented 3 months ago

@JanPalasek Sorry I misread your message!

@mcanouil if the output is present in the notebook we shouldn't need any more repro than that.

mcanouil commented 3 months ago

> @mcanouil if the output is present in the notebook we shouldn't need any more repro than that.

I am not very familiar with how Notebooks work, and I thought the metadata in question might have something to do with the computation itself. Being able to execute would have answered this question.

cscheid commented 3 months ago

> I am not very familiar with how Notebooks work, and I thought the metadata in question might have something to do with the computation itself. Being able to execute would have answered this question.

By design, Quarto allows users to render "computed" .ipynb files. This is necessary to support workflows where .ipynb files are created in a computing environment different from the rendering environment. Most often this happens when notebooks are executed in Google Colab, Hugging Face Spaces, or other high-performance computing environments. It's a common Jupyter workflow that we've decided to support.

As a result, we don't need to compute to see the bug. This report is reproducible: you've even reproduced it! It's a valid .ipynb file that produces broken HTML when we call `quarto render file.ipynb --to html --no-execute`.

cscheid commented 3 months ago

This has to do with Quarto getting confused by the cell metadata emitted by Databricks, which has hierarchical (nested) keys. Consider the output of `quarto convert` on that file:

$ quarto convert 9089.ipynb
...
$ cat 9089.qmd
---
title: DataBricks Notebooks
jupyter: python3
---

## Introduction

In this notebook, we try Quarto with DataBricks.

## Chapter

In the first chapter, we try multiple commands and observe their results.

```{python}
#| application/vnd.databricks.v1+cell: {cellMetadata: {byteLimit: 2048000, rowLimit: 10000}, inputWidgets: {}, nuid: 7039bc23-d898-4506-b24d-8f1002a66d18, showTitle: false, title: ''}
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```
...


That nested YAML entry is going to eventually confuse Pandoc mightily.
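
To check whether a given notebook carries this kind of metadata before rendering, something like the sketch below can help. It is not part of Quarto: the function name `find_nested_cell_metadata` and the sample notebook are made up for illustration, and it assumes that any cell metadata value that is itself a mapping is a candidate for the problem described above (which may also flag harmless standard keys).

```python
def find_nested_cell_metadata(notebook: dict) -> list[tuple[int, str]]:
    """Return (cell index, metadata key) pairs whose value is a nested mapping."""
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        for key, value in cell.get("metadata", {}).items():
            # A dict-valued entry becomes a nested YAML option after `quarto convert`
            if isinstance(value, dict):
                hits.append((index, key))
    return hits


# Tiny stand-in notebook (hypothetical), modelled on the one in this issue
sample = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {
                "application/vnd.databricks.v1+cell": {
                    "cellMetadata": {"byteLimit": 2048000, "rowLimit": 10000},
                    "showTitle": False,
                }
            },
            "source": ["df.show(5)"],
        },
    ]
}

print(find_nested_cell_metadata(sample))
```

For a real file, load it with `json.load` first and pass the resulting dict to the function.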

From there, edit the .qmd file to add `keep-md: true`, and render it like so:

$ quarto render 9089.qmd --no-execute --to html


The `.html.md` file now points to where things are going to go bad:

$ cat 9089.html.md

---
title: DataBricks Notebooks
keep-md: true
---

## Introduction

In this notebook, we try Quarto with DataBricks.

## Chapter

In the first chapter, we try multiple commands and observe their results.

::: {#7474b8c9 .cell application/vnd.databricks.v1+cell='{"cellMetadata":{"byteLimit":2048000,"rowLimit":10000},"inputWidgets":{},"nuid":"7039bc23-d898-4506-b24d-8f1002a66d18","showTitle":false,"title":""}'}

df = spark.read.table("samples.nyctaxi.trips")
df.show(5)

:::

...



And now we can plainly see the problem.
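
As an illustration only (this is not Quarto's code, just a sketch that reproduces the shape of the attribute above), the nested metadata value ends up as one compact JSON string attached to an attribute whose name is the Databricks MIME-type key:

```python
import json

key = "application/vnd.databricks.v1+cell"
value = {
    "cellMetadata": {"byteLimit": 2048000, "rowLimit": 10000},
    "inputWidgets": {},
    "nuid": "7039bc23-d898-4506-b24d-8f1002a66d18",
    "showTitle": False,
    "title": "",
}

# Compact JSON wrapped in single quotes, matching the div attribute in the .html.md file
compact = json.dumps(value, separators=(",", ":"))
attribute = f"{key}='{compact}'"
print(attribute)
```

The attribute name contains `/` and `+`, and its value is a whole JSON object, which is not what the div attribute syntax was designed to carry.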

cscheid commented 3 months ago

Fundamentally, the problem is:

This is a bad bug on our side, and we need to come up with a better design.

JanPalasek commented 3 months ago

@cscheid What is currently the best workaround? I have a Quarto Website that contains some notebooks rendered using databricks.

rareal commented 3 months ago

> @cscheid What is currently the best workaround? I have a Quarto Website that contains some notebooks rendered using databricks.

I found that for my exported Databricks notebooks the metadata has the key 'application/vnd.databricks.v1+cell', so I'm just reading the ipynb files (they are just .json files) and removing it before quarto render. Like this:

# python
import json

DATABRICKS_KEY = 'application/vnd.databricks.v1+cell'

def clean_dbk_exported_ipynb_file(file_path):
    """Strip the Databricks cell metadata from an exported .ipynb file, in place."""
    print(f"working on file: {file_path}")
    # read file (.ipynb files are plain JSON)
    with open(file_path, encoding='utf-8') as f:
        fc = json.load(f)
    # remove the Databricks metadata from every cell that has it
    for cell in fc['cells']:
        cell.get('metadata', {}).pop(DATABRICKS_KEY, None)
    # write back
    with open(file_path, 'w', encoding='utf-8') as fw:
        json.dump(fc, fw)

This is a bit dumb but it works for now ¯_(ツ)_/¯

mcanouil commented 3 months ago

@rareal Thanks for sharing. Are you using it as a pre-render script? You could leverage the pre-render scripts and QUARTO_PROJECT_INPUT_FILES to run your script on all Jupyter Notebooks at render time.

(This is still a workaround, and a fix that won't require this from users is still planned.)
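
A minimal sketch of that pre-render idea, assuming the script is registered under `project: pre-render:` in `_quarto.yml` and that `QUARTO_PROJECT_INPUT_FILES` holds a newline-separated list of input files. The file name and function names are hypothetical, and note that this rewrites the notebooks in place:

```python
# clean_notebooks.py (hypothetical pre-render script)
import json
import os
from pathlib import Path

DATABRICKS_KEY = "application/vnd.databricks.v1+cell"


def strip_databricks_metadata(path: Path) -> bool:
    """Remove the Databricks cell metadata from one notebook; return True if it changed."""
    notebook = json.loads(path.read_text(encoding="utf-8"))
    changed = False
    for cell in notebook.get("cells", []):
        if cell.get("metadata", {}).pop(DATABRICKS_KEY, None) is not None:
            changed = True
    if changed:
        path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return changed


def notebooks_from_environment() -> list[Path]:
    """Pick the .ipynb entries out of the input list passed to pre-render scripts."""
    listing = os.environ.get("QUARTO_PROJECT_INPUT_FILES", "")
    names = (line.strip() for line in listing.splitlines())
    return [Path(name) for name in names if name.endswith(".ipynb")]


if __name__ == "__main__":
    for notebook_path in notebooks_from_environment():
        if strip_databricks_metadata(notebook_path):
            print(f"cleaned {notebook_path}")
```

Files are only rewritten when something was actually removed, so notebooks that are already clean keep their timestamps and produce no diff.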

JanPalasek commented 3 months ago

@rareal @mcanouil Can I render it this way without modifying files? I don't want changes in my git, I want to edit it only for the rendering.

mcanouil commented 3 months ago

Quarto doesn't commit/push anything. You could add a post-render script that runs `git reset --hard HEAD`.

JanPalasek commented 3 months ago

@mcanouil That's quite dangerous, one could lose a lot of work using this approach. I was wondering if there was a possibility to avoid storing files and just render it without saving. Jupyter supports some kind of preprocessors, so maybe there's a possibility to add a custom one. This also might be compatible with quarto preview.

mcanouil commented 3 months ago

@JanPalasek "dangerous" is quite relative. It is the case only if you are in the habit of not committing your changes. Committing regularly as you work (not necessarily pushing) is the basis of a Git workflow.

You can also make a script that copies your whole project into a temporary directory with the pre-render script. There are plenty of workarounds, design what you want/need.
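
One way to read the "copy to a temporary directory" suggestion is a small wrapper run before rendering, rather than a pre-render script proper. The sketch below is an assumption along those lines (the function name `make_clean_copy` is hypothetical): it copies the project, skips `.git`, and strips the Databricks key only in the copy, so the working tree is never touched.

```python
import json
import shutil
import tempfile
from pathlib import Path

DATABRICKS_KEY = "application/vnd.databricks.v1+cell"


def make_clean_copy(project_dir: str) -> Path:
    """Copy the project to a temporary directory and clean the notebooks in the copy."""
    scratch = Path(tempfile.mkdtemp(prefix="quarto-clean-")) / "project"
    # The destination must not exist yet; .git is skipped to keep the copy small
    shutil.copytree(project_dir, scratch, ignore=shutil.ignore_patterns(".git"))
    for notebook_path in scratch.rglob("*.ipynb"):
        notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
        for cell in notebook.get("cells", []):
            cell.get("metadata", {}).pop(DATABRICKS_KEY, None)
        notebook_path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return scratch
```

The returned path can then be passed to `quarto render`, and the temporary directory deleted afterwards.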

cscheid commented 3 months ago

> I was wondering if there was a possibility to avoid storing files and just render it without saving

What do you mean "avoid storing files"? I'm not following. I don't understand "just render it without saving" either; rendering a file has to save something on the filesystem by definition. Can you clarify?

JanPalasek commented 3 months ago

@mcanouil Will it work with quarto preview? Or is there some way to make it work?

@cscheid I was wondering if there was some pre-processing functionality like nbconvert has (https://nbconvert.readthedocs.io/en/latest/api/preprocessors.html). You define a class that pre-processes each cell individually and, for example, deletes some metadata from it right before the output is generated. It doesn't store the output anywhere, it pre-processes it in-memory. The pre-processed result is then used to render the HTML. The main advantage is that no additional file is created, so we don't need to modify the original ipynb or deal with git.

cscheid commented 3 months ago

> It doesn't store the output anywhere, it pre-processes it in-memory.

Quarto has filters, which offer this kind of functionality after the engine has executed. But we don't have anything comparable to execute before engines.

EDIT: I was wrong, see below and https://quarto.org/docs/extensions/nbfilter.html more specifically.

JanPalasek commented 3 months ago

@cscheid Thank you, ipynb filters work. I used @rareal's approach and applied it as a filter:

# filter.py
import sys

import nbformat

# read notebook from stdin
nb = nbformat.reads(sys.stdin.read(), as_version=4)

# remove the Databricks-specific metadata from each cell
for cell in nb.cells:
    cell["metadata"].pop("application/vnd.databricks.v1+cell", None)

# write notebook to stdout
nbformat.write(nb, sys.stdout)

This solution works with both quarto render and quarto preview, and no files are modified, so there is nothing to git reset --hard. You can even apply this filter globally through a _metadata.yml file or the Quarto project config.

cscheid commented 3 months ago

Thank you for pointing this out!

Quarto is big and I completely missed this. For everyone else: https://quarto.org/docs/extensions/nbfilter.html