JanPalasek opened 3 months ago
Can you please share the notebook? It's hard for us to go from a screenshot alone.
It's shared as json in a details HTML tag under the "step to reproduce" section ;)
edit: I can reproduce with the file, but I cannot execute the cells, as the notebook itself is not reproducible or self-contained.
@JanPalasek Would you mind sharing a Notebook that is self-contained and does not require external files? Right now, we can't execute the Notebook.
@JanPalasek Sorry I misread your message!
@mcanouil if the output is present in the notebook we shouldn't need any more repro than that.
I am not very familiar with how Notebooks work, and I thought the metadata in question might have something to do with the computation itself. Being able to execute the Notebook would have answered that question.
By design, quarto allows users to render "computed" .ipynb files. This is necessary to support workflows where .ipynb files are created in computing environments different from the rendering environment. Execution in Google Colab, Hugging Face Spaces, or other high-performance computing environments is most often how this happens. It's a common Jupyter workflow that we've decided to support.
As a result, we don't need to compute to see the bug. This report is reproducible: you've even reproduced it! It's a valid .ipynb file that produces broken HTML when we call `quarto render file.ipynb --to html --no-execute`.
This has to do with quarto getting confused at the metadata cell being emitted by databricks having hierarchical keys. Consider the output of quarto convert
on that file:
$ quarto convert 9089.ipynb
...
$ cat 9089.qmd
---
title: DataBricks Notebooks
jupyter: python3
---
## Introduction
In this notebook, we try Quarto with DataBricks.
## Chapter
In the first chapter, we try multiple commands and observe their results.
```{python}
#| application/vnd.databricks.v1+cell: {cellMetadata: {byteLimit: 2048000, rowLimit: 10000}, inputWidgets: {}, nuid: 7039bc23-d898-4506-b24d-8f1002a66d18, showTitle: false, title: ''}
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
```
...
That nested YAML entry is going to eventually confuse Pandoc mightily.
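To see concretely what happens to that entry, here is a small stdlib-only sketch (not Quarto's actual code) that serializes the nested Databricks metadata from the report the same way it later shows up as a single JSON-string attribute on the Pandoc Div in the `.html.md` output:

```python
import json

# Nested cell metadata copied verbatim from the reported notebook.
meta = {
    "cellMetadata": {"byteLimit": 2048000, "rowLimit": 10000},
    "inputWidgets": {},
    "nuid": "7039bc23-d898-4506-b24d-8f1002a66d18",
    "showTitle": False,
    "title": "",
}

# The nested mapping gets flattened into one compact JSON string, which is
# then stuffed into a single Div attribute named
# "application/vnd.databricks.v1+cell" in the intermediate markdown.
attr = json.dumps(meta, separators=(",", ":"))
print(attr)
```

A mapping-valued attribute like this is what Pandoc later chokes on; a scalar attribute value would round-trip fine.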
From there, edit the .qmd file to add `keep-md: true`, and render it like so:
$ quarto render 9089.qmd --no-execute --to html
The `.html.md` file now points to where things are going to go bad:
```markdown
In this notebook, we try Quarto with DataBricks.

In the first chapter, we try multiple commands and observe their results.

::: {#7474b8c9 .cell application/vnd.databricks.v1+cell='{"cellMetadata":{"byteLimit":2048000,"rowLimit":10000},"inputWidgets":{},"nuid":"7039bc23-d898-4506-b24d-8f1002a66d18","showTitle":false,"title":""}'}
df = spark.read.table("samples.nyctaxi.trips")
df.show(5)
:::
...
```
And now we can plainly see the problem.
Fundamentally, this is a bad bug on our side, and we need to come up with a better design.
@cscheid What is currently the best workaround? I have a Quarto Website that contains some notebooks rendered using databricks.
I found that for my exported Databricks notebooks the metadata has the key `application/vnd.databricks.v1+cell`, so I'm just reading the .ipynb files (they are just JSON files) and removing it before `quarto render`.
Like this:
```python
import json

def clean_dbk_exported_ipynb_file(file_path):
    print(f"working on file: {file_path}")
    # read file
    with open(file_path) as f:
        fc = json.load(f)
    # remove cell metadata
    for cell in fc['cells']:
        if cell['metadata'].get('application/vnd.databricks.v1+cell'):
            _ = cell['metadata'].pop('application/vnd.databricks.v1+cell')
    # write back
    with open(file_path, 'w') as fw:
        json.dump(fc, fw)
```
This is a bit dumb but it works for now ¯\\_(ツ)_/¯
@rareal Thanks for sharing. Are you using it as a pre-render script?
You could leverage pre-render scripts and `QUARTO_PROJECT_INPUT_FILES` to run your script on all Jupyter Notebooks at render time.
(This is still a workaround; a fix that won't require this from users is still planned.)
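As a sketch of that suggestion (the script and function names here are hypothetical; `QUARTO_PROJECT_INPUT_FILES` is the newline-separated list of input files Quarto sets during project renders), a pre-render script registered under `project: pre-render:` in `_quarto.yml` could look like:

```python
# clean_notebooks.py (hypothetical name) -- register it in _quarto.yml:
#   project:
#     pre-render: clean_notebooks.py
import json
import os

DATABRICKS_KEY = "application/vnd.databricks.v1+cell"

def clean_databricks_metadata(path):
    """Strip the Databricks cell metadata from an .ipynb file in place."""
    with open(path) as f:
        nb = json.load(f)
    for cell in nb.get("cells", []):
        cell.get("metadata", {}).pop(DATABRICKS_KEY, None)
    with open(path, "w") as f:
        json.dump(nb, f)

# QUARTO_PROJECT_INPUT_FILES is only set when rendering a project, so the
# loop is a no-op when the script is run standalone.
for path in os.environ.get("QUARTO_PROJECT_INPUT_FILES", "").splitlines():
    if path.endswith(".ipynb"):
        clean_databricks_metadata(path)
```

Note this still rewrites the .ipynb files on disk, so the Git-cleanliness concern discussed below remains.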
@rareal @mcanouil Can I render it this way without modifying files? I don't want changes in my Git working tree; I want the edit to apply only for the rendering.
Quarto doesn't commit/push anything.
You could add a post-render script that runs `git reset --hard HEAD`.
@mcanouil That's quite dangerous; one could lose a lot of work with this approach. I was wondering if there is a way to avoid storing files and just render without saving. Jupyter supports some kind of preprocessors, so maybe there's a possibility to add a custom one. That might also be compatible with `quarto preview`.
@JanPalasek "dangerous" is quite relative. It is the case only if you have the habits to not commit your changes. Committing your changes regularly when making changes (not necessarily pushing them) is the basic of a Git workflow.
You can also make a script that copy your whole project into a temporary directory with the pre-render script. There are plenty of workarounds, design what you want/need.
I was wondering if there was possibility to avoid storing files and just render it without saving
What do you mean "avoid storing files"? I'm not following. I don't understand "just render it without saving" either; rendering a file has to save something on the filesystem by definition. Can you clarify?
@mcanouil Will it work with `quarto preview`? Or is there some way to make it work?
@cscheid I was wondering if there is pre-processing functionality like the one `nbconvert` has: https://nbconvert.readthedocs.io/en/latest/api/preprocessors.html. You define a class that pre-processes each cell individually and, e.g., right before generating the output, deletes some metadata from it. It doesn't store the output anywhere; it pre-processes it in memory, and the pre-processed result is then used to render the HTML. The main advantage is that we create no additional files, we don't need to modify the original .ipynb, and we don't need to deal with Git.
It doesn't store the output anywhere, it pre-processes it in-memory.
Quarto has filters, which offer this kind of functionality after the engine has executed. But we don't have anything comparable to execute before engines.
EDIT: I was wrong, see below and https://quarto.org/docs/extensions/nbfilter.html more specifically.
@cscheid Thank you, ipynb filters work. I used @rareal's approach and applied it as a filter:
```python
# filter.py
import sys

import nbformat

# read notebook from stdin
nb = nbformat.reads(sys.stdin.read(), as_version=4)

# remove the Databricks metadata from each cell
for cell in nb.cells:
    if cell["metadata"].get("application/vnd.databricks.v1+cell"):
        cell["metadata"].pop("application/vnd.databricks.v1+cell")

# write notebook to stdout
nbformat.write(nb, sys.stdout)
```
This solution works with both `quarto render` and `quarto preview`, and there are no files that one needs to `git reset --hard`. You can even reference this filter globally in a `_metadata.yml` file or in the Quarto project config.
Thank you for pointing this out!
Quarto is big and I completely missed this. For everyone else: https://quarto.org/docs/extensions/nbfilter.html
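For reference, registering the filter looks like this in `_quarto.yml` (or `_metadata.yml`), per the nbfilter docs linked above; the filter receives the notebook JSON on stdin and must write it back to stdout:

```yaml
ipynb-filters:
  - filter.py
```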
Bug description
When I render a Jupyter Notebook exported from a DataBricks notebook, the HTML output shows the cell metadata for some reason.
Note that the problem is fixed by clearing metadata in the cells.
Steps to reproduce
Content of the ipynb file
```json { "cells": [ { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "493a7a31-3e86-4d2b-813c-951fc7593ea3", "showTitle": false, "title": "" } }, "source": [ "# DataBricks Notebooks\n", "\n", "## Introduction\n", "\n", "In this notebook, we try Quarto with DataBricks.\n", "\n", "## Chapter\n", "\n", "In the first chapter, we try multiple commands and observe their results." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "7039bc23-d898-4506-b24d-8f1002a66d18", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "stream", "name": "stdout", "output_type": "stream", "text": [ "+--------------------+---------------------+-------------+-----------+----------+-----------+\n|tpep_pickup_datetime|tpep_dropoff_datetime|trip_distance|fare_amount|pickup_zip|dropoff_zip|\n+--------------------+---------------------+-------------+-----------+----------+-----------+\n| 2016-02-14 16:52:13| 2016-02-14 17:16:04| 4.94| 19.0| 10282| 10171|\n| 2016-02-04 18:44:19| 2016-02-04 18:46:00| 0.28| 3.5| 10110| 10110|\n| 2016-02-17 17:13:57| 2016-02-17 17:17:55| 0.7| 5.0| 10103| 10023|\n| 2016-02-18 10:36:07| 2016-02-18 10:41:45| 0.8| 6.0| 10022| 10017|\n| 2016-02-22 14:14:41| 2016-02-22 14:31:52| 4.51| 17.0| 10110| 10282|\n+--------------------+---------------------+-------------+-----------+----------+-----------+\nonly showing top 5 rows\n\n" ] } ], "source": [ "df = spark.read.table(\"samples.nyctaxi.trips\")\n", "df.show(5)" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "3754cc12-1f1f-405e-8525-b923f1b42a1e", "showTitle": false, "title": "" } }, "source": [ "This is text in-between 
the commands." ] }, { "cell_type": "code", "execution_count": 0, "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "21c1cb83-83cc-40c8-9a8b-f5378d3f29be", "showTitle": false, "title": "" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from databricks.sdk.runtime import dbutils\n", "dbutils.fs.ls(\"dbfs:/Workspace/Users/\")" ] }, { "cell_type": "markdown", "metadata": { "application/vnd.databricks.v1+cell": { "cellMetadata": { "byteLimit": 2048000, "rowLimit": 10000 }, "inputWidgets": {}, "nuid": "d5627e1f-aecf-486b-b6f5-0eb6e3144e6d", "showTitle": false, "title": "" } }, "source": [ "## Conclusion\n", "\n", "Currently, Quarto does not fully work, at least not rendering." ] } ], "metadata": { "application/vnd.databricks.v1+notebook": { "dashboards": [], "language": "python", "notebookMetadata": { "pythonIndentUnit": 4 }, "notebookName": "dbx_exploration", "widgets": {} } }, "nbformat": 4, "nbformat_minor": 0 }
```
Expected behavior
No metadata present in the html output.
Actual behavior
Metadata are visible in the output report. I believe Quarto should ignore cell metadata that it does not recognize, but maybe I'm wrong.
Your environment
Linux 2c9ff6fff677 4.18.0-425.19.2.el8_7.x86_64 #1 SMP Tue Apr 4 22:38:11 UTC 2023 x86_64 GNU/Linux
Quarto check output
```
Quarto 1.4.551
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.11: OK
      Dart Sass version 1.69.5: OK
      Deno version 1.37.2: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.4.551
      Path: /opt/quarto/1.4.551/bin

[✓] Checking tools....................OK
      TinyTeX: (not installed)
      Chromium: (not installed)

[✓] Checking LaTeX....................OK
      Tex: (not detected)

[✓] Checking basic markdown render....OK

[✓] Checking Python 3 installation....OK
      Version: 3.11.8
      Path: /usr/local/bin/python3
      Jupyter: 5.7.2
      Kernels: python3

(-) Checking Jupyter engine render....[IPKernelApp] WARNING | Unknown error in handling startup files:
[✓] Checking Jupyter engine render....OK

[✓] Checking R installation...........(None)
```