srstevenson / nb-clean

Clean Jupyter notebooks of outputs, metadata, and empty cells, with Git integration
https://pypi.org/project/nb-clean/
ISC License
128 stars, 18 forks

Cannot clean notebooks encountering “NotJSONError” with plotly js code inside #273

Closed: firezym closed this issue 1 month ago

firezym commented 2 months ago

@srstevenson Thanks for this awesome repo. I am having trouble cleaning notebooks that contain HTML/JS output. The detailed error is below. Please kindly check it out :)

System :

Windows Server 2022 Datacenter 21H2 20348.2402

Core Packages :

jupyterlab >= 4.0.10
nbformat 5.9.2
nb-clean 3.2.0
plotly 5.18.0

Core Commands :

nb-clean add-filter
git add plotly-example-2.ipynb

It works well on notebooks without plotly, but I get an error from this notebook, which has plotly's HTML/JS snippets in it. Attached: plotly-example-2.zip

Error :

I checked the JSON format. The error occurs on line 29, which is the beginning of a chunk of JS snippet that has confusing "" in it.

(dev) PS D:\Dapu\prod> git add plotly-example-2.ipynb
Traceback (most recent call last):
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 19, in parse_json
    nb_dict = json.loads(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 29 column 301224 (char 302194)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\ProgramData\miniconda3\envs\dev\Scripts\nb-clean.exe\__main__.py", line 7, in <module>
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 298, in main
    args.func(args)
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nb_clean\cli.py", line 150, in clean
    notebook = nbformat.read(input_, as_version=nbformat.NO_CONVERT)  # type: ignore[no-untyped-call]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 174, in read
    return reads(buf, as_version, capture_validation_error, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\__init__.py", line 92, in reads
    nb = reader.reads(s, **kwargs)
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 75, in reads
    nb_dict = parse_json(s, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\ProgramData\miniconda3\envs\dev\Lib\site-packages\nbformat\reader.py", line 25, in parse_json
    raise NotJSONError(message) from e
nbformat.reader.NotJSONError: Notebook does not appear to be JSON: '{\n "cells": [\n  {\n   "cell_type": "c...
error: external filter 'nb-clean clean' failed 1
error: external filter 'nb-clean clean' failed
warning: in the working copy of 'plotly-example-2.ipynb', LF will be replaced by CRLF the next time Git touches it
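
One way to see what the parser rejects is to print the text around the offset reported by the JSONDecodeError. The following is a minimal sketch using only the standard library json module; the helper name json_error_context and its width parameter are illustrative and are not part of nb-clean or nbformat.

```python
import json


def json_error_context(text, width=40):
    """Return (message, snippet) around the first JSON decode error.

    Returns None when the text parses cleanly. The snippet covers up to
    `width` characters on each side of the reported error position.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - width, 0)
        return err.msg, text[start:err.pos + width]
    return None


# Small demonstration on a deliberately broken document.
print(json_error_context('{"a": }'))
```

To be useful here, the text passed in would need to be the same content the filter receives, not a copy decoded differently.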

Cannot reproduce using nbformat directly in Python:

When I load the notebook with nbformat directly, the error does not occur. The whole HTML content in notebook['cells'][0]['outputs'][0]['data']['text/html'] seems to load fine.

import nbformat
filename = "plotly-example-2.ipynb"
with open(filename, 'r', encoding='utf-8') as f:
    notebook = nbformat.read(f, as_version=nbformat.NO_CONVERT)

notebook['cells'][0]['outputs'][0]['data']['text/html']

srstevenson commented 2 months ago

I'm not able to reproduce this using the same versions of nb-clean and nbformat, either using the Git filter or invoking nb-clean manually:

$ nb-clean check plotly-example-2.ipynb
plotly-example-2.ipynb cell 0: metadata
plotly-example-2.ipynb cell 0: execution count
plotly-example-2.ipynb cell 0: outputs
plotly-example-2.ipynb metadata: language_info.version

However, I'm on Linux whereas you're on Windows and there's a warning from Git that LF line endings will be replaced with CRLF line endings on checkout in your output. To see if the line ending conversion is involved, do you have the same error if you run nb-clean outside the Git filter (nb-clean check plotly-example-2.ipynb)?
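
To check the line-ending hypothesis, it can help to count which line endings the file on disk actually contains. This is a minimal sketch that assumes the notebook is read as raw bytes; count_line_endings is a hypothetical helper, not an nb-clean function.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Demonstration on an in-memory sample. For a real file, something like
# count_line_endings(Path("plotly-example-2.ipynb").read_bytes()) would apply.
print(count_line_endings(b"a\r\nb\nc\r"))
```

A file with only LF endings would report zero for both "crlf" and "cr".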

firezym commented 2 months ago

Running $ nb-clean check plotly-example-2.ipynb on the Windows PowerShell command line also works for me and returns the same results as yours.

But when I use $ git add plotly-example-2.ipynb, I still get the same error shown above.
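
One difference between the two invocations is that nb-clean check is given a file path, whereas a Git clean filter receives the file content on standard input. A rough way to approximate that second path outside Git is to pipe the bytes into a child Python process. This sketch uses the standard library json module as a stand-in for nbformat (an assumption made to keep it dependency-free), and parses_from_stdin is a hypothetical helper name.

```python
import subprocess
import sys

# Child program: parse JSON from standard input, as a stand-in for what a
# clean filter does with the content Git pipes to it.
CHILD = "import json, sys; json.load(sys.stdin)"


def parses_from_stdin(data: bytes) -> bool:
    """Return True if a child Python process can parse `data` from its stdin."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=data,
        capture_output=True,
    )
    return result.returncode == 0


# Demonstration with a tiny in-memory document.
print(parses_from_stdin(b'{"cells": []}'))
```

If the real notebook bytes parse from a file but not through a pipe like this, the problem would lie in how the content reaches the process rather than in the notebook itself.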

My CRLF setting in the Git global config file C:\Users\Administrator\.gitconfig is as follows:

[core]
    autocrlf = input

Should I alter the autocrlf setting to something else?

srstevenson commented 2 months ago

According to this PR in another project, Jupyter notebooks are always created with LF line endings on Windows. That suggests adding the following to the .gitattributes file in your repository (if you've not worked with the .gitattributes file before, there's documentation on its purpose and the available options here):

*.ipynb  text eol=lf

srstevenson commented 1 month ago

I'll assume configuring .gitattributes worked: if you have any other trouble please open a new issue.