srstevenson / nb-clean

Clean Jupyter notebooks for version control. Remove metadata, outputs, and execution counts with Git and pre-commit support.
https://pypi.org/project/nb-clean
ISC License

When preserving output, remove the execution count on outputs of type execute_result #158

Closed SerLizar closed 1 year ago

SerLizar commented 1 year ago

When using the option to preserve outputs, the execution count of the output is currently preserved.

I have tested it, and both notebook and lab render without problems when this field is set to null, like the execution counts of the cells themselves.

It is only relevant to outputs of type execute_result since according to the spec, it's the only type of output with that field.

yasirroni commented 1 year ago

Are you sure? On my side, it is set to null.

You can check the code https://github.com/srstevenson/nb-clean/blob/5ddf4f785d614a2d8d37aa47a07b48d474a13787/src/nb_clean/__init__.py#L227

SerLizar commented 1 year ago

Yes, that is the execution count of the cell itself, which, as expected, is set to null. However, outputs of type execute_result have an additional execution_count field of their own, which is why this is only a problem when preserving output.

I made a simple example and ran nb-clean clean -o on it:

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5f785788-1ef4-4777-9608-fb653c0e8ce2",
   "metadata": {},
   "source": [
    "# output_type = execute_result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6b0196d-83b0-4fdc-9f34-1f5d97018f20",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Hello World!'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"Hello World!\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3b66637-ccf8-4541-8572-a76e8f87192f",
   "metadata": {},
   "source": [
    "# output_type = stream"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f427da76-86a8-4390-b469-69c584b26c35",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Hello World!\n"
     ]
    }
   ],
   "source": [
    "print(\"Hello World!\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}

Unfortunately, I do not know whether a single code cell can have multiple outputs of type execute_result.
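For illustration, the requested behaviour can be sketched as follows. This is a minimal sketch, not nb-clean's actual implementation: it assumes the notebook has been loaded as a plain dict (for example with json.load), and the function name clear_output_execution_counts is hypothetical. Because it visits every output of every code cell, it does not matter whether a cell has one execute_result output or several.

```python
import json


def clear_output_execution_counts(notebook: dict) -> dict:
    """Set execution_count to None on every execute_result output, in place.

    Hypothetical helper for illustration; not part of nb-clean's API.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            # Per the thread, execute_result is the only output type
            # that carries an execution_count field.
            if output.get("output_type") == "execute_result":
                output["execution_count"] = None
    return notebook


# A cut-down version of the example notebook above, as a Python dict.
example = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [
                {
                    "data": {"text/plain": ["'Hello World!'"]},
                    "execution_count": 1,
                    "metadata": {},
                    "output_type": "execute_result",
                },
                {
                    "name": "stdout",
                    "output_type": "stream",
                    "text": ["Hello World!\n"],
                },
            ],
            "source": ['"Hello World!"'],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = clear_output_execution_counts(example)
# json.dumps renders Python's None as null, matching the notebook file format.
print(json.dumps(cleaned["cells"][0]["outputs"][0], indent=1))
```

The stream output is left untouched, since it has no execution_count field to clear.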

yasirroni commented 1 year ago

That is strange. In my editor, notebooks do not behave like that. Could you also share your editor of choice, along with your OS and so on?

SerLizar commented 1 year ago

Sure thing, do let me know if I'm missing something you may need.

OS: Windows 10 Pro Version 21H2 (OS Build 19044.2130)
Editor: Jupyter Lab
Python: Python 3.10.7
Terminal: PowerShell 7.2.6 (in Windows Terminal)
Installed Packages (in a venv virtual environment):

Package              Version
-------------------- -----------
anyio                3.6.1
argon2-cffi          21.3.0
argon2-cffi-bindings 21.2.0
asttokens            2.0.8
attrs                22.1.0
Babel                2.10.3
backcall             0.2.0
beautifulsoup4       4.11.1
black                22.8.0
bleach               5.0.1
certifi              2022.9.24
cffi                 1.15.1
charset-normalizer   2.1.1
click                8.1.3
colorama             0.4.5
contourpy            1.0.5
cycler               0.11.0
debugpy              1.6.3
decorator            5.1.1
defusedxml           0.7.1
entrypoints          0.4
executing            1.1.0
fastjsonschema       2.16.2
fonttools            4.37.4
idna                 3.4
imbalanced-learn     0.9.1
ipykernel            6.16.0
ipython              8.5.0
ipython-genutils     0.2.0
jedi                 0.18.1
Jinja2               3.1.2
joblib               1.2.0
json5                0.9.10
jsonschema           4.16.0
jupyter_client       7.3.5
jupyter-core         4.11.1
jupyter-server       1.19.1
jupyterlab           3.4.7
jupyterlab-pygments  0.2.2
jupyterlab_server    2.15.2
kiwisolver           1.4.4
lxml                 4.9.1
MarkupSafe           2.1.1
matplotlib           3.6.1
matplotlib-inline    0.1.6
mistune              2.0.4
mypy-extensions      0.4.3
nb-clean             2.3.0
nbclassic            0.4.4
nbclient             0.6.8
nbconvert            7.0.0
nbformat             5.6.1
nest-asyncio         1.5.6
notebook             6.4.12
notebook-shim        0.1.0
numpy                1.23.3
packaging            21.3
pandas               1.5.0
pandocfilters        1.5.0
parso                0.8.3
pathspec             0.10.1
pickleshare          0.7.5
Pillow               9.2.0
pip                  22.3
platformdirs         2.5.2
prometheus-client    0.14.1
prompt-toolkit       3.0.31
psutil               5.9.2
pure-eval            0.2.2
pycparser            2.21
Pygments             2.13.0
pyparsing            3.0.9
pyrsistent           0.18.1
python-dateutil      2.8.2
pytz                 2022.4
pywin32              304
pywinpty             2.0.8
pyzmq                24.0.1
requests             2.28.1
scikit-learn         1.1.2
scipy                1.9.2
seaborn              0.12.0
Send2Trash           1.8.0
setuptools           63.2.0
six                  1.16.0
sniffio              1.3.0
soupsieve            2.3.2.post1
stack-data           0.5.1
terminado            0.16.0
threadpoolctl        3.1.0
tinycss2             1.1.1
tokenize-rt          4.2.1
tomli                2.0.1
tornado              6.2
traitlets            5.4.0
urllib3              1.26.12
wcwidth              0.2.5
webencodings         0.5.1
websocket-client     1.4.1
wheel                0.37.1

And again, this is only a problem when an output has the type execute_result, that is, when the notebook displays the value of the last expression in the code cell, as most data science examples do (instead of explicitly calling print).

yasirroni commented 1 year ago

Reproduced at https://github.com/srstevenson/nb-clean/pull/160

yasirroni commented 1 year ago

@SerLizar Can you test my solution https://github.com/srstevenson/nb-clean/pull/161?

SerLizar commented 1 year ago

Tested it. I don't have an extensive collection of notebooks, but it worked perfectly on all the ones I tried!

yasirroni commented 1 year ago

Let's hope it will be merged by @srstevenson.

SerLizar commented 1 year ago

Yeah, the only possible extra addition would be checking that those fields are already null with the check option, but it's not as important.
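As a rough sketch of what such a check could look like, assuming the notebook is available as a plain dict: the function name has_output_execution_counts is hypothetical and is not taken from nb-clean's code.

```python
def has_output_execution_counts(notebook: dict) -> bool:
    """Return True if any execute_result output still has a non-null execution_count.

    Hypothetical helper for illustration; not part of nb-clean's API.
    """
    return any(
        output.get("execution_count") is not None
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
        for output in cell.get("outputs", [])
        if output.get("output_type") == "execute_result"
    )


def make_notebook(count):
    """Build a one-cell notebook whose execute_result output has the given count."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": None,
                "metadata": {},
                "outputs": [
                    {
                        "data": {"text/plain": ["2"]},
                        "execution_count": count,
                        "metadata": {},
                        "output_type": "execute_result",
                    }
                ],
                "source": ["1 + 1"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


dirty = make_notebook(1)
clean = make_notebook(None)
print(has_output_execution_counts(dirty), has_output_execution_counts(clean))
```

A check mode would then report the notebook as unclean whenever this returns True while outputs are being preserved.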

yasirroni commented 1 year ago

> Yeah, the only possible extra addition would be checking that those fields are already null with the check option, but it's not as important.

Updated, implemented in https://github.com/srstevenson/nb-clean/blob/e86118f11ed6880e24237db0a973fde62cf140aa/src/nb_clean/__init__.py#L239


But I can't confirm whether it skips fields that are already null or still replaces them.
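Whichever of the two the linked code does, the resulting notebook should be the same, since writing null over an existing null changes nothing. The sketch below illustrates this on a plain dict; the function null_counts and its skip_if_null parameter are hypothetical and are not taken from nb-clean.

```python
import copy


def null_counts(notebook: dict, skip_if_null: bool) -> int:
    """Null execution_count on execute_result outputs; return the number of writes.

    Hypothetical helper for illustration; not part of nb-clean's API.
    """
    writes = 0
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") != "execute_result":
                continue
            already_null = (
                "execution_count" in output and output["execution_count"] is None
            )
            if skip_if_null and already_null:
                continue  # leave the existing null alone
            output["execution_count"] = None
            writes += 1
    return writes


already_clean = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [
                {
                    "data": {"text/plain": ["2"]},
                    "execution_count": None,
                    "metadata": {},
                    "output_type": "execute_result",
                }
            ],
            "source": ["1 + 1"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

skipping = copy.deepcopy(already_clean)
replacing = copy.deepcopy(already_clean)
skip_writes = null_counts(skipping, skip_if_null=True)
replace_writes = null_counts(replacing, skip_if_null=False)

# The write counts differ, but both results equal the untouched input.
print(skip_writes, replace_writes, skipping == replacing == already_clean)
```

The two strategies differ only in how many assignments they perform, so the distinction should not affect the cleaned file.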