jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
265 stars 151 forks source link

Should kernel info be included in notebook signature computation? #296

Open afshin opened 2 years ago

afshin commented 2 years ago

@divyansshhh opened an issue in JupyterLab that is more appropriate for jupyter/nbformat, so this issue is meant to replace the original in JupyterLab.

Problem

At present, when computing the signature of a notebook, the compute_signature function (ref) takes into consideration the kernel info. I believe this shouldn't be done because it has no effect on the safety of the output.

Proposed Solution

If there is no reason to include kernel info while computing the notebook signature then it should be stripped out.

Additional context

In our use-case we have a custom contents manager that can be used to access notebooks from a remote host in a read-only mode. Now, if the original notebook was authored with say kernel1 then opening that in a host which doesn't have a kernel named kernel1 makes the notebook untrusted.

Carreau commented 2 years ago

I believe this shouldn't be done because it has no effect on the safety of the output

I'm not sure this is true , because I'm not sure in general one should consider only the output as being unsafe for the trust model in general.

While the trust in jupyter lab and jupyter server is used in this way it does not mean nbformat is used in a different way by someone else.

Trust can be for example used to know whether to run a notebook in a cron as root, maybe. And then if the kernel can be changed, why not have it point to an attacker controlled executable. Or have a code that means different things in 2 different kernels.

krassowski commented 2 years ago

Loosely related previous discussion: compute_signature should skip all transient properties not only signature #234

I think that there are multiple things that the current trust mechanism attempts to be:

  1. integrity certificate ("this notebook or its outputs were not modified manually in another application")
    • improvements to this area (including a timestamp, previous checkpoint hash, etc) would be useful for complying with the letter of the FDA guidance on audit trial for drug trial analyses
  2. protection from unwanted output (e.g. JavaScript) execution on startup ("it is safe to open this notebook if you trust the sender") or Markdown content, analogous to "trust" concept for executable macros in Office
  3. protect from kernel-side runtime security pitfalls (e.g. swapped kernelspec, embedded/invisible code) ("it is safe to run this notebook if you trust the sender")

Should we split those into multiple signatures?

I would also argue that the ecosystem is showing signs of being annoyed with the current trust implementation. Any security mechanism which is sufficiently annoying will be circumvented by users (as the common example of complexity requirements on passwords). Here are some quick examples of this happening in the wild with the notebook trust:

I think that we could introduce a new granular trust system covering the three use cases as described above in backward-compatible way, so the roll-out would be gradual and would not too be costly.

Edit clarifying note: for (1) I think that kernel info should be included because different kernels could lead to different results, so a change here would indicate (depending on use-case) some form of tampering or write/read problem (2) should not include kernel info as it is irrelevant and annoying (3) should include kernel info due to reasons outlined above.

krassowski commented 2 years ago

Also linking to a previous discussion on trust in Real Time Collaboration scenario, where per-output/widget trust issue was discussed: https://github.com/jupyterlab/jupyterlab/pull/11494.

Carreau commented 2 years ago

For me most of these are either bugs or conflation of "save" with "export". Currently the ipynb is becoming both a persistent store for application state and an exchange format.

IMHO the jupyter server should store in whatever binary format that is incompatible between version somewhere, and over option to "export", or potentially auto-export version of the files in .md, .ipynb or whatever you like.

This is how most applications work today when you have complex structure. Especially with RTC that needs complex informations there is no reason to try to shove things into the ipynb.