jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

compute_signature should skip all transient properties not only signature #234

Open · nanoant opened this issue 3 years ago

nanoant commented 3 years ago

JupyterLab has recently started unconditionally adding orig_nbformat: 1 to edited notebooks, since https://github.com/jupyterlab/jupyterlab/pull/10118. This breaks notebook trust: orig_nbformat is included in sign.compute_signature but is later stripped by v4/rwbase.strip_transient on save, so the digest computed before saving does not match the one recomputed when the notebook is loaded again.
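
For illustration, a minimal sketch of how the mismatch shows up, assuming nbformat's NotebookNotary (with its default secret location), nbformat.v4.new_notebook, and the strip_transient helper mentioned above:

import copy

import nbformat
from nbformat.sign import NotebookNotary
from nbformat.v4.rwbase import strip_transient

nb = nbformat.v4.new_notebook()
nb.metadata["orig_nbformat"] = 1           # what JupyterLab started adding on edit

notary = NotebookNotary()                  # default secret location
digest_before_save = notary.compute_signature(nb)

saved = copy.deepcopy(nb)
strip_transient(saved)                     # applied by the v4 writer on save
digest_after_save = notary.compute_signature(saved)

# The digests differ, so the notebook loses trust after a save/reload.
assert digest_before_save != digest_after_save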

sign.compute_signature should not only remove the stored signature (signature_removed) but also skip all transient metadata, so that the signature is computed on exactly the same structure that will be saved to disk.
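
One possible shape for this, as a sketch only (compute_signature_as_saved is a made-up name, and whether the stripping belongs inside compute_signature itself or in its callers is exactly the open design question):

import copy

from nbformat.sign import NotebookNotary
from nbformat.v4.rwbase import strip_transient

def compute_signature_as_saved(notary: NotebookNotary, nb):
    """Hypothetical helper: hash the notebook as it will be written to disk."""
    as_saved = copy.deepcopy(nb)   # don't mutate the in-memory notebook
    strip_transient(as_saved)      # drop orig_nbformat and other transient fields, as the writer does
    return notary.compute_signature(as_saved)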

I am not a Jupyter developer, so I don't want to propose a PR; I don't know what the desired implementation of a fix for this problem would be.

This issue is related to https://github.com/jupyterlab/jupyterlab/issues/11005

Carreau commented 2 years ago

There seem to be other recent issues with signature/validation/saving, and yes, I agree that we should 1) be stricter about the fields we accept, and 2) better document which fields are included in the signature.

My take is that we should have a clean() method somewhere that returns a cleaned copy of the notebook data structure that can be properly signed, instead of skipping individual fields. But that's my personal opinion.

That way signing stays simple, and clean() can raise if there are ever fields it does not know about.
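
A rough sketch of what such a clean() might look like; the allow-lists below (SIGNABLE_TOP_LEVEL, TRANSIENT_METADATA) are invented here, and a real implementation would derive them from the nbformat schema:

import copy

# Hypothetical allow-lists; the real sets would come from the nbformat schema.
SIGNABLE_TOP_LEVEL = {"cells", "metadata", "nbformat", "nbformat_minor"}
TRANSIENT_METADATA = {"orig_nbformat", "orig_nbformat_minor", "signature"}

def clean(nb):
    """Return a copy containing only fields that take part in the signature;
    raise on anything unexpected."""
    unknown = set(nb) - SIGNABLE_TOP_LEVEL
    if unknown:
        raise ValueError(f"refusing to sign unknown top-level fields: {sorted(unknown)}")
    cleaned = copy.deepcopy(nb)
    for key in TRANSIENT_METADATA:
        cleaned["metadata"].pop(key, None)
    return cleaned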

Pinging @echarles, as we were talking about other signature issues this morning and it might be of interest to him.

westurner commented 2 years ago

My take is that we should have a clean() method somewhere that returns a cleaned copy of the notebook data structure that can be properly signed, instead of skipping individual fields.

This sounds like a document canonicalization or normalization step. LD-proofs now specifies how to future-proof inlined JSON-LD document signatures.

From https://w3c-ccg.github.io/ld-proofs/#advanced-terminology :

Canonicalization algorithm: An algorithm that takes an input document that has more than one possible representation and always transforms it into a deterministic representation. For example, alphabetically sorting a list of items is a type of canonicalization. This process is sometimes also called normalization.

A complete example of a proof type is shown in the next example:

EXAMPLE 7

{
 "id": "https://w3id.org/security#Ed25519Signature2020",
 "type": "Ed25519VerificationKey2020",
 "canonicalizationAlgorithm":  "https://w3id.org/security#URDNA2015",
 "digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
 "signatureAlgorithm": "https://w3id.org/security#ed25519"
}

From https://json-ld.github.io/rdf-dataset-canonicalization/spec/#introduction :

When data scientists discuss canonicalization, they do so in the context of achieving a particular set of goals. Since the same information may sometimes be expressed in a variety of different ways, it often becomes necessary to be able to transform each of these different ways into a single, standard format. With a standard format, the differences between two different sets of data can be easily determined, a cryptographically-strong hash identifier can be generated for a particular set of data, and a particular set of data may be digitally-signed for later verification.

In particular, this specification is about normalizing RDF datasets, which are collections of graphs. Since a directed graph can express the same information in more than one way, it requires canonicalization to achieve the aforementioned goals and any others that may arise via serendipity.

Most RDF datasets can be normalized fairly quickly, in terms of algorithmic time complexity. However, those that contain nodes that do not have globally unique identifiers pose a greater challenge. Normalizing these datasets presents the graph isomorphism problem, a problem that is believed to be difficult to solve quickly. More formally, it is believed to be an NP-Intermediate problem, that is, neither known to be solvable in polynomial time nor NP-complete. Fortunately, existing real world data is rarely modeled in a way that manifests this problem and new data can be modeled to avoid it. In fact, software systems can detect a problematic dataset and may choose to assume it's an attempted denial of service attack, rather than a real input, and abort.

This document outlines an algorithm for generating a normalized RDF dataset given an RDF dataset as input. The algorithm is called the Universal RDF Dataset Canonicalization Algorithm 2015 or URDNA2015.

Maybe something like URDNA2015 plus custom nbformat-specific normalizations would be ideal. Or, at the least, nothing that would preclude later use of URDNA2015 (and JSON-LD, for notebook metadata and JSON-LD cell outputs)?
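
For comparison, a JSON-level illustration of the canonicalization idea; this is not URDNA2015 (which operates on RDF datasets and would need an RDF toolchain), it only shows what a deterministic serialization buys for hashing or signing:

import hashlib
import json

def canonical_digest(nb_dict):
    # Deterministic serialization: sorted keys and fixed separators, so the
    # same logical content always hashes to the same digest.
    canonical = json.dumps(nb_dict, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()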