jupyter / nbformat

Reference implementation of the Jupyter Notebook format
http://nbformat.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add unique ID to the notebook metadata #148

Open betatim opened 4 years ago

betatim commented 4 years ago

It would be nice to have a "practically unique across the universe identifier" in the notebook metadata. This would allow you to recognise a notebook based on this ID. Right now if Alice and Bob have a copy of the same notebook there is no way to know if they are the same or not. Even for Alice on her laptop and her desktop this is hard. If the notebook contained a unique ID it would be clear that (at some point) these were the same notebook.

I can think of three use cases:

I'd propose that the notebook format starts recommending that tools which create notebooks add a "unique_id" field to the notebook level metadata that contains a value like uuid.uuid4().hex. The value of this field should not be changed by reading and writing to the notebook.
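A minimal sketch of what this could look like, assuming the field is named `unique_id` as proposed above and sits in the notebook-level `metadata`. The helper name `ensure_unique_id` is hypothetical and not part of nbformat; the notebook is treated as plain JSON using only the standard library.

```python
import json
import uuid


def ensure_unique_id(notebook: dict) -> str:
    """Return the notebook's unique_id, creating one only if it is missing."""
    metadata = notebook.setdefault("metadata", {})
    # An existing ID is never overwritten, so reading and writing preserve it.
    if "unique_id" not in metadata:
        metadata["unique_id"] = uuid.uuid4().hex
    return metadata["unique_id"]


# A tool creating a new notebook would assign the ID once.
nb = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}
first = ensure_unique_id(nb)

# Round-tripping through JSON (a stand-in for save and reload) keeps the ID.
reloaded = json.loads(json.dumps(nb))
assert ensure_unique_id(reloaded) == first
```

Because the helper only fills in a missing value, tools that do not know about the field would still need to pass unknown metadata through untouched for the ID to survive.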


I am new to this repo so please close this and link to an existing issue/PR if there is one. I searched for "unique" and didn't find anything.

westurner commented 4 years ago

For W3C Web Annotations (JSONLD RDF), is it necessary to associate threaded comments and highlights with a URI subject? A UUID URN could be the canonical identifier for Web Annotations. [1][2]

A UUID can be a URI when it has the urn:uuid URN namespace prefix:

From https://en.wikipedia.org/wiki/Universally_unique_identifier#Format :

RFC 4122 defines a Uniform Resource Name (URN) namespace for UUIDs. A UUID presented as a URN appears as follows:[2]

urn:uuid:123e4567-e89b-12d3-a456-426655440000
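For illustration, Python's standard `uuid` module can produce and parse this URN form directly. The snippet below is only a sketch using the example value quoted above.

```python
import uuid

# Parse the example UUID quoted above and render it as a URN.
example = uuid.UUID("123e4567-e89b-12d3-a456-426655440000")
print(example.urn)  # urn:uuid:123e4567-e89b-12d3-a456-426655440000

# The URN form parses back to the same UUID.
assert uuid.UUID(example.urn) == example
```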

[1] https://github.com/w3c/wpub/issues/56#issuecomment-325512520 (links to [2])
[2] https://www.w3.org/TR/annotation-model/#model-14
[3] https://schema.org/identifier
[4] https://schema.org/url

When would the UUID need to be changed?

What sort of UI does this need?

MSeal commented 4 years ago

I'm a little late to the thread but I agree having a uuid in the notebook format would be hugely beneficial.

It might be required to only change the ID when the location or name of the notebook changes to be compatible with existing assumptions made in applications. This basically follows what is proposed here, by implying that copying a notebook should impose an id update. But edits to an existing notebook would not.

The hard part here is how to treat non-application copies. If a user does cp Notebook1.ipynb Notebook2.ipynb the ids would be the same to a notebook server. In this case I'm not sure what the business rules should be for a server dealing with duplicate ids from multiple notebooks at runtime.

betatim commented 4 years ago

My suggestion of adding a UUID comes from looking at the PDF format (originally motivated by wanting to understand how hypothesis does its magic).

https://www.seanh.cc/2017/11/22/pdf-fingerprinting/ is a good and readable post on some of the basics and how to extract fingerprints from PDFs.

For PDFs the idea is that copying, renaming and so on does not change the ID. I think that makes sense as the reason to have the ID is to be able to identify that two files are the same independent of the filename or URL. PDFs have a second ID which starts the same as the first but is updated when the content changes.

I think cp Untitled99.ipynb foobuzz.ipynb shouldn't change the ID as that is making a copy, not modifying the content. What would a notebook server do with the ID or why would it need to handle the conflict? I am imagining that it could show the user a message like "you have a copy of this open in another tab as well" or some information about the heritage of the notebook.

I think we could do worse than to copy what PDF does (use two IDs). Their spec is in section 14.4. of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf

MSeal commented 4 years ago

That's a fairly compelling argument for ID management, with a similar use case and well-established precedent. It doesn't solve all problems, like users copying from a common starter notebook for all their patterns, but it would give a lot more information and insight into notebooks in a way that handles file copies and edits a little better. I'm for this pattern -- do you think we should formulate a JEP for the idea?

blois commented 4 years ago

Google Colab includes a 'provenance' section in the colab-specific notebook metadata, this is an array of what we consider IDs indicating where the file came from:

  "metadata": {
    "colab": {
      "provenance": [
        {
          "file_id": "1Rgt3Q7hVgp4Dj8Q7ARp7G8lRC-0k8TgF",
          "timestamp": 1560453945720
        },
        {
          "file_id": "https://gist.github.com/blois/057009f08ff1b4d6b7142a511a04dad1#file-post_run_cell-ipynb",
          "timestamp": 1560453945720
        }
      ],

Every time the file is cloned from within Colab we push a new entry into that list indicating where the file was cloned from.

The file_id field is what our service considers the canonical path to the notebook- the github path for github-based notebooks or Google Drive's file ID for Drive based notebooks.
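As a rough illustration of the cloning behaviour described above (this is not Colab's actual implementation), the sketch below appends an entry to the `provenance` list when a notebook is cloned. The function name `clone_with_provenance` is hypothetical, and the millisecond timestamp and append-at-the-end ordering are assumptions based on the excerpt.

```python
import copy
import time


def clone_with_provenance(notebook: dict, source_file_id: str) -> dict:
    """Return a copy of the notebook that records where it was cloned from."""
    clone = copy.deepcopy(notebook)
    colab = clone.setdefault("metadata", {}).setdefault("colab", {})
    # Assumption: new entries go at the end, timestamps are in milliseconds.
    colab.setdefault("provenance", []).append(
        {"file_id": source_file_id, "timestamp": int(time.time() * 1000)}
    )
    return clone


original = {"cells": [], "metadata": {}}
cloned = clone_with_provenance(original, "1Rgt3Q7hVgp4Dj8Q7ARp7G8lRC-0k8TgF")
print(len(cloned["metadata"]["colab"]["provenance"]))  # 1
```

The original notebook is left unmodified, so only the clone carries the new provenance entry.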

We don't make heavy use of this data because:

if Alice and Bob have a copy of the same notebook there is no way to know if they are the same or not.

and

show the same comments on all copies

If they are copies then it does not seem that they are the same notebook. It would be unexpected for comments on Alice's copy to be shown to Bob.

For persistence in browser storage- is there a canonical URL for the notebook that BinderHub could use?

westurner commented 4 years ago

There's a W3C spec for data like this: (1) where the inputs came from; (2) where the outputs come from; and (3) "who is making said claims with which cryptographic signature" requires additional specs like Linked Data Signatures, W3C Verifiable Claims (and Decentralized Identifiers). It should actually be really easy to instead specify this data (in an inter-tool-compatible way) with W3C PROV in JSON-LD.

https://www.w3.org/TR/prov-overview/

https://www.w3.org/TR/prov-primer/#introduction

https://en.wikipedia.org/wiki/PROV_(Provenance)

The PROV standard defines a data model, serializations, and definitions to support the interchange of provenance information on the Web.[1] Here provenance includes all "information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness".

https://www.w3.org/TR/prov-primer/#derivation-and-revision-1

@prefix exg: <http://example.org/doc1#> .
@prefix prov: <http://www.w3.org/ns/prov#> .

exg:dataset2 a prov:Entity ;
    prov:wasRevisionOf exg:dataset1 .

There aren't JSON-LD examples in the PROV primer, but you can convert them from Turtle (N3) to JSON-LD through an online RDF translator (or rdfpipe from e.g. rdflib).

{
  "@context": {
    "exg": "http://example.org/doc1#",
    "prov": "http://www.w3.org/ns/prov#"
  },
  "@id": "exg:dataset2",
  "@type": "prov:Entity",
  "prov:wasRevisionOf": {
    "@id": "exg:dataset1"
  }
}

When/where the document was copied, revised, and executed would be useful information to share in a tool-independent way.

westurner commented 4 years ago

For persistence in browser storage- is there a canonical URL for the notebook that BinderHub could use?

Possible identifiers and resolvable URLs for a notebook:

  1. The repo URL (https://schema.org/url): https://github.com/org/repo/blob/master/notebooks/example1.ipynb https://github.com/org/repo/tree/master/notebooks/example1.ipynb

  1. The origin URL of the actual raw ipynb (https://schema.org/url): https://raw.githubusercontent.com/org/repo/master/notebooks/example1.ipynb
  2. The versioned URL of the ipynb (https://schema.org/url) containing a revid: https://raw.githubusercontent.com/org/repo/abc12345/notebooks/example1.ipynb
  3. A uuid4 UUID URN URI (https://schema.org/identifier): urn:uuid:123e4567-e89b-12d3-a456-426655440000
  4. A UUID and a hash of the notebook inputs and/or outputs: https://ns.jupyter.org/urn:uuid:123e4567-e89b-12d3-a456-426655440000/sha256hashedcba4321
  5. A namespaced combination of (URL, revid, UUID, hash / signed hash)

  1. An nbviewer URL (with/without a revision identifier): https://nbviewer.jupyter.org/github/jrjohansson/qutip-lectures/blob/master/Lecture-0-Introduction-to-QuTiP.ipynb
  2. A binderhub URL (with/without a revision identifier): https://mybinder.org/v2/gh/jrjohansson/qutip-lectures/master?filepath=Lecture-0-Introduction-to-QuTiP.ipynb
  3. A binderhub instance URL (with/without a revision identifier): https://notebooks.gesis.org/binder/jupyter/user/jrjohansson-qutip-lectures-jyimm9cg/notebooks/Lecture-0-Introduction-to-QuTiP.ipynb

betatim commented 4 years ago

I'm for this pattern -- do you think we should formulate a JEP for the idea?

I'd be up for this.

canonical/resolvable URL

What is the difference between a canonical/resolvable URL (at which you can't actually download the notebook) and a unique identifier (that is a random number)? Is there an advantage to using one or the other? Number 4 or 5 from the list above seem good. I think anything that gives away where a notebook came from (GitHub repo, local file path, etc.) has the potential for a privacy disaster.

what are the semantics of "copy" and "use as template" and such

I think a copy of a file made with a tool like cp, mv or a file explorer GUI is a "copy" in the sense that you would want to see the comments. For example I email you a notebook but you save it to a different path. It is still the same notebook and we'd want to see the same comments on it. A comment tool could also implement private vs public vs group comments (as hypothesis does).

I think I'd start a JEP with proposing to include two identifiers in the metadata. Both are chosen "somehow" when the notebook is first created. The first identifier never changes, the second gets changed "somehow" when a user edits or otherwise "meaningfully changes" the notebook. This lets us tell if two notebooks are copies of each other, just located in different parts of the galaxy, if one notebook was somehow derived from another one (shared first identifier, different second one) or if they are completely unrelated. It also means that we don't have to scrub notebooks before people share them (you don't learn very much from getting my notebook and looking at the identifiers).
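To make the comparison rules concrete, here is a small sketch of the two-identifier scheme. The field names `permanent_id` and `revision_id` and all function names are hypothetical. Starting the second identifier equal to the first is an assumption borrowed from the PDF behaviour mentioned earlier, and what counts as a "meaningful change" is left to the caller.

```python
import uuid


def new_ids() -> dict:
    """Create both identifiers for a brand-new notebook."""
    first = uuid.uuid4().hex
    # Assumption: the second ID starts equal to the first, as described for PDF.
    return {"permanent_id": first, "revision_id": first}


def mark_edited(ids: dict) -> dict:
    """Return new IDs after a meaningful change: only revision_id changes."""
    return {"permanent_id": ids["permanent_id"], "revision_id": uuid.uuid4().hex}


def relationship(a: dict, b: dict) -> str:
    """Classify two notebooks by comparing their identifiers."""
    if a["permanent_id"] != b["permanent_id"]:
        return "unrelated"
    if a["revision_id"] == b["revision_id"]:
        return "copy"
    return "derived"


alice = new_ids()
bob = dict(alice)           # e.g. cp or an emailed file: IDs travel unchanged
carol = mark_edited(alice)  # Carol edited her copy

print(relationship(alice, bob))        # copy
print(relationship(alice, carol))      # derived
print(relationship(alice, new_ids()))  # unrelated
```

Since both values are random, neither reveals a file path or repository, which fits the privacy concern raised above.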

judell commented 4 years ago

Hi Wes,

On Twitter you asked about a durable ID to associate annotations to a notebook page. Here's an example of what we recommend:

<meta name="dc.identifier" content="blog-article/e3d858b3">
<meta name="dc.relation.ispartof" content="elifesciences.org">

Together they form this URL-independent identifier:

urn:x-dc:elifesciences.org/blog-article/e3d858b3

https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/#dublin-core-metadata

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/6

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/7

meeseeksmachine commented 4 years ago

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/annotating-jupyter-notebooks/2079/12

krassowski commented 2 years ago

Since this is about tracking notebooks, I wanted to link to recent work on the File ID service for jupyter-server: https://github.com/jupyter-server/jupyter_server/issues/940, https://github.com/jupyter-server/jupyter_server_fileid, https://github.com/jupyterlab/jupyterlab/issues/12614. It seems that the motivation for the File ID service and for this discussion was similar (enabling comment tracking). CC @ellisonbg @dlqqq to reconcile the discussions, in case you have not seen this one in a while.