jupyterlab / jupyterlab-metadata-service

Linked data exploration in JupyterLab.
BSD 3-Clause "New" or "Revised" License

Metadata service, client, UI overview #4

Closed ellisonbg closed 5 years ago

ellisonbg commented 5 years ago

This is an issue that provides an overview of the proposed metadata service, client, and UI being developed in this repo.

Background

Entities in the JupyterLab universe (notebooks, text files, datasets, models, visualizations, etc.) often have rich context and metadata associated with them. Examples include:

This rich context is incredibly useful to groups of people working with code and data. The goal of this work is to build a metadata architecture that enables Jupyter users to collaboratively create and explore metadata for any entity in the Jupyter universe.

What metadata standard

We have considered a number of existing metadata standards, and the one that is emerging as a top candidate is that of https://schema.org/. It appears to be rich enough to describe the different types of metadata we encounter in the Jupyter universe. In talking with potential users of this system, that flexibility seems to be important.

See https://github.com/jupyterlab/jupyterlab/issues/5733 for additional discussion about metadata schema.

Implementation

The current proposal is to create a Jupyter notebook server extension that is a GraphQL service for the relevant subset of the schema.org metadata schema. We haven't worked through what subset of the schema is relevant for Jupyter, but it will probably be the document- and data-related parts (we probably won't start with things like https://schema.org/FlightReservation).

The usage of GraphQL is important because we imagine a wide range of complex UIs being built to display and edit this highly nested metadata. Being able to get back and edit rich data in single queries will be really helpful on the frontend.
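
Purely as illustration (the shape of the data, not the implementation language, is the point here), a sliver of a schema.org-flavored SDL and one nested query might look like the sketch below; every type and field name in it is hypothetical, not a settled schema.

import gql from "graphql-tag";

// Hypothetical subset of schema.org types, expressed as GraphQL SDL.
const typeDefs = gql`
  type Organization {
    name: String
  }

  type Person {
    name: String
    affiliation: Organization
  }

  type Dataset {
    name: String
    description: String
    author: Person
  }

  type Query {
    dataset(id: ID!): Dataset
  }
`;

// One nested query fetches the dataset, its author, and the author's
// affiliation in a single round trip -- the "rich data in single
// queries" property described above.
const DATASET_QUERY = gql`
  query {
    dataset(id: "example") {
      name
      author {
        name
        affiliation {
          name
        }
      }
    }
  }
`;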

We haven't decided if this notebook server extension will be written in Python or node.js (or both), but it shouldn't really matter.

For a client, we are imagining a TypeScript based library that provides a thin, well-typed API for talking to the service.

The notebook server and TypeScript client library should be entirely independent of JupyterLab and useful outside of it.

Finally, we plan on creating a JupyterLab extension that offers a user experience for editing and viewing metadata for entities in JupyterLab. Initial work will focus on notebooks, datasets, and text documents.

Initially, this repo will contain our explorations of the notebook server, TypeScript client, and JupyterLab UI extension, but these may be separated out over time.

ellisonbg commented 5 years ago

Another big question that is coming up is how to treat code snippets. While I list them above as metadata, we may want to have a separate code snippet extension in JupyterLab.

saulshanabrook commented 5 years ago

Some notes from our meeting:

The primary initial use case is one JupyterLab server that has multiple simultaneous users. One person opens a CSV file and adds some metadata about it, such as its author, and then another user who has the file open sees this field when looking at the metadata for that file.

We discussed using files instead of GraphQL to store this, i.e. one file per metadata object, but then talked about how we couldn't have two users edit the same object at the same time without one edit getting clobbered.

Resources:

rgbkrk commented 5 years ago

Let me just say I'm really happy about the direction with GraphQL in Jupyter. As far as tooling is concerned, I'd prefer to use the really well supported node.js backend tooling for GraphQL. It also helps that you can share types from backend to frontend. This likely means having to use something like nbserverproxy though.

saulshanabrook commented 5 years ago

I'd prefer to use the really well supported node.js backend tooling for GraphQL

Me too. Do you have any suggestions for key packages that would be helpful to look at? Subscriptions seem pretty useful, and I was looking at this: https://github.com/prisma/prisma-examples/tree/master/typescript/graphql-subscriptions

EDIT: looks like the recommendation is to move to Apollo Server 2 instead of using yoga https://github.com/prisma/graphql-yoga/issues/449#issuecomment-430540661

rgbkrk commented 5 years ago

Yeah, I'd recommend Apollo Server.

@captainsafia and I have been working on a new server that provides a GraphQL API for managing communication between a Jupyter kernel and clients.

captainsafia commented 5 years ago

You can find the code for this work at https://github.com/nteract/nteract/tree/master/packages/kernel-relay.

It's designed to provide more interaction-based endpoints as opposed to resource-based endpoints in REST APIs. For example, I want to launch a kernel, I want to subscribe to the status of a kernel, I want to execute this code snippet, etc.

bollwyvl commented 5 years ago

Yeah, I'd recommend Apollo...

:100: to Apollo's front-end stack.

...Server.

+0 in hub deployments, as we've already got configurable-http-proxy there.

:-1: to node.js on the single-user server. I'll toss out this prototype based on graphene which, while having some growth challenges, in 2019 is still a lot more supportable on end-user machines than node.js. Also, if we get too far down the "reference implementations of frontend and backend can share code" path, there's a pretty good chance there will never be another implementation of either.

Stepping back from either implementation, the goal would be to have a canonical:

  1. serialization schema a la nbformat (GraphQL doesn't dictate one, but it turns out you gotta have something)
  2. JSON-LD context
  3. GraphQL schema (GraphQL SDL)

...to which "a jupyter metadata server" must conform, akin to nbformat, but with only an implemented conformance test suite.

Given that an (opinionated) GraphQL schema can be derived from a JSON-LD context, I think this gives us the most robustness. There isn't a lot of precedent (or support) for combining un-coordinated GraphQL type libraries at present, and I'd really hate it if this feature set regressed on the hackability of our other tools: if a community wants to add their microscope metadata schema to something and search by it, I don't want the answer to be fork, because it doesn't "fit" in "schema.org" or "Jupyter's GraphQL thing".

saulshanabrook commented 5 years ago

👎 to node.js on the single-user server. I'll toss out this prototype based on graphene which, while having some growth challenges, in 2019 is still a lot more supportable on end-user machines than node.js.

One thing to keep in mind is that I believe we would need GraphQL subscriptions support. In my last chat with Brian, one of the big drivers of using GraphQL is the ability for two users to edit the same metadata at the same time and have the changes mirrored between them. Subscriptions would, I believe, allow each client to get notified with the updated data when another client makes a change (I am not a GraphQL expert; someone correct me if you don't need subscriptions for this use case).

They aren't included in Graphene at the moment (https://github.com/graphql-python/graphene/issues/430#issuecomment-450015544, https://github.com/graphql-python/graphene/issues/393), but a few people have added custom support to use Django's channels with graphene: https://github.com/eamigo86/graphene-django-subscriptions, https://github.com/datadvance/DjangoChannelsGraphqlWs
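
Since Graphene lacks them for now, here is what the mirroring flow might look like in Apollo Server terms; this is a minimal sketch with hypothetical type and field names, not a proposed schema.

import { gql, PubSub } from "apollo-server";

const pubsub = new PubSub();

// Clients subscribe to metadataChanged; whenever any client runs the
// mutation, every subscriber is pushed the updated data.
const typeDefs = gql`
  type Metadata {
    id: ID!
    body: String
  }
  type Query {
    metadata(id: ID!): Metadata
  }
  type Mutation {
    updateMetadata(id: ID!, body: String!): Metadata
  }
  type Subscription {
    metadataChanged: Metadata
  }
`;

const resolvers = {
  Query: {
    metadata: () => null, // lookup omitted in this sketch
  },
  Mutation: {
    updateMetadata: (_: unknown, args: { id: string; body: string }) => {
      // ...persist the change somewhere, then notify all subscribers.
      pubsub.publish("METADATA_CHANGED", { metadataChanged: args });
      return args;
    },
  },
  Subscription: {
    metadataChanged: {
      subscribe: () => pubsub.asyncIterator("METADATA_CHANGED"),
    },
  },
};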

json-ld context

TBH I am not familiar with what a JSON-LD context really is. I will have to investigate that. I get scared whenever I see RDF/SPARQL because it's a whole other world that I don't know much about. Do we keep a bunch of RDF files on disk? Run a SPARQL server?

AFAIK the initial work on the metadata service isn't about using GraphQL or schema.org for notebooks themselves. Instead, my understanding is that it's more like a separate context microservice. You might have comments on a notebook, and those would be stored persistently with some reference to the notebook, like its file path and the cell ID of the comment.

ellisonbg commented 5 years ago

Thanks everyone. One of the benefits of going with GraphQL is that in the long term, the details of the server implementation, persistence mechanism, python/node.js, etc. are less important than the protocol, GraphQL schema, etc.

I would love to be able to use Python for this, but I think that today it makes sense to use the approach that enables us to explore and iterate quickly on the schema and queries, with a solid GraphQL implementation that won't get in our way. To me that suggests starting with Apollo. I do think that JSON-LD will be important in the long term, but I don't think we need to tackle that starting out.

bollwyvl commented 5 years ago

isn't about using GraphQL or schema.org for notebooks themselves

Sure, but we have a lot of shipped stuff that looks like complex, potentially evented graph data models! Also, if we're talking about putting any of this into nbformat itself, it's worth thinking about how these things might play together.

need graphql subscriptions

GraphQL isn't going to solve the conflict-resolution problem, but it's great that it has a spec for delivering changes and multiple implementations. On that note, I added subscriptions to contents on that prototype, based on this PR:

jsonld will be important in the long term, but I don't think we need to tackle that starting out.

Like it or lump it, if we're buying schema.org and web annotation, we're getting JSON-LD.

JSON-LD context really is

Those two contexts are how an SEO consortium and a standards body, respectively, see parts of the world and name things, and do have some incompatibilities (17, but some are spurious).

A key, relevant distinction is that both define Dataset: schema.org defines its own, while web annotation uses Dublin Core. ANYHOO... it doesn't really matter, as we will end up needing a Jupyter one at some point, but it's worth being on the lookout for things that the data can do for us.

Do we keep a bunch of RDF files on disk? Run a SPARQL server?

Like any other JSON, you can treat JSON-LD as a serialization format, (de-)normalize it, and then store it however you want, or accept its graph nature (and maybe also normalize it, unless your store speaks JSON-LD directly).
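
As a concrete sketch of the "treat it as a serialization format" route, the jsonld package on npm can expand, compact, and frame documents; the document and frame below are illustrative.

import * as jsonld from "jsonld";

// An illustrative schema.org document.
const doc = {
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: "iris",
  author: { "@type": "Person", name: "R. A. Fisher" },
};

async function demo() {
  // Expand to the canonical, context-free form (full IRIs everywhere)...
  const expanded = await jsonld.expand(doc);

  // ...or frame it back into whatever shape a UI or store expects --
  // here, pull out just the Person nodes.
  const people = await jsonld.frame(doc, {
    "@context": "https://schema.org",
    "@type": "Person",
  });

  console.log(expanded, people);
}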

Taking the former route, flat files first is great (see FileContentsManager)! However, once you have multiple writers, it's hard to ignore Postgres (BSD-like) with LISTEN for events and the ability to store and query JSON. couchdb (APL) was super fun back in the day, and would probably still work great.

Taking the latter, the most cross-platform graph database I've used is virtuoso (GPL). RedisGraph (APL+CrazyClause), edgedb (APL), and gundb (APL) are new and fast and cool. All of them let you do things that can be pretty tricky with an RDBMS, but no two of those suggestions use the same query language natively. But yeah, the major languages with multiple implementations for arbitrary, potentially circular graph traversal are probably SPARQL and Gremlin.

xmnlab commented 5 years ago

The notebook server and TypeScript client library should be entirely independent of JupyterLab and useful outside of it.

so should it create a separate server for metadata, and should each user specify the metadata server manually (like in a settings page)?

Dessix commented 5 years ago

In response to @saulshanabrook regarding subscriptions not being available in Graphene at the moment, I have had some success using GraphQL_WS in conjunction with AIOHTTP and Graphene for subscriptions.

The support from that project seems seriously finicky in some areas, with the Flask implementation seemingly leaking all sessions, but the AIOHTTP variant seems to work acceptably in my testing.

bollwyvl commented 5 years ago

In response to @saulshanabrook regarding subscriptions not being available in Graphene at the moment, I have had some success using GraphQL_WS in conjunction with AIOHTTP and Graphene for subscriptions.

Right, my finding was that GraphQL_ws is pretty much good to go. Thanks for the corroboration!

The support from that project seems seriously finicky in some areas, with the Flask implementation seemingly leaking all sessions, but the AIOHTTP variant seems to work acceptably in my testing.

Luckily, tornado's model almost exactly matches other async models, without any of the gevent black (green?) magic.

While I'll agree the GraphQL implementation doesn't really matter, I'm just going to highlight the challenges end users (science, business, education) have been having in even installing a sane environment for JupyterLab's hidden node dependency, much less running its sole function (webpack). If a node-based deployment can be gotten down to a yarn-like single file, such as the one bundled with lab, never even touches npm, and is installable with pip, it could be viable for the average user.

So back to my original point: starting with a versioned...

...will concentrate the discussion on the types, and not on language/vendor-specifics. If the reference implementation (and test suite) is node-based, so be it!

saulshanabrook commented 5 years ago

so should it create a separate server for metadata, and should each user specify the metadata server manually (like in a settings page)?

@xmnlab Yeah I think that would be a fine way to do it.

As a default, it might make sense to proxy a metadata server through the notebook server using something like https://github.com/jupyterhub/jupyter-server-proxy and start that with the notebook.

I see the ideal default workflow being:

  1. User pip installs some python package for the server
  2. User installs the jupyterlab extensions
  3. User starts up jupyterlab, which starts the metadata server behind the proxy
  4. User opens jupyterlab; the metadata client connects to the default metadata server behind the proxy

I think starting out with the metadata server running separately and having the user input the address in the client makes sense. Then we could always add the proxy later and add a default.

saulshanabrook commented 5 years ago

@xmnlab It looks like with Apollo Server you define your mutations and then provide a custom JavaScript function as each one's resolver. So at first, it seems like we could just store everything in memory if that's easiest in the server (https://blog.apollographql.com/react-graphql-tutorial-mutations-764d7ec23c15#f370)
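
A minimal sketch of that in-memory approach (hypothetical schema; the resolvers just read and write a JavaScript Map, so nothing survives a restart -- fine for a first prototype):

import { ApolloServer, gql } from "apollo-server";

const typeDefs = gql`
  type Metadata {
    id: ID!
    body: String
  }
  type Query {
    metadata(id: ID!): Metadata
  }
  type Mutation {
    setMetadata(id: ID!, body: String!): Metadata
  }
`;

// "Store everything in memory": a plain Map keyed by id.
const store = new Map<string, string>();

const resolvers = {
  Query: {
    metadata: (_: unknown, { id }: { id: string }) =>
      store.has(id) ? { id, body: store.get(id) } : null,
  },
  Mutation: {
    setMetadata: (_: unknown, { id, body }: { id: string; body: string }) => {
      store.set(id, body);
      return { id, body };
    },
  },
};

new ApolloServer({ typeDefs, resolvers })
  .listen()
  .then(({ url }) => console.log(`metadata prototype at ${url}`));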

xmnlab commented 5 years ago

@saulshanabrook very nice! I think in-memory should work for now! I will take a look into that, thanks!

rgbkrk commented 5 years ago

Yeah in-memory is great as you can focus on the API you're exposing via the schema for the types, queries, mutations, and subscriptions.

xmnlab commented 5 years ago

I am trying to connect the Apollo GraphQL server with the JupyterLab extension using jupyterlab-server-proxy ... but something is not clear to me. The example used in its documentation assumes that the user already has the server installed. In our case we need to install the Apollo GraphQL server (on node.js). What is the recommended approach?

  1. should the script for jupyterlab-server-proxy run yarn install before running the server?
  2. should the Apollo GraphQL server be started manually by the user and then linked inside JupyterLab by jupyterlab-server-proxy (in some way)?

bollwyvl commented 5 years ago

Let's not have any more runtime npm installation hijinks, please!

As a user (and admin) i'd like to be able to "pip" or "conda" install "jupyter-metadata-server", get on an airplane, and start the thing, do some annotating or local dataset browsing.

As mentioned earlier, the portable, single-file yarn redistributed with jlab is a great approach, and possibly the only sane one. This has the nice side effect of ferreting out huge, hidden, barely-managed binary dependencies (I'm looking at you, puppeteer).

I think once that happens, the server extension I pip installed above would take care of depending on and setting up the proxy, as well as starting the server.

Once you add one or more federated endpoints, it should just be a single jupyter_notebook_config.json change, e.g.

"MetadataManager": { "metadata_providers": { "": {enabled: true}, // default, can be disabled here "https://hub.example.com/graphql": {enabled: true} } }

Of course, if it's one and done, local OR a single remote, it can be easier, but I'd still like us to try to support the n-providers case, as it's one of the things GraphQL can do very well.

saulshanabrook commented 5 years ago

In the meeting today, Brian was articulating a metadata explorer UI that exposes the links between objects. For example, if you have a dataset, you should be able to click on its author and be taken to a person page that describes that author and also shows all the datasets that list them as an author.

So we need to implement a generalized linked-data display and edit mechanism. This is pretty huge: it's like a generic ORM that is flexible enough to handle many different types of linked data. Does anyone have examples of existing tools that do this? With schema.org types of data? My experience with this sort of thing has been with the auto-generated admin in the Django web framework. You specify your models and it will create custom edit pages that work with the relational data.

In that framework, it isn't all auto generated though. You have to do a lot of manual work to tell the admin the best way to show the fields.

Of course, for an MVP we could hard code in some set of fields and relations and hard code in the UI and all the proper editing capability. But it seems like in the end we need to support all types of relationships and types in schema.org.

bollwyvl commented 5 years ago

One approach would be very JSON-forward, using something like react-jsonschema-form and leveraging the JSON-LD-conforming nature of schema.org data.
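
As a sketch of what that could look like with react-jsonschema-form's v1 API (the schema below is an illustrative sliver of schema.org's Dataset, not anything settled):

import * as React from "react";
import Form from "react-jsonschema-form";
import { JSONSchema6 } from "json-schema";

// Illustrative JSON Schema for a few Dataset properties.
const schema: JSONSchema6 = {
  title: "Dataset",
  type: "object",
  properties: {
    name: { type: "string" },
    description: { type: "string" },
    url: { type: "string", format: "uri" },
  },
};

// rjsf derives the whole form UI from the schema.
export const MetadataForm = () => (
  <Form
    schema={schema}
    onSubmit={({ formData }) => console.log("edited metadata:", formData)}
  />
);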

The rough order of business in the browser would be:

The nice thing about this approach is it would scale to vocabularies other than schema.org: just stuff in some more frames and schemas. Adding a new extension (perhaps a more limited form of mimeExtension) would make it easy to extend.

The shortcoming of this approach is that you can end up with the "Jenkins" problem, where you have pages and pages of empty form fields. But I think if we're mostly interested in displaying metadata, not creating it, then discoverability of what isn't yet defined isn't as important. Also, because all the data models involved are highly self-describing, there's plenty of stuff to send to the inspector for live documentation, etc.

The other tack to take would be to repurpose the data explorer itself to do pivot-table style metadata. For a given type, we are saying there are a knowable number of properties it might have, the target of each of which will have one (or more) types. Probably useful to consider in its own right!

bollwyvl commented 5 years ago

@ian-r-rose had started a demo of using rjsf over on #5892:

(screenshot of the rjsf demo)

Note there are some current limitations that conflict with some stylistic choices made in our schema implementations, e.g. type: ["integer", "null"], but these can be changed/transformed to work with some hopefully-soon-to-be-released features, e.g. oneOf. This also gives you a place to put more docs, or to achieve more reuse. But generally, it already does the thing.

Adopting it and building some momentum behind modelling schema that's not only rigorous but also captures the user value of the data at hand seems very powerful... as an extension author, if I can delegate some complex but predictable UI over to a core feature, i'd do it in a heartbeat vs writing a bunch of form stuff myself.

saulshanabrook commented 5 years ago

One approach would be very JSON-forward, using something like react-jsonschema-form

How do schema.org and JSON schema compare in expressiveness? Is one a superset of the other?

schema.org Data Model:

  1. We have a set of types, arranged in a multiple inheritance hierarchy where each type may be a sub-class of multiple types.
  2. We have a set of properties:
    1. each property may have one or more types as its domains. The property may be used for instances of any of these types.
    2. each property may have one or more types as its ranges. The value(s) of the property should be instances of at least one of these types.

This is most clear to me in their JSON LD representation. It also helped me to reference the "meta" schema.
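
For example, schema.org's vocabulary describes the author property roughly like this (abridged and lightly paraphrased from the JSON-LD vocabulary file):

// Note the open-ended "Includes" naming: domains and ranges suggest
// where a property is used and what it points at, rather than acting
// as strict constraints.
const authorProperty = {
  "@id": "http://schema.org/author",
  "@type": "rdf:Property",
  "rdfs:label": "author",
  "http://schema.org/domainIncludes": [
    { "@id": "http://schema.org/CreativeWork" },
  ],
  "http://schema.org/rangeIncludes": [
    { "@id": "http://schema.org/Person" },
    { "@id": "http://schema.org/Organization" },
  ],
};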

Maybe the better question is: "How should JSON LD and JSON Schema relate to one another?" Because it seems that JSON LD is able to represent schema.org well, but also other schemas. Here is a link to a discussion about JSON LD and JSON Schema, not sure about the conclusion: https://github.com/json-schema-org/json-schema-spec/issues/309

bollwyvl commented 5 years ago

The two are orthogonal. They share the property that they are both expressed, unsurprisingly, in JSON. Both can be applied to JSON documents that are unaware of the schema/context. Both have implementations in many languages, but given there are multiple versions of each of them, there can be implementation differences.

A JSON Schema describes what a document must/can look like: e.g. the top-level document must specify a property "type", and the value must be "Dataset".

A JSON-LD context describes what a document's values actually mean: e.g. "type" means "http://www.w3.org/1999/02/22-rdf-syntax-ns#type", and "Dataset" means "https://schema.org/Dataset".

LD generally doesn't care about the actual structure unless explicitly told to: for example, type can be a list or a single value. It used to be that you couldn't have multiple meanings of the same term in the same document (e.g. blood type and MHC type in one medical record), but as of 1.1, if you do know something about the structure, you can handle that without changing the legacy schema.

Where the two intersect is framing: if done properly, you can take arbitrarily-shaped JSON, get its meaning out, and put it back into a shape that your UI/algorithm/database needs.
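
A tiny illustration of the split, with both applied to the same document (everything here is illustrative):

// The document itself, unaware of either system.
const doc = { type: "Dataset", name: "iris" };

// JSON Schema constrains the *shape*: "type" must be present and must
// equal "Dataset".
const schema = {
  type: "object",
  required: ["type"],
  properties: {
    type: { const: "Dataset" },
    name: { type: "string" },
  },
};

// A JSON-LD context assigns *meaning*: it says what "type" and "name"
// denote, without constraining the document's shape at all.
const context = {
  "@context": {
    type: "@type",
    name: "https://schema.org/name",
    Dataset: "https://schema.org/Dataset",
  },
};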

bollwyvl commented 5 years ago

Also found this:

https://github.com/vazco/uniforms

Which can infer forms from JSON and GraphQL schemas. No doubt it lacks support for some of the weirder edge cases that either schema system has, but it seems nicely put together.

bollwyvl commented 5 years ago

Well, this seems freakishly useful:

https://github.com/google/react-schemaorg

Based on:

https://github.com/google/schema-dts

You're not going to be putting any ANNO terms (or any other vocabularies) on there out of the box, and it's a bit odd that it's driven off the N-Triples representation rather than the JSON-LD one (not a fun parser to get right), but heck, they did the work, and it looks great!

Of course, one of the big ideas is that the physical data representation doesn't have to know about any of the "@"-this-and-that business, but clean, canonical stuff is great.
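
Usage looks roughly like the following (assuming react-schemaorg's JsonLd component and a schema-dts-generated Dataset type; the dataset itself is made up):

import * as React from "react";
import { JsonLd } from "react-schemaorg";
import { Dataset } from "schema-dts";

// Renders a <script type="application/ld+json"> tag whose payload is
// type-checked against the generated schema.org typings.
export const DatasetMeta = () => (
  <JsonLd<Dataset>
    item={{
      "@context": "https://schema.org",
      "@type": "Dataset",
      name: "iris",
      description: "Fisher's iris measurements",
    }}
  />
);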

bollwyvl commented 5 years ago

My bad: you can generate with whatever context you want:

https://github.com/google/schema-dts/blob/master/src/cli/args.ts

But it still only parses the vocabulary from s.d.o:

https://github.com/google/schema-dts/blob/master/src/triples/reader.ts

This suggests it might be possible to make a jupyter-dts from our context and from however we reconcile the differences between s.d.o and anno, but it might require a fork.

xmnlab commented 5 years ago

@bollwyvl

As a user (and admin) i'd like to be able to "pip" or "conda" install "jupyter-metadata-server", get on an airplane, and start the thing, do some annotating or local dataset browsing.

we are working on this WIP PR: https://github.com/jupyterlab/jupyterlab-metadata-service/pull/6

we added a jupyter-server-proxy lib that ships all js files and installs the graphql server into jupyterlab using pip.

"MetadataManager": { "metadata_providers": { "": {enabled: true}, // default, can be disabled here "https://hub.example.com/graphql": {enabled: true} } }

it seems interesting. do you have a suggestion (or reference) about how to change the jupyterlab config file? or should it be something like: just read the file ... add/update the data ... and write it back?

Eyas commented 5 years ago

@bollwyvl

My bad: you can generate with whatever context you want: https://github.com/google/schema-dts/blob/master/src/cli/args.ts But it still only parses the vocabulary from s.d.o:

I actually just submitted google/schema-dts#14 which was requested by a few people to support a totally custom vocabulary altogether, so this might help. It has a few limitations described in the PR (takes a single URL rather than a set of layers, expects schema.org DataTypes, and (probably the most limiting) still expects a "Thing" type to be defined.) Depending on use cases people envision for this, however, I'm happy to accept a PR or just a specific feature request.

bollwyvl commented 5 years ago

Lots of action here!

@eyas That's very cool stuff! I hope that PR makes it through!

A brief introduction: we want to use schema.org @types inside JupyterLab and in a companion to the Jupyter server, both of which are implemented in TypeScript.

We're also likely going to need some novel @types, like Kernel, Notebook... but maybe also NotebookCell and NotebookCellOutput. There are other discussions (#7) about getting some of our stuff into schema.org proper, but we can reasonably say that we "control" them, and could make them schema.org Things.

We also need the rigor of the W3C Annotation Vocabulary to provide rich commenting (all the selectors, etc.)... and we'll probably need to extend that, too. This will be trickier, as it's OWL-based.

Tying these together, we have a few known-@type-driven UIs (card-based annotation), and we also need freeform UI (probably narrative-heavy forms per @type, and more compact tables/trees for lists of compatibly @typed things).

expects schema.org DataTypes, and (probably the most limiting) still expects a "Thing" type to be defined.

I'll think a bit about those limitations, as I guess we'd need to think about it some...

Eyas commented 5 years ago

Given this, one improvement that might work for your use case is extending the CLI to request .nt files from two URLs, layering the triples of each on top of one another.

That actually has general use cases in the Schema.org realm, where you might want to use the "basic" schema.org definitions along with the life-sciences extension only. Right now the CLI allows you to either pick the pre-flattened all-layers file, or the basic file, but not pick individual layers.

bollwyvl commented 5 years ago

to request .nt files from two URLs, layering the triples of each on top of one another.

Right, this sounds about like what i was imagining. We'd probably want a configurable wellKnown.ts that could be aware of both s.d.o semantics and owl and prov (and whatever else anno needs).

I guess I do have to ask the question of why parse triples instead of doing JSON-LD directly? It seems like

Further, we'd want everything checked in and static (or at least submoduled). Could the CLI support a config file?

Ours would end up being something like:

{
  "reader": [
    {
      "url": "schema:version/3.4/all-layers.jsonld"
      # some more stuff
    },
    {
      "url": "https://dvcs.w3.org/hg/prov/raw-file/tip/ontology/prov-o.ttl"
    },
    {
      "url": "github:w3c/web-annotation/raw/gh-pages/vocab/wd/ontology/oa.jsonld",
      # some more stuff
    },
  ],
  "generator": {
    "output": "dist",
  }
}
Eyas commented 5 years ago

We'd probably want a configurable wellKnown.ts that could be aware of both s.d.o semantics and owl and prov (and whatever else anno needs).

Makes sense. Biggest gaps seem like rdfs:range, rdfs:domain? Should be straightforward.

Further, we'd want everything checked in and static (or at least submoduled). Could the CLI support a config file?

A config file makes sense. The CLI npm package can also be included, by the way, and individual functions (e.g. WriteDeclarations) can be imported and called in whatever custom way you want. I agree a config file for the CLI makes sense though.

I guess I do have to ask the question of why parse triples instead of doing JSON-LD directly?

I had two implementations for this and ended up sticking with .nt for a few reasons. Parsing triples is weird, but it's pretty straightforward from there. The nice thing about triples once they're parsed is that they're very composable and pretty close to the metal as far as what relations they're describing.

While it's nice to take JSON-LD in, it's not particularly more ergonomic to handle its "@graph" definitions (you could iterate over keys, at which point you might as well have parsed triples). You'll also need to resolve "@context", etc. Nothing is inherently hard, it just didn't seem worthwhile in terms of trade-offs. Happy to reconsider if things change.

ceteri commented 5 years ago

A few critiques about points made above. There seems to be confusion between metadata vocabulary and format. Schema.org and others related to it (e.g., VoID, SAGE, etc.) are controlled vocabularies that help define entities and relationships for representing metadata about datasets and their use cases. Generally a project will define a local vocabulary which links to Schema.org or others, and so by their nature these are extensible -- you really never need to just pick one.

Historically, there are several formats used, and many of the popular open source libraries are quite good at converting between formats. JSON-LD is a good format for use cases that need to be machine readable, while Turtle is arguably the most compact format for use cases that need to be human readable, although it's trivial (~2 lines of Py) to convert between them. The popular upper ontologies such as Library of Congress, DBpedia, etc. will tend to use SKOS for organizing what's being represented -- although ultimately SKOS is built atop OWL, which is built atop RDF, so again these all interoperate well.

ceteri commented 5 years ago

It's curious why GraphQL is being used here for metadata services -- other than perhaps that it already has use elsewhere in Jupyter? While there's a notion of "knowledge graph about datasets" in the long term roadmap here, that's definitely not what the first part of the "GraphQL" name implies :) It's a protocol for services, as an alternative to REST, gRPC, etc., and especially good when, hypothetically, the same corporate entity controls the release cycles for both client and server and they're interested in optimizing API overhead for gazillions of ads served daily. However, GraphQL seems more about data served as trees and lists than about graphs; see any graphs in its examples? Also, it's not particularly good for serving metadata. For machine readable metadata, JSON-LD would be the most likely choice, and there is already precedent, e.g., JSON-LD markup in web pages used for metadata that search engines need. That would be much simpler for the consumers of this metadata service if it were a JSON-based service.

saulshanabrook commented 5 years ago

Thanks Paco for explaining this. I started to sketch out a client side API that would be vocabulary agnostic here. A "Metadata Provider" would have to have a get method that takes a URL and returns some JSON-LD about that URL.

In that proposal, GraphQL would not be part of the core API, although you would be free to implement a provider that uses GraphQL.
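
In TypeScript terms, that provider contract might look something like this sketch (all names are hypothetical, including the GraphQL query the example provider sends):

// Any JSON-LD payload; a real implementation might use stricter typing.
type JsonLdDocument = Record<string, unknown>;

// The vocabulary-agnostic contract: given a URL identifying an entity,
// return whatever JSON-LD the provider knows about it.
interface MetadataProvider {
  get(url: string): Promise<JsonLdDocument>;
}

// One possible provider backed by a GraphQL service -- GraphQL stays an
// implementation detail rather than part of the core API.
class GraphQLMetadataProvider implements MetadataProvider {
  constructor(private endpoint: string) {}

  async get(url: string): Promise<JsonLdDocument> {
    const response = await fetch(this.endpoint, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        query: "query ($url: String!) { metadata(url: $url) }",
        variables: { url },
      }),
    });
    const { data } = await response.json();
    return data.metadata;
  }
}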

ceteri commented 5 years ago

In terms of open standards for metadata services, it's odd not to find any mention here of Egeria https://egeria.odpi.org/ and the ASF + ODPi work on open source and open standards for that. Has there been any discussion of having an adapter? https://egeria.odpi.org/open-metadata-publication/website/open-metadata-integration-patterns/

ceteri commented 5 years ago

Thank you @saulshanabrook I'll add comments over there regarding use cases for vocabularies.

ceteri commented 5 years ago

The other general observation here is that there's been a lot of discussion about using standards and tools that come out of open data, as a guide for how to structure this metadata service for JupyterLab. Those are good to leverage, but they aren't definitive. One caveat is that so many of the use cases for Jupyter will not use open data.

Instead, it's probably best to go into this with a notion of "tiered access":

So it's important to keep that distinction between open metadata and open data.

saulshanabrook commented 5 years ago

I am going to close this for now, since we have a solid base for read only viewing of metadata.