Reflection on current project state, and a proposal for a metaschema

(EDIT March 8, 2019) Turns out what I proposed below is more-or-less RDF (glad to find out that people have already solved these sorts of problems for us!). We're now working on a proposal to offer an RDF/JSON-LD extendable schema served through GraphQL for JupyterLab. I'll update this post once we finish thinking through our next proposal.

Reflection on current project state

A main goal of this project is to expose the relevant parts of schema.org's schema as a GraphQL service (see https://github.com/jupyterlab/jupyterlab-metadata-service/issues/4). This is to facilitate the storage and retrieval of various "rich context" data within JupyterLab.

As of last week, we have a minimal working prototype of this, which contains a GraphQL server and an interface for querying the metadata of a dataset (i.e. a concrete implementation of this small part of schema.org). We also use this same GraphQL server to store users' comments on files in JupyterLab; more on that later.

The goal of that minimal working prototype was twofold: (1) to have something working to demo to the stakeholders of this project, and (2) to explore all relevant technologies by building a "vertical slice" of the software stack.

Now that we've built this vertical slice, I have formed some opinions on how we should change our approach. I hope this post can start a discussion!

Proposal for a metaschema

So, we already use two schemas in our minimal prototype:

Schema.org (as already mentioned) Used for dataset metadata, currently.
W3C Annotation Data Model Used for users' comments, currently.

Two points to this:

We already have two schemas! Since our goal is to give JupyterLab a "rich context", shared, extendable GraphQL service, it is reasonable to expect there will be more schemas needed in the future. Also, we have to consider that extensions may want to inject their own schema into this service. How might they do that?
Schema.org is huge. We probably don't want to explicitly write out the entire schema (that's a lot of code, although it could surely be auto-generated), even though that's how our first approach began (see our Dataset definition here). Also, we are not following Schema.org exactly, e.g. every property's value should be an array according to schema.org, which we do not currently allow. Also, we would need to do some crazy unioning to precisely follow how schema.org allows properties to have "one or more types as its domain" -- it would be a mess. (For detail, see schema.org's data model.) One more note: It is expected that for any given object, most of its properties will be unused -- thus again, it will be tedious to pass around concrete JS objects with all those fields defined yet mostly unused.

So, that was a description of the problems we've realized. To overcome those problems... I have some ideas below. Most of the ideas below were inspired by @saulshanabrook in one of our meetings. I've merely tried to articulate them further. (Saul, is below more-or-less what you were thinking?)

I propose we come up with a "metaschema", i.e. a schema that describes schemas. Another way to see this is that we would not implement schema.org as a "hard-coded schema", but instead it would be represented as data in the shape of a metaschema. Yet another way to say it: If you wanted to begin supporting a new part of schema.org (say, the FlightReservation type), you would do so by inserting data into the database to document the name of your new type ("FlightReservation") and to list out the properties it may have. This idea of a metaschema also makes it simple to support multiple schemas together. In general I believe the metaschema solves all the issues mentioned above:

Property's value should be an array: Idea: Concrete properties would be stored as three-tuples (object_id, property_name, value). To have an array of values, you just store multiple tuples with the same (object_id, property_name). At query-time, you could collapse those into an array of values, if you so desired.
Allows properties to have "one or more types as its domain": Every value will really be a two-tuple of (type, value). Every property's definition would contain an array of type specifiers, and we simply have some validator code to ensure you only assign a (type, value) tuple to a property if its type is appropriate (i.e. in the property's array of type specifiers).
It is expected that for any given object, most of the properties will be unused: There will simply be no (object_id, property_name, value) for property_names that don't exist on the given object_id.
Supporting more than one schema: Perhaps every object_id is a two-tuple (object_id, schema_id). Then objects from different schemas just coexist.
Allow extensions to inject their own schemas: This is done by simply inserting new records into the "metaschema".
Bonus Feature! We could also store a description of each property as data in the database, thus it could be queried and shown in the front-end UI. I imagine the UI will become simpler with this metaschema idea, as you could, as another example, query the database for all supported properties of a given type, and also retrieve the types allowed for each property's value.
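To make the tuple idea above concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the names PROPERTY_TYPES, add_value and get_object are hypothetical, the metaschema is a plain dict rather than database records, and the sample property names are made up. It only shows the mechanics: values stored as (object_id, property_name, (type, value)) tuples, validated against the property's allowed types, and collapsed into arrays at query time.

```python
from collections import defaultdict

# Hypothetical metaschema records: property name -> allowed value types.
# In the proposal these would be rows in the database, not a Python dict.
PROPERTY_TYPES = {
    "name": {"Text"},
    "author": {"Person", "Organization"},
}

# Concrete data: (object_id, property_name, (type, value)) tuples.
triples = []


def add_value(object_id, property_name, value_type, value):
    """Validate the (type, value) pair against the metaschema, then store it."""
    allowed = PROPERTY_TYPES.get(property_name)
    if allowed is None:
        raise KeyError(f"unknown property: {property_name}")
    if value_type not in allowed:
        raise TypeError(f"{property_name} does not accept {value_type}")
    triples.append((object_id, property_name, (value_type, value)))


def get_object(object_id):
    """Collapse all tuples for one object into {property: [values, ...]}."""
    collapsed = defaultdict(list)
    for oid, prop, (_value_type, value) in triples:
        if oid == object_id:
            collapsed[prop].append(value)
    return dict(collapsed)


add_value("dataset1", "name", "Text", "EPESE")
add_value("dataset1", "author", "Person", "person1")
add_value("dataset1", "author", "Organization", "org1")
print(get_object("dataset1"))
```

Unused properties cost nothing here because they simply have no tuples, and supporting a second schema would only mean widening the object key to something like (schema_id, object_id).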

Well, at this point I hope I've given a lot to either agree or disagree with! Thoughts from the group? (Specifically @saulshanabrook @xmnlab @ellisonbg @bollwyvl)

I do think it makes sense to explore this. A couple of points:

When talking to scientists working in large collaborations this last weekend, they were really excited about this metadata work. However, they immediately said "we would want to add new types for our sciency things, like atoms, brains, galaxies, etc.". IOW, they want extensibility to add new schemas and still have links to objects in existing ones.
It would be helpful to point to each schema file in some standard manner.

JSON-LD seems to be the recommended (Google, origins) standard. It can encompass Schema.org. The latest version, 1.1, is in draft and has some interactive examples on the spec.

We could store a denormalized version of JSON-LD in the backend as well as a denormalized version of context/schema so we know what fields are valid without having to send an HTTP request for the schema.

Too much fun to fully write on my phone!

Yeah, by adopting these two contexts, we've already set up a lot of work if it can't be handled in a mostly automated fashion. Automation, akin to how schema-dts or pythreejs work, is the right path. Using the highest fidelity canonical description (they both publish jsonld of their meta-model) is probably the only way to get this done, and will implicitly handle the multiplicity and inheritance issues.

These would get you to dumb, but type-checkable, classes in many Jupyter languages. But they would be derived from the canonical definition. Combining the JSON-LD contexts from SDO/WADM one could derive a canonical serialization format, and only have to explicitly handle conflicts like Person, Organization and Dataset.

So that's Read and Print... what about Execute? Resolving these types is another task, and should be pretty decoupled from the contexts we implement as a concrete schema.

It would be folly to ignore actual graph implementations that expose a "real" graph query language.

For example, a schema provider backed by a full-on graph store could just build sparql/gremlin... rdfalchemy and sqlite would be enough for the single user experience. Bigger deployments might already be graphql-aware, like:

https://dgraph.io/ https://edgedb.com/

But would be harder to configure. At any rate, not only might you have multiple kinds of storage, you might use more than one storage/resolver on the same server for the same type. So this will take some serious thought, especially when it comes to things like pagination across multiple sources.

Both extensible schema and even extensible types/unions are important. I started on adding extensible schema (new types, which can reference/extend existing ones) on my prototype:

https://github.com/deathbeds/jupyter-graphql/pull/3/files#diff-8a9380c1249ac99297a763e1f9a4ee77

A pip-installable extension can add some things (query, mutation, subs) defined by graphene types.

Further (unpushed, for some reason) work adds an example of extending a type by adding fields, and while it's really ugly, using python type("",(),{}) magic, it does work.

The first example I tried extends notebook metadata, adding SlideShowMetaBase to the CellMetaData type. The contents plugin advertises this as another entry_point. I would rather use a union or something, but there's no multiple inheritance, so it would really only work in specific cases.

Here is a possible implementation story, inspired by a conversation with @dcharbon and looking again at the JSON LD spec.

User Stories

Users will want to click on a resource and see relevant metadata about it. They will want to be able to edit that metadata as well as click on resources in the metadata to see metadata about those.

Metadata providers want to be able to provide metadata for the user for certain resources. The metadata they provide will have different fields that likely come from some type specification for the type of object they are describing. As users edit the metadata, providers need to be notified of these changes so that they can update where they store the metadata.

Data model

A resource in this context is a Linked Data node, as laid out in the JSON LD spec. So it has some @id that is its IRI/URL, as well as a number of @types, and other attributes.

So as a Metadata Provider, you have to define a way to query yourself to see if you have metadata about a resource, and if you do, to return that in the expanded JSON LD syntax. You also have to define a way to update yourself with an updated version of the metadata.

The metadata explorer will see what the active resource is and query each of the providers to see if it has data about that resource. The first that does will be displayed to the user. Primitive types will be displayed without links, but types that link to other IDs will be displayed as links. All existing fields are editable, but the user cannot add new fields. This removes the need to process the type at all to understand what all the valid fields for it could be. The ability to add new fields from the UI could be added at a later date. If a user edits a field, the provider that had that metadata gets notified with the updated object.
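The lookup and edit flow described above could look roughly like the sketch below. It is only an illustration under assumptions: InMemoryProvider, find_metadata and edit_field are hypothetical names, the provider is a toy dict-backed store, and the example.org resource is made up. It shows the first provider with data winning, edits being limited to existing fields, and the owning provider being notified with the updated object.

```python
class InMemoryProvider:
    """Hypothetical provider holding expanded JSON-LD nodes keyed by @id."""

    def __init__(self, nodes):
        self.nodes = {node["@id"]: node for node in nodes}

    def get_resource(self, resource_id):
        # Return None when this provider knows nothing about the resource.
        return self.nodes.get(resource_id)

    def update_resource(self, node):
        self.nodes[node["@id"]] = node


def find_metadata(providers, resource_id):
    """Return (provider, node) from the first provider that has data."""
    for provider in providers:
        node = provider.get_resource(resource_id)
        if node is not None:
            return provider, node
    return None, None


def edit_field(provider, node, field, new_value):
    """Edit an existing field only, then notify the owning provider."""
    if field not in node:
        raise KeyError(f"cannot add new field: {field}")
    updated = {**node, field: new_value}
    provider.update_resource(updated)
    return updated


resource = "http://example.org/data.csv"  # made-up resource IRI
empty = InMemoryProvider([])
store = InMemoryProvider(
    [{"@id": resource, "http://schema.org/name": [{"@value": "Old name"}]}]
)

provider, node = find_metadata([empty, store], resource)
edit_field(provider, node, "http://schema.org/name", [{"@value": "New name"}])
print(store.get_resource(resource))
```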

This implementation allows us to integrate our existing graphql backend, but the core APIs would not depend on it and allow users to define other backends however they want to provide and persist metadata.

The major technical hurdles I see here are creating proper editable UIs given arbitrary JSON LD nodes and communicating the proper structure of the nodes that the data provider should return.

I like this idea - this is really the type of problem that JSON LD was invented to solve. Questions and thoughts:

Working with JSON LD can be a bit painful. I am a bit hesitant to force this on all JLab extensions wanting to work with metadata. Any libraries to make this less painful?
I am not quite clear on how you are envisioning the backend(s) for this working. Can you talk more about that. I would be hesitant to have multiple different metadata backends.
On your question of rendering UIs for arbitrary JSON LD nodes, I think we will be saved by the relatively minimal type model of JSON. The most difficult would be strings, which could be a range of different things (comma-separated lists, URLs, text, etc.)

Working with JSON LD can be a bit painful. I am a bit hesitant to force this on all JLab extensions wanting to work with metadata. Any libraries to make this less painful?

I have seen the schema-dts library which lets you generate TypeScript types for different Schema.org types. We are pushing the boundaries here, so we would probably end up having to create any tools we need. It doesn't seem that hard to create JSON LD, like this example:

{
  "@context": "http://schema.org/",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

There is also the standard jsonld.js library which lets you translate between different forms of JSON LD.

I am not quite clear on how you are envisioning the backend(s) for this working. Can you talk more about that. I would be hesitant to have multiple different metadata backends

Sure. This is the interface I am thinking about:

type LinkedData = {
    '@id': string,
    [prop: string] : any
}

interface IMetadataProvider {
    // Maybe this method is not required
    listResources(): Promise<Array<URL>>;

    getResource(resource: URL): Promise<LinkedData>;
    updateResource(data: LinkedData): Promise<void>;
}

Multiple providers would be useful if you already have your metadata stored somewhere, and don't wanna replicate it into a local GraphQL database. Instead, you can access your existing store however you like client side, as long as you can query it about resources. It also provides an abstraction layer over graphql, so if we wanna move metadata storage into the real time data store, we can do this by implementing a new provider, without changing the metadata extension.

There is an implementation of a GraphQL server using JSON-LD concepts: https://www.hypergraphql.org/ ... not sure yet if it could be helpful.

About the GraphQL layer, one thing we need to keep in mind is that it is strongly typed ... so working with generic structures doesn't work very well ... that is why I am investigating graphql-schema-org.

As @xmnlab mentions, with GraphQL's typed schema, you can't extend the schema at run-time. I.e. If we wanted RDF's notion of "say anything about anything", then we'd need to look elsewhere for a solution. (Right?)

So, maybe we should step back and say: "How extendable do we actually want this metadata service to be?"

The way I see it we have a few options:

Extendable only by modifying the code. I.e. "You want to extend it? Send in a PR to this repo."
Extendable by pip install my_super_jupyter_metadata_schema and restart JupyterLab. This is analogous to @bollwyvl's PR he posted above (albeit he is using Graphene instead of Apollo). Within this option are two sub-options: (1) the ability to extend only by adding a new top-level schema, and/or (2) the ability to extend any existing type or add types to an existing schema.
Extendable at run-time. I.e. Pure RDF (as I understand RDF...). The idea here is at runtime, as a user, you could say "You know, my files really need a favorite color." So you go edit each file's metadata to add a property named favorite_color, and that persists and is visible to everyone else as well.

Option 1 is obviously not what we want.

Option 2 is interesting... if we choose this option, there are of course many more questions to answer, but GraphQL could do this as @bollwyvl has already shown (via Graphene for Python).

Option 3 is also interesting. It takes more of the semantic web mindset. @dcharbon has thoughts on how to go about this, which he has partially shared with us.

@ellisonbg Which option above is most in line with your thoughts?

Extendable at run-time. I.e. Pure RDF (as I understand RDF...). The idea here is at runtime, as a user, you could say "You know, my files really need a favorite color." So you go edit each file's metadata to add a property named favorite_color, and that persists and is visible to everyone else as well.

I wrote up some notes explaining this idea more, to articulate how we might support arbitrary types of schemas. I think for now we are going to work on getting the current approach working with editing, and then eventually try to create a JupyterLab extension API for this kind of system:

Goals:

Allow different metadata providers to expose their own objects in their own schemas
Have dynamic UI that can render these different backends
For the first version, only support reads

JupyterLab Metadata Extension:

Calls get on the MetadataService with the URL of the active document.
Have a default way of rendering each schema type, so that it knows how to display things like strings, lists, and integers.
- Any fields that link to other URLs show up as links
(Later) Have extensible way to add new views for different resources. So an extension author can say “Hey I know how to view a Person, let me do that” and so then when it finds a person it won’t use its default renderer, it will instead hand it off to that plugin

JupyterLab Metadata API:

Have a MetadataService where you can register a MetadataProvider.
Each MetadataProvider should have a get method that takes in a URL of a resource and returns JSON LD about that URL. This should be the “Expanded” JSON LD (https://json-ld.org/playground/), so that we know the full path of each key
The MetadataService also has a get method that takes a URL. It calls get on all the providers and returns a merged version of the JSON LD
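A rough sketch of that registry, to make the merge step tangible. The class and method names follow the notes above, but the merge policy is my assumption: since expanded JSON LD stores every property as a list, the sketch unions the lists per key and drops exact duplicates. DictProvider and the example.org IRIs are made up for the demo.

```python
class DictProvider:
    """Hypothetical provider backed by a dict of expanded JSON-LD nodes."""

    def __init__(self, nodes):
        self._nodes = nodes

    def get(self, url):
        return self._nodes.get(url)


class MetadataService:
    def __init__(self):
        self._providers = []

    def register(self, provider):
        self._providers.append(provider)

    def get(self, url):
        """Ask every provider about url and union the lists per property."""
        merged = {"@id": url}
        for provider in self._providers:
            node = provider.get(url)
            if not node:
                continue
            for key, values in node.items():
                if key == "@id":
                    continue
                bucket = merged.setdefault(key, [])
                for value in values:
                    if value not in bucket:  # assumed policy: drop duplicates
                        bucket.append(value)
        return merged


url = "http://example.org/dataset381"  # made-up IRI
title = "http://purl.org/dc/terms/title"
publisher = "http://purl.org/dc/terms/publisher"

service = MetadataService()
service.register(DictProvider({url: {"@id": url, title: [{"@value": "EPESE"}]}}))
service.register(
    DictProvider(
        {
            url: {
                "@id": url,
                title: [{"@value": "EPESE"}],
                publisher: [{"@id": "http://example.org/duke_univ"}],
            }
        }
    )
)
merged = service.get(url)
print(merged)
```

Conflicting values from two providers would both survive under this policy, so a real implementation would still need to decide how to present or resolve them.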

JupyterLab Metadata Server:

Have one table, like that of the “Table” view (https://json-ld.org/playground/) of JSON-LD. It would have three fields:
- Subject: The URL of the object we are talking about (“http://example.org/cars/for-sale#tesla”)
- Predicate: the relation (“http://purl.org/goodrelations/v1#name”)
- Object: The value of this field (“Used Tesla Roadster”)
This would be registered in the frontend as a MetadataProvider
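As a sketch of the single-table idea, here is what the server side might look like with sqlite from the Python standard library. The table and function names are assumptions, and the sketch deliberately simplifies one thing: every object is returned as a literal ("@value"), whereas a real table would need something like an extra column to tell IRIs apart from literals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples ("
    "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
)

subject = "http://example.org/cars/for-sale#tesla"
conn.execute(
    "INSERT INTO triples VALUES (?, ?, ?)",
    (subject, "http://purl.org/goodrelations/v1#name", "Used Tesla Roadster"),
)


def get(conn, subject):
    """Rebuild an expanded-style JSON-LD node from the rows of one subject."""
    node = {"@id": subject}
    rows = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ? ORDER BY rowid",
        (subject,),
    )
    for predicate, obj in rows:
        # Simplification: treat every object as a plain literal.
        node.setdefault(predicate, []).append({"@value": obj})
    return node


print(get(conn, subject))
```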

Here's an example (trivial example, but it shows the point) of how controlled vocabularies are referenced in use cases, i.e., in an example integration of JupyterLab Metadata Service: https://github.com/Coleridge-Initiative/adrf-onto/blob/master/adrf.ttl Note that almost always there are multiple vocabularies being both blended and extended.

Thank you, that's very helpful to see. I haven't used Turtle at all before now. It would also be helpful to see how that matches up with a particular instance of some data at some point.

We're building out examples from the ADRF framework -- the NYU project which will use these data registry and metadata service features in Jupyter -- that use Turtle and JSON-LD interchangeably, depending on "what" is reading the file. Will share those with the project here.

From an AI practitioner standpoint, I would expect my peers to use Turtle in human-curated definitions.

Also, the wiki in that ADRF repo above links to more details and resources about Turtle, JSON-LD, other vocabularies, etc.

Here's an example of a formal metadata description for a dataset, based on training data used in the Rich Context Competition:

:dataset381
  rdf:type dctypes:Dataset;
  dct:title "Established Populations for Epidemiologic Studies of the Elderly Project"@en;
  dct:alternative "EPESE"@en;
  dct:description "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"@en;
  pav:createdOn "1993-02-01"^^xsd:date;
  dct:identifier "8481423"@en;
  foaf:page <https://www.ncbi.nlm.nih.gov/pubmed/8481423>;
  dct:publisher :duke_univ;
  pav:curatedBy :cornoni-huntley_j;
  .

That resolves (again, with ~7 lines of Py) into a graph:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ns1: <http://purl.org/pav/> .
@prefix ns2: <http://xmlns.com/foaf/0.1/> .
@prefix ns3: <http://www.loc.gov/mads/rdf/v1#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381> a dcmitype:Dataset ;
    dct:alternative "EPESE"@en ;
    dct:description "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"@en ;
    dct:identifier "8481423"@en ;
    dct:publisher <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ> ;
    dct:title "Established Populations for Epidemiologic Studies of the Elderly Project"@en ;
    ns1:createdOn "1993-02-01"^^xsd:date ;
    ns1:curatedBy <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j> ;
    ns2:page <https://www.ncbi.nlm.nih.gov/pubmed/8481423> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset> a dcmitype:Dataset,
        skos:Concept ;
    ns3:hasRelatedAuthority <http://id.loc.gov/authorities/subjects/sh2018002256> ;
    = <http://dbpedia.org/resource/Data_set> ;
    skos:definition "A collection of tables and metadata, managed by a responsible party"@en ;
    skos:inScheme <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> ;
    skos:prefLabel "dataset"@en ;
    skos:topConceptOf <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic> a ns3:Authority,
        ns3:Topic ;
    skos:definition "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"@en ;
    skos:inScheme <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> ;
    skos:topConceptOf <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> a skos:ConceptScheme ;
    skos:definition "A mid-level ontology for linked data within the ADRF framework use cases"@en ;
    skos:hasTopConcept <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset>,
        <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic> ;
    skos:prefLabel "ADRF Ontology"@en .

This is a small example -- only a single entity represented -- although it shows how a knowledge graph looks.

The following presentation describes how to handle metadata for linked data, data catalogs in practice, leading into knowledge graph work for reproducible science: https://www.slideshare.net/tplasterer/dataset-catalogs-as-a-foundation-for-fair-data

One of the better online specs/tutorials for how to handle this kind of metadata markup is at: https://www.w3.org/TR/hcls-dataset/#appendix_1

Our WIP code example for NYU and knowledge graph in social science research across US fed/state/local agencies is at https://github.com/Coleridge-Initiative/adrf-onto/

What's shown above is an example of how metadata about linked data for social science also applies in life sciences, etc. It doesn't illustrate the curation/data stewardship links (next on my TODO list). Even so, note that this level of governance will be applied to datasets and publications throughout the sciences, and given the push for compliance, data privacy, provenance, audits, etc., similar kinds of graph-based data governance are showing up for finance, healthcare, manufacturing, etc. I have a hunch that, in talking with Capital One, Two Sigma, or Bloomberg, their use of metadata about datasets would look similar to this. Ongoing regulatory compliance efforts will push that point even further. That's why I'm urging Jupyter to consider more of this approach for the Metadata Service.

Then, 2 more lines of Py:

from rdflib.serializer import Serializer
print(graph.serialize(format="json-ld", indent=2))

Can transform that graph into JSON-LD, so that it's easily read by machines (without any special parsing beyond JSON) and also readily exchangeable via APIs:

[
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#Concept",
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2018002256"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_set"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A collection of tables and metadata, managed by a responsible party"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic",
    "@type": [
      "http://www.loc.gov/mads/rdf/v1#Authority",
      "http://www.loc.gov/mads/rdf/v1#Topic"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": [
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "EPESE"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@language": "en",
        "@value": "8481423"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "1993-02-01"
      }
    ],
    "http://purl.org/pav/curatedBy": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#ConceptScheme"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    ]
  }
]

Note that I've pretty-printed here to help visualize it, though this JSON-LD would compress during API calls.

@ceteri Thanks for these examples, this is really helpful.

So I could see the API as taking the ID of an entity and returning the flattened JSON-LD for that entity. Then we could generate a UI around these field mappings.

console.log(myDataProvider.get('https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381'))

{
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": [
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "EPESE"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@language": "en",
        "@value": "8481423"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "1993-02-01"
      }
    ],
    "http://purl.org/pav/curatedBy": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"
      }
    ]
  }

Excellent, for example that could be rendered as nested tables in a reasonably compact way. A user could click to follow the links of IRI dereferencing to understand what the type definitions mean. In other words, click through to get "http://purl.org/dc/terms/title" which any browser can do simply.
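For what it's worth, deciding what becomes a link and what becomes plain text falls out of the expanded form fairly directly. The sketch below is an assumption about how a renderer might do it (render_rows is a hypothetical helper, not an existing API): anything carrying "@id", plus every "@type" entry, is a link; anything carrying "@value" is text. The node is a trimmed copy of the dataset381 example above.

```python
def render_rows(node):
    """Flatten one expanded JSON-LD node into (property, kind, text) rows."""
    rows = []
    for key, values in node.items():
        if key == "@id":
            continue
        for value in values:
            if key == "@type":
                # Type entries are bare IRI strings in expanded form.
                rows.append((key, "link", value))
            elif "@id" in value:
                rows.append((key, "link", value["@id"]))
            else:
                rows.append((key, "text", value.get("@value")))
    return rows


node = {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": ["http://purl.org/dc/dcmitype/Dataset"],
    "http://purl.org/dc/terms/alternative": [{"@language": "en", "@value": "EPESE"}],
    "http://xmlns.com/foaf/0.1/page": [
        {"@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"}
    ],
}

for row in render_rows(node):
    print(row)
```

The property keys themselves are IRIs too, so a UI could make the first column clickable as well.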

This would be one way to view a graph of metadata used to answer the questions that @ellisonbg enumerated: Note that each of those entities has other fields. This is a start for how I could see the metadata used in practice.

@tonyfast does that fit with what you're thinking too?

Another general note about open standards for metadata about datasets used in science:

I'd recommend reading the FAIR Data Principles, originally described in

Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).

The FAIR acronym stands for Findable, Accessible, Interoperable, Reusable -- as guidelines for data infrastructure to support reproducible research.

While that article does not specifically mention Jupyter, reading between the lines there's so much overlap between that widely accepted set of FAIR practices and intentions for the JupyterLab metadata service.

We're probably in agreement, but there is a lot of different language being used to talk about this problem so I cannot say exactly.

I see the analysis project as a graph that can be searched. Its contents (virtual file systems) are enriched by type information & typology. Graph and structured databases will execute queries; and the query language bounds the questions we can ask. From this perspective, we can ask questions composed in the specified query language.

For example, given this sqlite database of a couple hundred notebooks, we could ask questions like:

A graph database could answer different questions, and a more general tool would query all types of files as data. There could be so many ways to query, "blood sample please?".

At this point though, I think I'm stuck on how did the types get there? Who is annotating information with metadata? Where is the knowledge coming from? How do we know what types are salient?

From the json-ld perspective, I think that directories should permit multiple contexts with "*.jsonld" files. These contexts establish a shared language for resources across the project. These files would include documentation IRIs, demo IRIs, web types, and python type IRIs. A simple example would have production and research contexts. Context is a feature toggle, kinda sorta.
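One possible reading of the per-directory contexts idea, sketched with the standard library only. The helper name load_contexts, the file names research.jsonld and production.jsonld, and the term mappings are all made up for illustration; the point is just that each file contributes a named @context, and picking one of them changes what a short term like "name" resolves to.

```python
import json
import tempfile
from pathlib import Path


def load_contexts(directory):
    """Map each *.jsonld file's stem to the @context it declares."""
    contexts = {}
    for path in sorted(Path(directory).glob("*.jsonld")):
        document = json.loads(path.read_text(encoding="utf-8"))
        contexts[path.stem] = document.get("@context", {})
    return contexts


# Demo with two made-up context files in a temporary project directory.
with tempfile.TemporaryDirectory() as project:
    Path(project, "research.jsonld").write_text(
        json.dumps({"@context": {"name": "http://schema.org/name"}}),
        encoding="utf-8",
    )
    Path(project, "production.jsonld").write_text(
        json.dumps({"@context": {"name": "http://purl.org/dc/terms/title"}}),
        encoding="utf-8",
    )
    contexts = load_contexts(project)

# Switching the active context acts a bit like a feature toggle.
active = contexts["research"]
print(active["name"])
```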

A lot of the metadata desired in the diagram could be recorded as a stream of context information. Which I think raises the important question again, "How are the types being annotated?"

Based on some experiments with pyld, it appears that both web annotations and metadata can be represented by a @graph in a JSON-LD expansion. The notebook below experiments with composing types using Python annotations and rdflib objects.

https://gist.github.com/tonyfast/443c7b5b23449ef9fe7b024538ff2261

@ceteri I see you have one piece of linked data here as an example. Do you have a larger set readily available? I am putting together a little prototype for the metadata explorer and would like to use some of your real-ish data if possible.

EDIT: Ideally it would be good to have an example with multiple entities that link to each other.

@saulshanabrook @tonyfast: last weekend at Sci Foo, @ellisonbg and I sketched out functionality for a minimum viable UI to demonstrate the Metadata Explorer.

1/ Let's assume we're starting with some file type that represents the metadata for a specific dataset. JupyterLab knows to launch this UI based on the file type -- or mime type for metadata from a URI.

For example, let's say a dataset has been registered through the Data Explorer, i.e., a Jupyter notebook knows where to look up metadata for it.

2/ pull down the metadata, render it with hyperlinks on any URI among the properties in the metadata

For example, the EPESE dataset may have a DOI identifier that links out to:

We also must follow the metadata references to papers that cite usage of the dataset:

https://academic.oup.com/aje/article/157/7/633/69850

For each author, there will be pages on ORCID, Google Scholar, ResearchGate, etc.:

https://scholar.google.com/citations?user=FrqMsdQAAAAJ&hl=en&oi=sra

A dataset will also have a Data Provider, such as:

https://www.icpsr.umich.edu/icpsrweb/

3/ So the UI should provide means for a user to follow those links. Let people click through to the next page's metadata.

4/ Generally when we're following links from a web page, those are URIs that resolve to HTML, which browsers render. Here we are following a secondary layer of metadata that's often embedded in the web pages -- sometimes called micro data.

Not all providers will have endpoint URIs to obtain just the metadata.

We can build scrapers/gateways for enough of them to demo the UI. Also, there aren't a large number of these kinds of sites. We can also work with the providers to get additional endpoints available -- that's a current dialog among scientific publishers. I've already talked with a Google eng manager about this, and they'd be interested, plus they have the eng resources to support it. It's not a lot of work either.

5/ The strategy is to follow these links as much as possible. That would imply a traversal of the graph shown above in https://github.com/jupyterlab/jupyterlab-metadata-service/issues/23#issuecomment-506840526
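The link-following in steps 2/ through 5/ could be sketched as a breadth-first walk. This is only an illustration: `fetch` is a hypothetical lookup (a scraper, a gateway, or an in-memory table) that returns an expanded JSON-LD node for an id, or `None` when no metadata is available, and `max_nodes` is an assumed safety limit.

```python
from collections import deque
from typing import Callable, Iterator, Optional


def linked_ids(node: dict) -> Iterator[str]:
    """Yield every "@id" referenced from the property values of an expanded node."""
    for key, values in node.items():
        if key.startswith("@"):
            continue  # skip keywords such as "@id" and "@type"
        for value in values:
            if isinstance(value, dict) and "@id" in value:
                yield value["@id"]


def traverse(
    start: str,
    fetch: Callable[[str], Optional[dict]],
    max_nodes: int = 100,
) -> list[str]:
    """Breadth-first walk over linked metadata; returns ids in visit order."""
    seen, order, queue = {start}, [], deque([start])
    while queue and len(order) < max_nodes:
        current = queue.popleft()
        node = fetch(current)
        if node is None:
            continue  # no metadata endpoint for this id; nothing to follow
        order.append(current)
        for target in linked_ids(node):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

In a UI the walk would be driven by user clicks rather than run eagerly, but the same `seen` bookkeeping avoids revisiting nodes when papers, authors, and datasets link back to each other.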

Next up, I'll develop a better sample file to use, in JSON-LD

Footnote: one way to integrate with the scrapers/gateways would be to register a URI pattern, then when a user clicks a link with that URI pattern in the metadata explorer UI, we use the scraper/gateway instead of simply doing the HTTP GET on the URI.

For example, the HTML results on Google Dataset Search are basically a list of JSON, plus some JavaScript to render it. It's not hard to scrape that kind of response and build a metadata gateway for it. The other popular providers, such as ORCID and ResearchGate, have embedded metadata (aka "micro data") that we can scrape similarly.
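The URI-pattern registration from the footnote might look roughly like the sketch below. The class name `GatewayRegistry`, its methods, and the use of regular expressions as the pattern language are assumptions made for illustration; no such API exists in the project yet.

```python
import re
from typing import Callable, Optional

# A gateway takes a URI and returns metadata for it (hypothetical signature).
Gateway = Callable[[str], dict]


class GatewayRegistry:
    """Maps URI patterns to gateway functions that return metadata for a URI."""

    def __init__(self) -> None:
        self._gateways = []  # list of (compiled pattern, gateway) pairs

    def register(self, pattern: str, gateway: Gateway) -> None:
        """Register `gateway` for URIs that match `pattern` from their start."""
        self._gateways.append((re.compile(pattern), gateway))

    def resolve(self, uri: str) -> Optional[Gateway]:
        """Return the first registered gateway whose pattern matches, else None."""
        for pattern, gateway in self._gateways:
            if pattern.match(uri):
                return gateway
        return None
```

When `resolve` returns `None`, the explorer would fall back to a plain HTTP GET on the URI, as described in the footnote.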

This is a little demo I put together to show a simple linked data explorer https://github.com/jupyterlab/jupyterlab-metadata-service/pull/27

The core of it is a linked data provider that has a function that takes a URL and returns some linked data. We can hook this up to a server extension that serves up scraped data about certain URLs.

@saulshanabrook @tonyfast @ellisonbg here's TTL for an example from the Rich Context Competition which includes 2 research papers, 2 datasets used by them, 7 authors, and the 2 data providers: https://github.com/Coleridge-Initiative/adrf-onto/blob/master/rcc.ttl

:Catalog and :Corpus are collections within the graph for datasets and research publications respectively

Here's that same metadata graph converted to JSON-LD (since GH doesn't yet support attaching JSON files?)

[
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "CSFII"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "1-day dietary intakes of men 19 to 50 years of age living in the United States in 1985"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.3886/ICPSR21960.v1"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#usda"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Continuing Survey of Food Intakes by Individuals"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2009-01-27"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/21960/version/1"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#tiffany_l_gary",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Tiffany L Gary"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://orcid.org/0000-0001-9843-1084"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Catalog",
    "@type": [
      "http://www.w3.org/ns/dcat#Catalog"
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@id": "http://id.loc.gov/vocabulary/iso639-1/en"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@value": "ADRF Data Catalog"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@value": "ADRF Data Catalog"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ],
    "http://xmlns.com/foaf/0.1/homepage": [
      {
        "@id": "https://coleridgeinitiative.org/"
      }
    ],
    "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication340",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
    ],
    "http://prismstandard.org/namespaces/basic/2.0/publicationDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2009-11-11"
      }
    ],
    "http://purl.org/dc/terms/creator": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#virginia_w_chang"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dawn_e_alley"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.1093/gerona/glp177"
      }
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@value": "en"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@language": "en",
        "@value": "The Journals of Gerontology: Series A"
      }
    ],
    "http://purl.org/dc/terms/subject": [
      {
        "@language": "en",
        "@value": "Lipids"
      },
      {
        "@language": "en",
        "@value": "Body mass index"
      },
      {
        "@language": "en",
        "@value": "Weight history"
      },
      {
        "@language": "en",
        "@value": "Metabolic syndrome"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Metabolic syndrome and weight gain in adulthood"
      }
    ],
    "http://purl.org/spar/cito/citesAsDataSource": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dawn_e_alley",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Dawn E Alley"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=14041087200"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "NHANES"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A program of studies designed to assess the health and nutritional status of adults and children in the United States"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.3886/ICPSR25501.v4"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#national_center_for_health_statistics"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "National Health and Nutrition Examination Survey"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2012-02-22"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/NACDA/series/39"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#national_center_for_health_statistics",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "National Center for Health Statistics"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.cdc.gov/nchs/index.htm"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#virginia_w_chang",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Virginia W Chang"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.semanticscholar.org/author/Virginia-W-Chang/40382448"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication",
    "@type": [
      "http://www.w3.org/2002/07/owl#ObjectProperty",
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
    ],
    "http://www.w3.org/2000/01/rdf-schema#domain": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Corpus"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@language": "en",
        "@value": "publication"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#range": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#subPropertyOf": [
      {
        "@id": "http://purl.org/dc/terms/hasPart"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher",
    "@type": [
      "http://xmlns.com/foaf/0.1/Person"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh85089630"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Author"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "An author of a research publication that uses datasets for research"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "author"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#youfa_wang",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Youfa Wang"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://scholar.google.com/citations?user=cHpphu0AAAAJ&hl=en&oi=ao"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider",
    "@type": [
      "http://xmlns.com/foaf/0.1/Organization"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh85066157"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_publishing"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "An organizaiton that publishes and curates research datasets"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset provider"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication338",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
    ],
    "http://prismstandard.org/namespaces/basic/2.0/publicationDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2010-03-01"
      }
    ],
    "http://purl.org/dc/terms/creator": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#robert_lawrence"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#youfa_wang"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#may_a_beydoun"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#benjamin_caballero"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#tiffany_l_gary"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.1017/S1368980010000224"
      }
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@value": "en"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@language": "en",
        "@value": "Public Health Nutrition"
      }
    ],
    "http://purl.org/dc/terms/subject": [
      {
        "@language": "en",
        "@value": "Diet"
      },
      {
        "@language": "en",
        "@value": "Trend"
      },
      {
        "@language": "en",
        "@value": "Food intake"
      },
      {
        "@language": "en",
        "@value": "Meat consumption"
      },
      {
        "@language": "en",
        "@value": "United States"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Trends and correlates in meat consumption patterns in the US adult population"
      }
    ],
    "http://purl.org/spar/cito/citesAsDataSource": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic",
    "@type": [
      "http://www.loc.gov/mads/rdf/v1#Topic",
      "http://www.loc.gov/mads/rdf/v1#Authority"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#robert_lawrence",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Robert Lawrence"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=7201490909"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#may_a_beydoun",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "May A Beydoun"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.researchgate.net/profile/May_Beydoun"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#benjamin_caballero",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Benjamin Caballero"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://orcid.org/0000-0003-4311-6321"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication",
    "@type": [
      "http://purl.org/spar/fabio/ResearchPaper"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2004003366"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Publication"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A research publication that uses datasets for research"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "research publication"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#usda",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "USDA"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.usda.gov/"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Corpus",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#Collection"
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@value": "ADRF Corpus of research publications"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@value": "ADRF Corpus"
      }
    ],
    "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication340"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication338"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset",
    "@type": [
      "http://www.w3.org/ns/dcat#Dataset"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2018002256"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_set"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A collection of tables and metadata used within the ADRF framework, managed by a responsible party"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#ConceptScheme"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    ]
  }
]
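As a small illustration of how an explorer could walk this expanded form, the sketch below indexes nodes by `@id` and follows `citesAsDataSource` links from a publication to its datasets. The `graph` list is a hand-trimmed excerpt of the data above (most properties omitted), and the helper name `cited_dataset_titles` is made up for this example.

```python
ADRF = "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#"
CITES = "http://purl.org/spar/cito/citesAsDataSource"
TITLE = "http://purl.org/dc/terms/title"

# Hand-trimmed excerpt of the expanded graph; only ids, titles, and citations kept.
graph = [
    {"@id": ADRF + "dataset481",
     TITLE: [{"@language": "en",
              "@value": "National Health and Nutrition Examination Survey"}]},
    {"@id": ADRF + "dataset_x001",
     TITLE: [{"@language": "en",
              "@value": "Continuing Survey of Food Intakes by Individuals"}]},
    {"@id": ADRF + "publication340",
     CITES: [{"@id": ADRF + "dataset481"}]},
    {"@id": ADRF + "publication338",
     CITES: [{"@id": ADRF + "dataset481"}, {"@id": ADRF + "dataset_x001"}]},
]

# In expanded JSON-LD every node carries its full IRI, so a dict index is enough.
by_id = {node["@id"]: node for node in graph}


def cited_dataset_titles(publication_id: str) -> list[str]:
    """Follow citesAsDataSource links from a publication; return dataset titles."""
    titles = []
    for ref in by_id[publication_id].get(CITES, []):
        dataset = by_id.get(ref["@id"], {})
        titles.extend(value["@value"] for value in dataset.get(TITLE, []))
    return titles
```

Because every property value in the expanded form is a list, the lookup code needs no special cases for single versus multiple values.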

And the above, but compacted with this @context:

{
  "@context": {
    "adrf": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#",
    "cito": "http://purl.org/spar/cito/",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pav": "http://purl.org/pav/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "orcid": "https://orcid.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dbpedia": "http://dbpedia.org/resource/",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "authority": "http://id.loc.gov/authorities/subjects/",
    "fabio": "http://purl.org/spar/fabio/",
    "doi": "https://doi.org/",
    "iso639": "http://id.loc.gov/vocabulary/iso639-1/"
  }
}
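The prefix map above accounts for most of the shortening in the compacted graph that follows. The sketch below shows only that prefix step, using a subset of the prefixes; `compact_iri` and `expand_iri` are illustrative helpers, not a JSON-LD library API, and real compaction also handles things this sketch ignores (for example collapsing single-item arrays and plain string values).

```python
# A subset of the prefixes defined in the @context above.
CONTEXT = {
    "adrf": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def compact_iri(iri: str, context: dict = CONTEXT) -> str:
    """Shorten `iri` to prefix:suffix using the longest matching namespace."""
    best = max(
        (item for item in context.items() if iri.startswith(item[1])),
        key=lambda item: len(item[1]),
        default=None,
    )
    if best is None:
        return iri  # no prefix applies; leave the IRI untouched
    prefix, namespace = best
    return f"{prefix}:{iri[len(namespace):]}"


def expand_iri(term: str, context: dict = CONTEXT) -> str:
    """Inverse of compact_iri for terms whose prefix is in the context."""
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return term  # already absolute, or prefix unknown
```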

{
  "@graph": [
    {
      "@id": "adrf:dataset_x001",
      "@type": "adrf:Dataset",
      "dct:alternative": {
        "@language": "en",
        "@value": "CSFII"
      },
      "dct:description": {
        "@language": "en",
        "@value": "1-day dietary intakes of men 19 to 50 years of age living in the United States in 1985"
      },
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.3886/ICPSR21960.v1"
      },
      "dct:publisher": {
        "@id": "adrf:usda"
      },
      "dct:title": {
        "@language": "en",
        "@value": "Continuing Survey of Food Intakes by Individuals"
      },
      "pav:createdOn": {
        "@type": "xsd:date",
        "@value": "2009-01-27"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/21960/version/1"
      }
    },
    {
      "@id": "adrf:tiffany_l_gary",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Tiffany L Gary"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://orcid.org/0000-0001-9843-1084"
      }
    },
    {
      "@id": "adrf:Catalog",
      "@type": "dcat:Catalog",
      "dct:language": {
        "@id": "iso639:en"
      },
      "dct:title": "ADRF Data Catalog",
      "rdfs:label": "ADRF Data Catalog",
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      },
      "foaf:homepage": {
        "@id": "https://coleridgeinitiative.org/"
      },
      "adrf:dataset": [
        {
          "@id": "adrf:dataset481"
        },
        {
          "@id": "adrf:dataset_x001"
        }
      ]
    },
    {
      "@id": "adrf:publication340",
      "@type": "adrf:ResearchPublication",
      "prism:publicationDate": {
        "@type": "xsd:date",
        "@value": "2009-11-11"
      },
      "dct:creator": [
        {
          "@id": "adrf:virginia_w_chang"
        },
        {
          "@id": "adrf:dawn_e_alley"
        }
      ],
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.1093/gerona/glp177"
      },
      "dct:language": "en",
      "dct:publisher": {
        "@language": "en",
        "@value": "The Journals of Gerontology: Series A"
      },
      "dct:subject": [
        {
          "@language": "en",
          "@value": "Lipids"
        },
        {
          "@language": "en",
          "@value": "Body mass index"
        },
        {
          "@language": "en",
          "@value": "Weight history"
        },
        {
          "@language": "en",
          "@value": "Metabolic syndrome"
        }
      ],
      "dct:title": {
        "@language": "en",
        "@value": "Metabolic syndrome and weight gain in adulthood"
      },
      "cito:citesAsDataSource": {
        "@id": "adrf:dataset481"
      }
    },
    {
      "@id": "adrf:dawn_e_alley",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Dawn E Alley"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=14041087200"
      }
    },
    {
      "@id": "adrf:dataset481",
      "@type": "adrf:Dataset",
      "dct:alternative": {
        "@language": "en",
        "@value": "NHANES"
      },
      "dct:description": {
        "@language": "en",
        "@value": "A program of studies designed to assess the health and nutritional status of adults and children in the United States"
      },
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.3886/ICPSR25501.v4"
      },
      "dct:publisher": {
        "@id": "adrf:national_center_for_health_statistics"
      },
      "dct:title": {
        "@language": "en",
        "@value": "National Health and Nutrition Examination Survey"
      },
      "pav:createdOn": {
        "@type": "xsd:date",
        "@value": "2012-02-22"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/NACDA/series/39"
      }
    },
    {
      "@id": "adrf:national_center_for_health_statistics",
      "@type": "adrf:DatasetProvider",
      "foaf:name": {
        "@language": "en",
        "@value": "National Center for Health Statistics"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.cdc.gov/nchs/index.htm"
      }
    },
    {
      "@id": "adrf:virginia_w_chang",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Virginia W Chang"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.semanticscholar.org/author/Virginia-W-Chang/40382448"
      }
    },
    {
      "@id": "adrf:publication",
      "@type": [
        "owl:ObjectProperty",
        "rdf:Property"
      ],
      "rdfs:domain": {
        "@id": "adrf:Corpus"
      },
      "rdfs:label": {
        "@language": "en",
        "@value": "publication"
      },
      "rdfs:range": {
        "@id": "adrf:ResearchPublication"
      },
      "rdfs:subPropertyOf": {
        "@id": "dct:hasPart"
      }
    },
    {
      "@id": "adrf:Researcher",
      "@type": "foaf:Person",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh85089630"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Author"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "An author of a research publication that uses datasets for research"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "author"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:youfa_wang",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Youfa Wang"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://scholar.google.com/citations?user=cHpphu0AAAAJ&hl=en&oi=ao"
      }
    },
    {
      "@id": "adrf:DatasetProvider",
      "@type": "foaf:Organization",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh85066157"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Data_publishing"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "An organizaiton that publishes and curates research datasets"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "dataset provider"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:publication338",
      "@type": "adrf:ResearchPublication",
      "prism:publicationDate": {
        "@type": "xsd:date",
        "@value": "2010-03-01"
      },
      "dct:creator": [
        {
          "@id": "adrf:robert_lawrence"
        },
        {
          "@id": "adrf:youfa_wang"
        },
        {
          "@id": "adrf:may_a_beydoun"
        },
        {
          "@id": "adrf:benjamin_caballero"
        },
        {
          "@id": "adrf:tiffany_l_gary"
        }
      ],
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.1017/S1368980010000224"
      },
      "dct:language": "en",
      "dct:publisher": {
        "@language": "en",
        "@value": "Public Health Nutrition"
      },
      "dct:subject": [
        {
          "@language": "en",
          "@value": "Diet"
        },
        {
          "@language": "en",
          "@value": "Trend"
        },
        {
          "@language": "en",
          "@value": "Food intake"
        },
        {
          "@language": "en",
          "@value": "Meat consumption"
        },
        {
          "@language": "en",
          "@value": "United States"
        }
      ],
      "dct:title": {
        "@language": "en",
        "@value": "Trends and correlates in meat consumption patterns in the US adult population"
      },
      "cito:citesAsDataSource": [
        {
          "@id": "adrf:dataset481"
        },
        {
          "@id": "adrf:dataset_x001"
        }
      ]
    },
    {
      "@id": "adrf:Topic",
      "@type": [
        "mads:Topic",
        "mads:Authority"
      ],
      "skos:definition": {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:robert_lawrence",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Robert Lawrence"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=7201490909"
      }
    },
    {
      "@id": "adrf:may_a_beydoun",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "May A Beydoun"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.researchgate.net/profile/May_Beydoun"
      }
    },
    {
      "@id": "adrf:benjamin_caballero",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Benjamin Caballero"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://orcid.org/0000-0003-4311-6321"
      }
    },
    {
      "@id": "adrf:ResearchPublication",
      "@type": "fabio:ResearchPaper",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh2004003366"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Publication"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "A research publication that uses datasets for research"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "research publication"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:usda",
      "@type": "adrf:DatasetProvider",
      "foaf:name": {
        "@language": "en",
        "@value": "USDA"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.usda.gov/"
      }
    },
    {
      "@id": "adrf:Corpus",
      "@type": "skos:Collection",
      "dct:title": "ADRF Corpus of research publications",
      "rdfs:label": "ADRF Corpus",
      "adrf:publication": [
        {
          "@id": "adrf:publication340"
        },
        {
          "@id": "adrf:publication338"
        }
      ]
    },
    {
      "@id": "adrf:Dataset",
      "@type": "dcat:Dataset",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh2018002256"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Data_set"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "A collection of tables and metadata used within the ADRF framework, managed by a responsible party"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "dataset"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:ADRF_Ontology",
      "@type": "skos:ConceptScheme",
      "skos:definition": {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    }
  ]
}

Great :) I'll see your compacted context and raise you a default vocabulary -- now in the code that generates the JSON-LD above ^^^ https://github.com/Coleridge-Initiative/adrf-onto/commit/04c4c235954cfb0cb8a39531bf545139a0baf181
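For readers who have not met a default vocabulary before, here is a rough Python sketch of what `@vocab` changes when terms are expanded. It is not the code from the commit above and not a full JSON-LD processor: the `@vocab` IRI is a made-up placeholder, `expand_term` is a hypothetical helper, and only the `foaf` and `dct` namespace IRIs are the real, well-known ones.

```python
# Simplified term expansion, only to illustrate "@vocab".
# Assumption: the "@vocab" IRI below is a placeholder, not the real ADRF namespace.
CONTEXT = {
    "@vocab": "https://example.org/adrf#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dct": "http://purl.org/dc/terms/",
}


def expand_term(term, context=CONTEXT):
    """Expand a property or type term to a full IRI (simplified rules)."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as "@id" and "@type" stay as they are
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix  # compact IRI, e.g. "foaf:name"
    if sep:
        return term  # unknown prefix or already an absolute IRI: leave untouched
    vocab = context.get("@vocab")
    return vocab + term if vocab else term  # bare term: use the default vocabulary


print(expand_term("foaf:name"))
print(expand_term("Researcher"))
```

With a default vocabulary in place, terms from the project's own namespace can be written bare, while terms borrowed from other vocabularies keep their prefixes.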

I think having data in this form will really help these explorations. Thanks for putting this together Paco and Nick!

On Sat, Jul 20, 2019 at 12:44 PM Paco Nathan notifications@github.com wrote:

Great :) I'll see your compacted context and raise you a default vocabulary -- now in the code that generates the JSON-LD above ^^^ Coleridge-Initiative/adrf-onto@04c4c23

-- Brian E. Granger

Principal Technical Program Manager, AWS AI Platform (brgrange@amazon.com) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub

Thanks for this! I updated my prototype with this data and we can walk around it:
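For anyone who wants to walk the same data without the prototype, a minimal Python sketch follows. It is not the prototype's code. It assumes the nodes sit under a top-level `@graph` key, uses a hand-copied excerpt of the JSON-LD above, and the helper names (`index_by_id`, `follow`, `literal`) are made up for illustration.

```python
# Excerpt of the JSON-LD posted above, trimmed to three nodes.
# Assumption: the nodes live under a top-level "@graph" key.
DOC = {
    "@graph": [
        {
            "@id": "adrf:publication338",
            "@type": "adrf:ResearchPublication",
            "dct:creator": [
                {"@id": "adrf:robert_lawrence"},
                {"@id": "adrf:youfa_wang"},
            ],
        },
        {
            "@id": "adrf:robert_lawrence",
            "@type": "adrf:Researcher",
            "foaf:name": {"@language": "en", "@value": "Robert Lawrence"},
        },
        {
            "@id": "adrf:youfa_wang",
            "@type": "adrf:Researcher",
            "foaf:name": {"@language": "en", "@value": "Youfa Wang"},
        },
    ]
}


def index_by_id(doc):
    """Map each node's @id to the node so references can be resolved."""
    return {node["@id"]: node for node in doc.get("@graph", []) if "@id" in node}


def literal(value):
    """Return the string of a JSON-LD value object, or the value itself."""
    return value.get("@value") if isinstance(value, dict) else value


def follow(nodes, node_id, prop):
    """Return the nodes referenced by `prop` on `node_id`, skipping dangling refs."""
    refs = nodes[node_id].get(prop, [])
    refs = refs if isinstance(refs, list) else [refs]
    return [nodes[r["@id"]] for r in refs
            if isinstance(r, dict) and r.get("@id") in nodes]


nodes = index_by_id(DOC)
creator_names = [literal(n.get("foaf:name"))
                 for n in follow(nodes, "adrf:publication338", "dct:creator")]
print(creator_names)
```

The same `follow` call works for any property whose values are `@id` references, such as `cito:citesAsDataSource`.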

A further point I'd like to raise (I was a bit rushed doing that initial extension of @ceteri's stuff, as I was trying to get something out the door before traveling): I think it's pretty unlikely that most front-end tools will be building their own ontologies on the fly. Most of them will either be anchored in the Big N (haven't sussed out what those would be) like DC, OWL, PROV, etc., or would be a) semi-officially adopted by a Jupyter metadata service contract (e.g. a conformance suite), b) extended by an existing community of practice (e.g. NASA/ESA would probably have to collaborate on something), or c) defined on a per-application basis. At any rate, we'd probably end up with a well-known location, e.g. $prefix/share/jupyter/ontologies, for a reference, file-based implementation, but a larger-scale organization might have a schema/ontology store.
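To make the file-based idea concrete, here is a sketch of how a reference implementation might look things up. Nothing here exists in Jupyter today: the directory layout is just the one suggested above, `find_ontologies` is a hypothetical helper, and the accepted file suffixes are an assumption.

```python
import sys
import tempfile
from pathlib import Path


def find_ontologies(prefix=None, suffixes=(".jsonld", ".ttl")):
    """List ontology files under <prefix>/share/jupyter/ontologies.

    Assumptions: the directory is the one proposed in this thread, and only
    files with the given suffixes count as ontologies.
    """
    base = Path(prefix) if prefix is not None else Path(sys.prefix)
    root = base / "share" / "jupyter" / "ontologies"
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix in suffixes)


# Usage: build a throwaway prefix so the example does not depend on a real install.
with tempfile.TemporaryDirectory() as tmp:
    onto_dir = Path(tmp) / "share" / "jupyter" / "ontologies"
    onto_dir.mkdir(parents=True)
    (onto_dir / "adrf.jsonld").write_text("{}")
    (onto_dir / "README.txt").write_text("not an ontology")
    found = [p.name for p in find_ontologies(tmp)]

print(found)
```

A larger organization could swap this lookup for a call to its schema/ontology store without changing the callers.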

Anyhow, keep in mind that ontologies frequently just describe what Can Be Known, not what Must Be Known. For that, we would likely have to consider a further constraint language, e.g. SPIN, SWRL or the present new hotness, SHACL.
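As a rough illustration of the "Must Be Known" side, here is a hand-rolled Python check for required properties. To be clear, this is not SHACL, SPIN, or SWRL, only a stand-in for the kind of rule such a language would express. The `REQUIRED` table is an assumption made up for this example, and the second node is hypothetical.

```python
# Which properties a node of a given type must carry.
# Assumption: these rules are invented for illustration, not taken from the ADRF ontology.
REQUIRED = {
    "adrf:Researcher": ["foaf:name", "foaf:page"],
    "adrf:ResearchPublication": ["dct:title", "dct:creator"],
}


def types_of(node):
    """Return the node's @type as a list (JSON-LD allows a string or a list)."""
    types = node.get("@type", [])
    return types if isinstance(types, list) else [types]


def missing_properties(node, required=REQUIRED):
    """List the required properties that the node does not have."""
    missing = []
    for node_type in types_of(node):
        for prop in required.get(node_type, []):
            if prop not in node and prop not in missing:
                missing.append(prop)
    return missing


complete = {
    "@id": "adrf:youfa_wang",
    "@type": "adrf:Researcher",
    "foaf:name": {"@language": "en", "@value": "Youfa Wang"},
    "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://scholar.google.com/citations?user=cHpphu0AAAAJ&hl=en&oi=ao",
    },
}
# Hypothetical node, not in the data above, missing its foaf:page.
incomplete = {
    "@id": "adrf:someone",
    "@type": "adrf:Researcher",
    "foaf:name": {"@language": "en", "@value": "Someone"},
}

print(missing_properties(complete))
print(missing_properties(incomplete))
```

The ontology alone accepts both nodes; only the extra constraint layer flags the second one.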

What if we could know everything? If we start to know everything, a complex graph forms that may be difficult to manage. Each schema describes facets of the problem: dcat is designed for interoperability between open datasets, and void is a schema for describing how to reuse data. It will be necessary to apply MANY schemas; dcat is a good choice because it references several Big N schemas.

Users will traverse small cycles in a larger graph of knowledge. If we knew the schemas ahead of time, could we assist authors in providing better markup?

Ultimately, any semantic information is opaque to the compute. How is an author incentivized to include more knowledge in their computational documents?

This notebook treats many schemas as data in a dataframe.

The dataframe represents a graph with over 10000 edges including some information in multiple languages.
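To give a feel for the "graph as rows" idea, here is a small Python sketch that flattens JSON-LD nodes into edge tuples. It is not the notebook's code. It assumes a top-level `@graph` key, `iter_edges` is a hypothetical helper, and the two nodes are copied from the JSON-LD posted earlier in this thread.

```python
# Two nodes copied from the JSON-LD above.
# Assumption: the nodes live under a top-level "@graph" key.
DOC = {
    "@graph": [
        {
            "@id": "adrf:Corpus",
            "@type": "skos:Collection",
            "dct:title": "ADRF Corpus of research publications",
            "rdfs:label": "ADRF Corpus",
            "adrf:publication": [
                {"@id": "adrf:publication340"},
                {"@id": "adrf:publication338"},
            ],
        },
        {
            "@id": "adrf:usda",
            "@type": "adrf:DatasetProvider",
            "foaf:name": {"@language": "en", "@value": "USDA"},
            "foaf:page": {"@type": "xsd:anyURI", "@value": "https://www.usda.gov/"},
        },
    ]
}


def iter_edges(doc):
    """Yield (subject, predicate, object, language) tuples, one per edge."""
    for node in doc.get("@graph", []):
        subject = node.get("@id")
        if subject is None:
            continue  # blank nodes are skipped in this simplified sketch
        for predicate, value in node.items():
            if predicate == "@id":
                continue
            for item in (value if isinstance(value, list) else [value]):
                if isinstance(item, dict) and "@id" in item:
                    yield (subject, predicate, item["@id"], None)  # link to a node
                elif isinstance(item, dict) and "@value" in item:
                    yield (subject, predicate, item["@value"], item.get("@language"))
                else:
                    yield (subject, predicate, item, None)  # bare string, e.g. @type


edges = list(iter_edges(DOC))
for edge in edges:
    print(edge)
```

A list like `edges` can be handed to `pandas.DataFrame(edges, columns=["subject", "predicate", "object", "language"])` to get one row per edge, with the language column carrying the multilingual labels.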

With this information, how could we help scientists?

I am going to close this for now, since we seem to have settled on displaying arbitrary JSON-LD.
