jupyterlab / jupyterlab-metadata-service

Linked data exploration in JupyterLab.
BSD 3-Clause "New" or "Revised" License

Reflection on current project state, and a proposal for a metaschema #23

Closed acu192 closed 4 years ago

acu192 commented 5 years ago

(EDIT March 8, 2019) Turns out what I proposed below is more-or-less RDF (glad to find out that people have already solved these sorts of problems for us!). We're now working on a proposal to offer an RDF/JSON-LD extendable schema served through GraphQL for JupyterLab. I'll update this post once we finish thinking through our next proposal.

Reflection on current project state

A main goal of this project is to expose the relevant parts of schema.org's schema as a GraphQL service (see https://github.com/jupyterlab/jupyterlab-metadata-service/issues/4). This is to facilitate the storage and retrieval of various "rich context" data within JupyterLab.

As of last week, we have a minimal working prototype of this, which contains a GraphQL server and an interface for querying the metadata of a dataset (i.e. a concrete implementation of this small part of schema.org). We also use this same GraphQL server to store users' comments on files in JupyterLab; more on that later.

The goal of that minimal working prototype was twofold: (1) to have something working to demo to the stakeholders of this project, and (2) to explore all relevant technologies by building a "vertical slice" of the software stack.

Now that we've built this vertical slice, I have formed some opinions on how we should change our approach. I hope this post can start a discussion!

Proposal for a metaschema

So, we already use two schemas in our minimal prototype:

Two points to this:

  1. We already have two schemas! Since our goal is to give JupyterLab a "rich context", shared, extendable GraphQL service, it is reasonable to expect there will be more schemas needed in the future. Also, we have to consider that extensions may want to inject their own schema into this service. How might they do that?
  2. Schema.org is huge. We probably don't want to explicitly write out the entire schema (that's a lot of code--although it could surely be auto-generated), even though that is how our first approach began (see our Dataset definition here). Also, we are not following Schema.org exactly, e.g. every property's value should be an array according to schema.org, which we do not currently allow. Also, we would need to do some crazy unioning to precisely follow how schema.org allows properties to have "one or more types as its domain" -- it would be a mess. (For detail, see schema.org's data model.) One more note: It is expected that for any given object, most of its properties will be unused -- thus again, it will be tedious to pass around concrete JS objects with all those fields defined yet mostly unused.

So, that was a description of the problems we've realized. To overcome those problems... I have some ideas below. Most of the ideas below were inspired by @saulshanabrook in one of our meetings. I've merely tried to articulate them further. (Saul, is below more-or-less what you were thinking?)

I propose we come up with a "metaschema", i.e. a schema that describes schemas. Another way to see this is that we would not implement schema.org as a "hard-coded schema"; instead it would be represented as data in the shape of a metaschema. Yet another way to say it: if you wanted to begin supporting a new part of schema.org (say, the FlightReservation type), you would do so by inserting data into the database to document the name of your new type ("FlightReservation") and to list the properties it may have. This idea of a metaschema also makes it simple to support multiple schemas together. In general I believe the metaschema solves all the issues mentioned above:
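As a minimal sketch of what "schema as data" could look like (Python, standard library only): the `Metaschema` class, its method names, and the simplified FlightReservation hierarchy below are illustrative assumptions, not code from the prototype.

```python
class Metaschema:
    """Hypothetical metaschema: types and their properties are stored as data."""

    def __init__(self):
        # type name -> {"parent": name or None, "properties": set of names}
        self._types = {}

    def add_type(self, name, properties, parent=None):
        if parent is not None and parent not in self._types:
            raise KeyError(f"unknown parent type: {parent}")
        self._types[name] = {"parent": parent, "properties": set(properties)}

    def properties_of(self, name):
        """All properties a type may have, including inherited ones."""
        props = set()
        while name is not None:
            entry = self._types[name]
            props |= entry["properties"]
            name = entry["parent"]
        return props

    def unknown_properties(self, type_name, obj):
        """Property names used by obj that the type does not declare."""
        allowed = self.properties_of(type_name)
        return sorted(key for key in obj if key not in allowed)


# Supporting a new type means inserting data, not writing new code.
# (Hierarchy simplified for illustration; only a few properties are listed.)
schema = Metaschema()
schema.add_type("Thing", ["name", "url"])
schema.add_type("Reservation", ["reservationId"], parent="Thing")
schema.add_type("FlightReservation", ["boardingGroup"], parent="Reservation")

# Objects only carry the properties they actually use.
reservation = {"name": "Trip to Austin", "boardingGroup": "B"}
```

Here `schema.unknown_properties("FlightReservation", reservation)` comes back empty, while an object carrying an undeclared key such as `favorite_color` would have that key reported.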

Well, at this point I hope I've given a lot to either agree or disagree with! Thoughts from the group? (Specifically @saulshanabrook @xmnlab @ellisonbg @bollwyvl)

ellisonbg commented 5 years ago

I do think it makes sense to explore this. A couple of points:

saulshanabrook commented 5 years ago

JSON-LD seems to be the recommended (Google, origins) standard. It can encompass Schema.org. The latest version, 1.1, is in draft and has some interactive examples on the spec.

We could store a denormalized version of JSON-LD in the backend as well as a denormalized version of context/schema so we know what fields are valid without having to send an HTTP request for the schema.
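For illustration, a rough Python sketch of checking fields against a locally cached context, with no HTTP request involved. The `CACHED_CONTEXT` mapping and the `expand_keys` helper are hypothetical, and this handles only flat term-to-IRI mappings, not the full JSON-LD expansion algorithm.

```python
# Hypothetical local copy of the slice of the schema.org context we care about.
CACHED_CONTEXT = {
    "name": "http://schema.org/name",
    "jobTitle": "http://schema.org/jobTitle",
    "telephone": "http://schema.org/telephone",
    "url": "http://schema.org/url",
}


def expand_keys(doc, context):
    """Map compact property names to IRIs; collect names the context lacks."""
    expanded, unknown = {}, []
    for key, value in doc.items():
        if key.startswith("@"):
            expanded[key] = value  # JSON-LD keywords pass through unchanged
        elif key in context:
            expanded[context[key]] = value
        else:
            unknown.append(key)
    return expanded, unknown


# "shoeSize" is deliberately not in the cached context.
person = {"@id": "http://www.janedoe.com", "name": "Jane Doe", "shoeSize": 9}
expanded, unknown = expand_keys(person, CACHED_CONTEXT)
```

The `unknown` list is what a backend could use to reject or flag invalid fields before storing anything.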

bollwyvl commented 5 years ago

Too much fun to fully write on my phone!

Yeah, by adopting these two contexts, we've already set up a lot of work if it can't be handled in a mostly automated fashion. Automation, akin to how schema-dts or pythreejs work, is the right path. Using the highest fidelity canonical description (they both publish jsonld of their meta-model) is probably the only way to get this done, and will implicitly handle the multiplicity and inheritance issues.

These would get you to dumb, but type-checkable, classes in many Jupyter languages. But they would be derived from the canonical definition. Combining the JSON-LD contexts from SDO/WADM one could derive a canonical serialization format, and only have to explicitly handle conflicts like Person, Organization and Dataset.

So that's Read and Print... what about Execute? Resolving these types is another task, and should be pretty decoupled from the contexts we implement as a concrete schema.

It would be folly to ignore actual graph implementations that expose a "real" graph query language.

For example, a schema provider backed by a full-on graph store could just build sparql/gremlin... rdfalchemy and sqlite would be enough for the single user experience. Bigger deployments might already be graphql-aware, like:

https://dgraph.io/ https://edgedb.com/

But would be harder to configure. At any rate, not only might you have multiple kinds of storage, you might use more than one storage/resolver on the same server for the same type. So this will take some serious thought, especially when it comes to things like pagination across multiple sources.

Both extensible schema and even extensible types/unions are important. I started on adding extensible schema (new types, which can reference/extend existing ones) on my prototype:

https://github.com/deathbeds/jupyter-graphql/pull/3/files#diff-8a9380c1249ac99297a763e1f9a4ee77

A pip-installable extension can add some things (query, mutation, subs) defined by graphene types.

Further (unpushed, for some reason) work adds an example of extending a type by adding fields, and while it's really ugly, using python type("",(),{}) magic, it does work.

The first example I tried extends notebook metadata, adding SlideShowMetaBase to the CellMetaData type. The contents plugin advertises this as another entry_point. I would rather use a union or something, but there's no multiple inheritance, so it would really only work in specific cases.

saulshanabrook commented 5 years ago

Here is a possible implementation story, inspired by a conversation with @dcharbon and looking again at the JSON LD spec.

User Stories

Users will want to click on a resource and see relevant metadata about it. They will want to be able to edit that metadata as well as click on resources in the metadata to see metadata about those.

Metadata providers want to be able to provide metadata for the user for certain resources. The metadata they provide will have different fields that likely come from some type specification for the type of object they are describing. As users edit the metadata, they need to be able to be notified of these changes so that they can update where they store the metadata.

Data model

A resource in this context is a Linked Data node, as laid out in the JSON LD spec. So it has some @id that is its IRI/URL, as well as a number of @types, and other attributes.

So as a Metadata Provider, you have to define a way to query yourself to see if you have metadata about a resource, and if you do, to return that in the expanded JSON LD syntax. You also have to define a way to update yourself with an updated version of the metadata.

The metadata explorer will see what the active resource is and query each of the providers to see if it has data about that resource. The first that does will be displayed to the user. Primitive types will be displayed without links, but types that link to other IDs will be displayed as links. All existing fields are editable, but the user cannot add new fields. This removes the need to process the type at all to understand what all the valid fields for it could be. The ability to add new fields from the UI could be added at a later date. If a user edits a field, the provider that had that metadata gets notified with the updated object.
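The lookup-and-edit flow described above could be sketched roughly as follows (Python). The `DictProvider` class, the `lookup` and `edit_field` helpers, and the example file URL are hypothetical names for illustration; a real provider would talk to its own storage.

```python
class DictProvider:
    """Hypothetical in-memory metadata provider keyed by resource @id."""

    def __init__(self, nodes):
        self._nodes = {node["@id"]: node for node in nodes}

    def get(self, resource_id):
        return self._nodes.get(resource_id)  # None when nothing is known

    def update(self, node):
        self._nodes[node["@id"]] = node


def lookup(resource_id, providers):
    """Return (provider, node) from the first provider that knows the resource."""
    for provider in providers:
        node = provider.get(resource_id)
        if node is not None:
            return provider, node
    return None, None


def edit_field(resource_id, prop, value, providers):
    """Edit an existing field and notify the provider that supplied the node."""
    provider, node = lookup(resource_id, providers)
    if node is None:
        raise KeyError(resource_id)
    if prop not in node:
        raise KeyError(f"adding new fields is not supported: {prop}")
    updated = {**node, prop: value}
    provider.update(updated)  # only the owning provider hears about the edit
    return updated


RESOURCE = "file:///data/epese.csv"  # made-up resource id
NAME = "http://schema.org/name"
local = DictProvider([])
remote = DictProvider([{"@id": RESOURCE, NAME: [{"@value": "EPESE"}]}])
providers = [local, remote]

provider, node = lookup(RESOURCE, providers)
updated = edit_field(RESOURCE, NAME, [{"@value": "EPESE (1993)"}], providers)
```

Because new fields are rejected, the explorer never needs to consult the type definition to decide what is valid.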


This implementation allows us to integrate our existing graphql backend, but the core APIs would not depend on it and allow users to define other backends however they want to provide and persist metadata.

The major technical hurdles I see here are creating proper editable UIs given arbitrary JSON LD nodes and communicating the proper structure of the nodes that the data provider should return.

ellisonbg commented 5 years ago

I like this idea - this is really the type of problem that JSON LD was invented to solve. Questions and thoughts:

saulshanabrook commented 5 years ago

> Working with JSON LD can be a bit painful. I am a bit hesitant to force this on all JLab extensions wanting to work with metadata. Any libraries to make this less painful?

I have seen the schema-dts library which lets you generate TypeScript types for different Schema.org types. We are pushing the boundaries here, so we would probably end up having to create any tools we need. It doesn't seem that hard to create JSON LD, like this example:

{
  "@context": "http://schema.org/",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

There is also the standard jsonld.js library which lets you translate between different forms of JSON LD.

> I am not quite clear on how you are envisioning the backend(s) for this working. Can you talk more about that? I would be hesitant to have multiple different metadata backends.

Sure. This is the interface I am thinking about:

type LinkedData = {
    '@id': string,
    [prop: string] : any
}

interface IMetadataProvider {
    // Maybe this method is not required
    listResources(): Promise<Array<URL>>;

    getResource(resource: URL): Promise<LinkedData>;
    updateResource(data: LinkedData): Promise<void>;
}

Multiple providers would be useful if you already have your metadata stored somewhere, and don't wanna replicate it into a local GraphQL database. Instead, you can access your existing store however you like client side, as long as you can query it about resources. It also provides an abstraction layer over graphql, so if we wanna move metadata storage into the real time data store, we can do this by implementing a new provider, without changing the metadata extension.
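To illustrate the "existing store" case, here is a loose Python analogue of that interface, using asyncio in place of Promises. `FileRecordProvider` and the record layout are invented for this sketch; the point is only that an adapter can translate whatever you already have into linked data.

```python
import asyncio


class FileRecordProvider:
    """Hypothetical adapter exposing an existing record store as linked data."""

    def __init__(self, records):
        # records: path -> plain dict, standing in for a store you already have
        self._records = records

    async def list_resources(self):
        return [f"file://{path}" for path in sorted(self._records)]

    async def get_resource(self, resource):
        path = resource[len("file://"):]
        record = self._records[path]
        # Translate the store's own field names into a linked data node.
        return {"@id": resource, "http://schema.org/name": record["title"]}

    async def update_resource(self, data):
        path = data["@id"][len("file://"):]
        self._records[path]["title"] = data["http://schema.org/name"]


async def demo():
    provider = FileRecordProvider({"/data/epese.csv": {"title": "EPESE"}})
    (resource,) = await provider.list_resources()
    node = await provider.get_resource(resource)
    node["http://schema.org/name"] = "EPESE (renamed)"
    await provider.update_resource(node)
    return resource, await provider.get_resource(resource)


resource, node = asyncio.run(demo())
```

The metadata extension would only ever see the three methods, so swapping this adapter for a GraphQL-backed provider should not require changes on the UI side.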

xmnlab commented 5 years ago

there is an implementation of graphql server using json-ld concept: https://www.hypergraphql.org/ ... not sure yet if it could be helpful.

xmnlab commented 5 years ago

About the GraphQL layer: one thing we need to keep in mind is that it is strongly typed, so working with generic structures doesn't work very well. That is why I am investigating graphql-schema-org.

acu192 commented 5 years ago

As @xmnlab mentions, with GraphQL's typed schema, you can't extend the schema at run-time. I.e. If we wanted RDF's notion of "say anything about anything", then we'd need to look elsewhere for a solution. (Right?)

So, maybe we should step back and say: "How extendable do we actually want this metadata service to be?"

The way I see it we have a few options:

  1. Extendable only by modifying the code. I.e. "You want to extend it? Send in a PR to this repo."
  2. Extendable by running pip install my_super_jupyter_metadata_schema and restarting JupyterLab. This is analogous to the PR @bollwyvl posted above (albeit he is using Graphene instead of Apollo). Within this option are two sub-options: (1) the ability to extend only by adding a new top-level schema, and/or (2) the ability to extend any existing type or add types to an existing schema.
  3. Extendable at run-time. I.e. Pure RDF (as I understand RDF...). The idea here is at runtime, as a user, you could say "You know, my files really need a favorite color." So you go edit each file's metadata to add a property named favorite_color, and that persists and is visible to everyone else as well.

Option 1 is obviously not what we want.

Option 2 is interesting... if we choose this option, there are of course many more questions to answer, but GraphQL could do this as @bollwyvl has already shown (via Graphene).

Option 3 is also interesting. It takes more of the semantic web mindset. @dcharbon has thoughts on how to go about this, which he has partially shared with us.
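To make option 3 concrete, here is a small sketch of "say anything about anything" storage using only Python's built-in sqlite3 module. The table layout, helper names, and example notebook URL are assumptions for illustration, not a proposal for the actual backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would make it persistent
conn.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT, "
    "UNIQUE (subject, predicate, object))"
)


def say(subject, predicate, obj):
    """Record any statement; no schema has to know the predicate in advance."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
            (subject, predicate, obj),
        )


def describe(subject):
    """Return everything said about a subject as predicate -> list of objects."""
    rows = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ? "
        "ORDER BY predicate, object",
        (subject,),
    )
    result = {}
    for predicate, obj in rows:
        result.setdefault(predicate, []).append(obj)
    return result


NOTEBOOK = "file:///notebooks/analysis.ipynb"  # made-up resource id
say(NOTEBOOK, "favorite_color", "teal")
say(NOTEBOOK, "favorite_color", "teal")  # duplicate statement is ignored
say(NOTEBOOK, "reviewed_by", "acu192")
```

Nothing had to be declared before `favorite_color` was used, which is exactly what a typed GraphQL schema would not allow without a restart.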

@ellisonbg Which option above is most in line with your thoughts?

saulshanabrook commented 5 years ago

> Extendable at run-time. I.e. Pure RDF (as I understand RDF...). The idea here is at runtime, as a user, you could say "You know, my files really need a favorite color." So you go edit each file's metadata to add a property named favorite_color, and that persists and is visible to everyone else as well.

I wrote up some notes explaining this idea more, to articulate how we might support arbitrary types of schemas. I think for now we are going to work on getting the current approach working with editing, and then eventually try to create a JupyterLab extension API for this kind of system:

Goals:

JupyterLab Metadata Extension:

JupyterLab Metadata API:

JupyterLab Metadata Server:

ceteri commented 5 years ago

Here's an example (trivial example, but it shows the point) of how controlled vocabularies are referenced in use cases, i.e., in an example integration of JupyterLab Metadata Service: https://github.com/Coleridge-Initiative/adrf-onto/blob/master/adrf.ttl Note that almost always there are multiple vocabularies being both blended and extended.

saulshanabrook commented 5 years ago

Thank you, that's very helpful to see. I haven't used Turtle at all before now. It would also be helpful to see how that matches up with a particular instance of some data at some point.

ceteri commented 5 years ago

We're building out examples from the ADRF framework -- the NYU project which will use these data registry and metadata service features in Jupyter -- that use Turtle and JSON-LD interchangeably, depending on "what" is reading the file. Will share those with the project here.

From an AI practitioner standpoint, I would expect my peers to use Turtle in human-curated definitions.

Also, the wiki in that ADRF repo above links to more details and resources about Turtle, JSON-LD, other vocabularies, etc.

ceteri commented 5 years ago

Here's an example of a formal metadata description for a dataset, based on training data used in the Rich Context Competition:

:dataset381
  rdf:type dctypes:Dataset;
  dct:title "Established Populations for Epidemiologic Studies of the Elderly Project"@en;
  dct:alternative "EPESE"@en;
  dct:description "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"@en;
  pav:createdOn "1993-02-01"^^xsd:date;
  dct:identifier "8481423"@en;
  foaf:page <https://www.ncbi.nlm.nih.gov/pubmed/8481423>;
  dct:publisher :duke_univ;
  pav:curatedBy :cornoni-huntley_j;
  .

That resolves (again, with ~7 lines of Py) into a graph:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ns1: <http://purl.org/pav/> .
@prefix ns2: <http://xmlns.com/foaf/0.1/> .
@prefix ns3: <http://www.loc.gov/mads/rdf/v1#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381> a dcmitype:Dataset ;
    dct:alternative "EPESE"@en ;
    dct:description "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"@en ;
    dct:identifier "8481423"@en ;
    dct:publisher <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ> ;
    dct:title "Established Populations for Epidemiologic Studies of the Elderly Project"@en ;
    ns1:createdOn "1993-02-01"^^xsd:date ;
    ns1:curatedBy <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j> ;
    ns2:page <https://www.ncbi.nlm.nih.gov/pubmed/8481423> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset> a dcmitype:Dataset,
        skos:Concept ;
    ns3:hasRelatedAuthority <http://id.loc.gov/authorities/subjects/sh2018002256> ;
    = <http://dbpedia.org/resource/Data_set> ;
    skos:definition "A collection of tables and metadata, managed by a responsible party"@en ;
    skos:inScheme <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> ;
    skos:prefLabel "dataset"@en ;
    skos:topConceptOf <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic> a ns3:Authority,
        ns3:Topic ;
    skos:definition "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"@en ;
    skos:inScheme <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> ;
    skos:topConceptOf <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> .

<https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology> a skos:ConceptScheme ;
    skos:definition "A mid-level ontology for linked data within the ADRF framework use cases"@en ;
    skos:hasTopConcept <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset>,
        <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic> ;
    skos:prefLabel "ADRF Ontology"@en .

This is a small example -- only a single entity represented -- although it shows how a knowledge graph looks.

The following presentation describes how to handle metadata for linked data, data catalogs in practice, leading into knowledge graph work for reproducible science: https://www.slideshare.net/tplasterer/dataset-catalogs-as-a-foundation-for-fair-data

One of the better online specs/tutorials for how to handle this kind of metadata markup is at: https://www.w3.org/TR/hcls-dataset/#appendix_1

Our WIP code example for NYU and knowledge graph in social science research across US fed/state/local agencies is at https://github.com/Coleridge-Initiative/adrf-onto/

What's shown above is an example of how metadata about linked data for social science also applies in life sciences, etc. It doesn't illustrate the curation/data stewardship links (next on my TODO list). Even so, note that this level of governance will be applied to datasets and publications throughout the sciences, and given the push for compliance, data privacy, provenance, audits, etc., similar kinds of graph-based data governance are showing up for finance, healthcare, manufacturing, etc. I have a hunch that, in talking with Capital One, Two Sigma, or Bloomberg, their use of metadata about datasets would look similar to this. Ongoing regulatory compliance efforts will push that point even further. That's why I'm urging Jupyter to consider more of this approach for the Metadata Service.

ceteri commented 5 years ago

Then, 2 more lines of Py:

from rdflib.serializer import Serializer
print(graph.serialize(format="json-ld", indent=2))

Can transform that graph into JSON-LD, so that it's easily read by machines (without any special parsing beyond JSON) and also readily exchangeable via APIs:

[
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#Concept",
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2018002256"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_set"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A collection of tables and metadata, managed by a responsible party"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic",
    "@type": [
      "http://www.loc.gov/mads/rdf/v1#Authority",
      "http://www.loc.gov/mads/rdf/v1#Topic"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": [
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "EPESE"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@language": "en",
        "@value": "8481423"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "1993-02-01"
      }
    ],
    "http://purl.org/pav/curatedBy": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#ConceptScheme"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    ]
  }
]

Note that I've pretty-printed here to help visualize it, though this JSON-LD would compress during API calls.
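As a quick check of the "no special parsing beyond JSON" point, this sketch reads an excerpt of the output above with Python's standard json module. The `literal` helper is a hypothetical convenience, not part of rdflib.

```python
import json

# Excerpt of the expanded JSON-LD shown above (one node, two properties).
DOC = """
[
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": ["http://purl.org/dc/dcmitype/Dataset"],
    "http://purl.org/dc/terms/title": [
      {"@language": "en",
       "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"}
    ],
    "http://purl.org/pav/createdOn": [
      {"@type": "http://www.w3.org/2001/XMLSchema#date", "@value": "1993-02-01"}
    ]
  }
]
"""

DCT = "http://purl.org/dc/terms/"

# Index nodes by IRI so links between entities become dictionary lookups.
nodes = {node["@id"]: node for node in json.loads(DOC)}


def literal(node, prop, language=None):
    """First literal value of prop, optionally restricted to one language."""
    for value in node.get(prop, []):
        if "@value" in value and language in (None, value.get("@language")):
            return value["@value"]
    return None


dataset = nodes[
    "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381"
]
title = literal(dataset, DCT + "title", language="en")
created = literal(dataset, "http://purl.org/pav/createdOn")
```

Every property value is a list of objects in this form, which is verbose but means a client never has to guess whether a field is single- or multi-valued.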

saulshanabrook commented 5 years ago

@ceteri Thanks for these examples, this is really helpful.

So I could see the API as taking the ID of an entity and returning the flattened JSON-LD for that entity. Then we could generate a UI around these field mappings.

console.log(myDataProvider.get('https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381'))
{
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": [
      "http://purl.org/dc/dcmitype/Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "EPESE"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A project initiated by the intramural Epidemiology, Demography and Biometry Program of the National Institute on Aging"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@language": "en",
        "@value": "8481423"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#duke_univ"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Established Populations for Epidemiologic Studies of the Elderly Project"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "1993-02-01"
      }
    ],
    "http://purl.org/pav/curatedBy": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#cornoni-huntley_j"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"
      }
    ]
  }

ceteri commented 5 years ago

Excellent, for example that could be rendered as nested tables in a reasonably compact way. A user could click to follow the links of IRI dereferencing to understand what the type definitions mean. In other words, click through to get "http://purl.org/dc/terms/title" which any browser can do simply.
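A rough sketch of that rendering step in Python: split an expanded node into table rows, marking which cells should be clickable. The `to_rows` helper and the plain-text output are illustrative assumptions; the real UI would be a JupyterLab widget.

```python
def to_rows(node):
    """Flatten one expanded JSON-LD node into (property, value, is_link) rows."""
    rows = []
    for prop, values in node.items():
        if prop == "@id":
            continue  # the node's own IRI is the table heading, not a row
        if prop == "@type":
            rows.extend(("@type", type_iri, True) for type_iri in values)
            continue
        for value in values:
            if "@id" in value:
                rows.append((prop, value["@id"], True))  # reference: render as link
            else:
                rows.append((prop, value["@value"], False))  # literal: plain text
    return rows


node = {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset381",
    "@type": ["http://purl.org/dc/dcmitype/Dataset"],
    "http://purl.org/dc/terms/alternative": [{"@language": "en", "@value": "EPESE"}],
    "http://xmlns.com/foaf/0.1/page": [
        {"@id": "https://www.ncbi.nlm.nih.gov/pubmed/8481423"}
    ],
}

# Plain-text stand-in for the table; links are shown in angle brackets.
for prop, value, is_link in to_rows(node):
    print(f"{prop:40} {'<' + value + '>' if is_link else value}")
```

The property IRIs in the first column could be made clickable in the same way, since they dereference to their definitions.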

ceteri commented 5 years ago

This would be one way to view a graph of metadata used to answer the questions that @ellisonbg enumerated:

[image]

Note that each of those entities has other fields. This is a start for how I could see the metadata used in practice.

@tonyfast does that fit with what you're thinking too?

ceteri commented 5 years ago

Another general note about open standards for metadata about datasets used in science:

I'd recommend reading the FAIR Data Principles, originally described in

The FAIR acronym stands for Findable, Accessible, Interoperable, Reusable -- as guidelines for data infrastructure to support reproducible research.

While that article does not specifically mention Jupyter, reading between the lines there's so much overlap between that widely accepted set of FAIR practices and intentions for the JupyterLab metadata service.

tonyfast commented 5 years ago

We're probably in agreement, but there is a lot of different language being used to talk about this problem so I cannot say exactly.


I see the analysis project as a graph that can be searched. Its contents - virtual file systems - are enriched by type information & typology. Graph and structured databases will execute queries; and the query language bounds the questions we can ask. From this perspective, we can ask questions composed in the specified query language.

For example, this is a sqlite database of a couple hundred notebooks; we could ask it questions like:

A graph database could answer different questions, and a more general tool would query all types of files as data. There could be so many ways to query, "blood sample please?".

At this point though, I think I'm stuck on how did the types get there? Who is annotating information with metadata? Where is the knowledge coming from? How do we know what types are salient?


From the json-ld perspective, I think that directories should permit multiple contexts with "*.jsonld" files. These contexts establish a shared language for resources across the project. These files would include documentation IRIs, demo IRIs, web types, and python type IRIs. A simple example would have production and research contexts. Context is a feature toggle, kinda sorta.


A lot of the metadata desired in the diagram could be recorded as a stream of context information. Which I think raises the important question again, "How are the types being annotated?"

tonyfast commented 5 years ago

Based on some experiments with pyld it appears that both web annotations and metadata can be represented by a @graph in a json-ld expansion. The notebook below experiments with composing types using python annotations and rdflib objects.

https://gist.github.com/tonyfast/443c7b5b23449ef9fe7b024538ff2261

saulshanabrook commented 5 years ago

@ceteri I see you have one piece of linked data here as an example. Do you have a larger set readily available? I am putting together a little prototype for the metadata explorer and would like to use some of your real-ish data if possible.

EDIT: Ideally it would be good to have an example with multiple entities that link to each other.

ceteri commented 5 years ago

@saulshanabrook @tonyfast: last weekend at Sci Foo, @ellisonbg and I sketched out functionality for a minimum viable UI to demonstrate the Metadata Explorer.

1/ Let's assume we're starting with some file type that represents the metadata for a specific dataset. JupyterLab knows to launch this UI based on the file type -- or mime type for metadata from a URI.

For example, let's say a dataset has been registered through the Data Explorer, e.g., a Jupyter notebook knows where to look up metadata for it.

2/ pull down the metadata, render it with hyperlinks on any URI among the properties in the metadata

For example, the EPESE dataset may have a DOI identifier that links out to:

We also must follow the metadata references to papers that cite usage of the dataset:

For each author, there will be pages on ORCID, Google Scholar, ResearchGate, etc.:

A dataset will also have a Data Provider, such as:

3/ So the UI should provide means for a user to follow those links. Let people click through to the next page's metadata.

4/ Generally when we're following links from a web page, those are URIs that resolve to HTML, which the browsers render. Here we are following a secondary layer of metadata that's often embedded in the web pages -- sometimes called micro data.

Not all providers will have endpoint URIs to obtain just the metadata.

We can build scrapers/gateways for enough of them to demo the UI. Also, there aren't a large number of these kinds of sites. We can also work with the providers to get additional endpoints available -- that's current dialog among scientific publishers. I've already talked with a Google eng manager about this, and they'd be interested plus they have the eng resources to support. It's not a lot of work either.

5/ the strategy is to follow these links as much as possible. that would impl a traversal of the graph shown above in https://github.com/jupyterlab/jupyterlab-metadata-service/issues/23#issuecomment-506840526
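A minimal sketch of such a traversal (Python), assuming some `fetch` function that returns expanded JSON-LD for a URI, or None when no metadata is available. `fetch`, `links_of`, `traverse`, and the toy `GRAPH` are all hypothetical.

```python
from collections import deque


def links_of(node):
    """IRIs referenced by a node's properties (ignores keywords and literals)."""
    found = []
    for prop, values in node.items():
        if prop.startswith("@"):
            continue
        found.extend(v["@id"] for v in values if "@id" in v)
    return found


def traverse(start, fetch, limit=50):
    """Breadth-first walk over linked metadata, visiting each URI once."""
    seen, order = {start}, []
    queue = deque([start])
    while queue and len(order) < limit:
        uri = queue.popleft()
        node = fetch(uri)
        if node is None:
            continue  # provider has no metadata endpoint for this URI
        order.append(uri)
        for target in links_of(node):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# Toy in-memory graph standing in for real providers; "ex:author" is
# referenced but has no metadata, like a site without an endpoint.
GRAPH = {
    "ex:dataset": {"@id": "ex:dataset", "ex:publisher": [{"@id": "ex:provider"}],
                   "ex:citedBy": [{"@id": "ex:paper"}]},
    "ex:paper": {"@id": "ex:paper", "ex:author": [{"@id": "ex:author"}],
                 "ex:uses": [{"@id": "ex:dataset"}]},
    "ex:provider": {"@id": "ex:provider", "ex:name": [{"@value": "Provider"}]},
}

visited = traverse("ex:dataset", GRAPH.get)
```

The `seen` set keeps the dataset-to-paper-to-dataset cycle from looping, and `limit` bounds how far a single click can fan out.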

Next up, I'll develop a better sample file to use, in JSON-LD

ceteri commented 5 years ago

Footnote: one way to integrate with the scrapers/gateways would be to register a URI pattern, then when a user clicks a link with that URI pattern in the metadata explorer UI, we use the scraper/gateway instead of simply doing the HTTP GET on the URI.

For example, the HTML results on Google Dataset Search are basically a list of JSON, plus some JavaScript to render it. It's not hard to scrape that kind of response and build a metadata gateway for it. The other popular providers, such as ORCID and ResearchGate, have embedded metadata (aka "micro data") that we can scrape similarly.
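One way this registration could look, as a sketch in Python. The registry, the decorator name, and the example pattern are assumptions; the gateway body is a stub standing in for real scraping code.

```python
import re

_GATEWAYS = []  # (compiled pattern, handler) pairs, checked in registration order


def register_gateway(pattern):
    """Decorator: route URIs matching pattern to the decorated handler."""
    def decorator(handler):
        _GATEWAYS.append((re.compile(pattern), handler))
        return handler
    return decorator


def resolve(uri, default_fetch):
    """Use the first matching gateway, else fall back to a plain fetch."""
    for pattern, handler in _GATEWAYS:
        if pattern.match(uri):
            return handler(uri)
    return default_fetch(uri)


@register_gateway(r"https://orcid\.org/")
def orcid_gateway(uri):
    # Stub: a real gateway would scrape the page's embedded metadata here.
    return {"@id": uri, "via": "orcid_gateway"}


def plain_get(uri):
    # Stub for the ordinary HTTP GET path.
    return {"@id": uri, "via": "http_get"}
```

With this in place, the explorer would call `resolve(uri, plain_get)` on every click and never need to know which sites have gateways.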

saulshanabrook commented 5 years ago

This is a little demo I put together to show a simple linked data explorer https://github.com/jupyterlab/jupyterlab-metadata-service/pull/27

The core of it is a linked data provider that has a function that takes a URL and returns some linked data. We can hook this up to a server extension that serves up scraped data about certain URLs.
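
To make the provider idea concrete, here is a small Python sketch, not the code in the PR. It assumes a provider is just an object whose `get` method maps a URL to one expanded JSON-LD node, and adds a helper that lists the links an explorer could offer from that node. `InMemoryProvider`, `outgoing_links`, and the example.org URLs are all invented for illustration.

```python
from typing import Dict, List, Optional

Node = Dict[str, object]


class InMemoryProvider:
    """Hypothetical provider: maps a URL to one expanded JSON-LD node."""

    def __init__(self, nodes: List[Node]) -> None:
        self._by_id = {node["@id"]: node for node in nodes}

    def get(self, url: str) -> Optional[Node]:
        return self._by_id.get(url)


def outgoing_links(node: Node) -> List[str]:
    """Collect every `@id` referenced from the node's property values."""
    links = []
    for key, values in node.items():
        if key.startswith("@"):
            continue  # skip JSON-LD keywords such as @id and @type
        for value in values:
            if isinstance(value, dict) and "@id" in value:
                links.append(value["@id"])
    return links


NODES = [
    {
        "@id": "https://example.org/paper",
        "http://purl.org/dc/terms/creator": [{"@id": "https://example.org/author"}],
        "http://purl.org/dc/terms/title": [{"@value": "A paper"}],
    },
    {"@id": "https://example.org/author"},
]

provider = InMemoryProvider(NODES)
paper = provider.get("https://example.org/paper")
links = outgoing_links(paper)
```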

ceteri commented 5 years ago

@saulshanabrook @tonyfast @ellisonbg here's TTL for an example from the Rich Context Competition which includes 2 research papers, 2 datasets used by them, 7 authors, and the 2 data providers: https://github.com/Coleridge-Initiative/adrf-onto/blob/master/rcc.ttl

:Catalog and :Corpus are collections within the graph for datasets and research publications respectively

Here's that same metadata graph converted to JSON-LD (since GH doesn't yet support attaching JSON files?)

[
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "CSFII"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "1-day dietary intakes of men 19 to 50 years of age living in the United States in 1985"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.3886/ICPSR21960.v1"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#usda"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Continuing Survey of Food Intakes by Individuals"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2009-01-27"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/21960/version/1"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#tiffany_l_gary",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Tiffany L Gary"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://orcid.org/0000-0001-9843-1084"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Catalog",
    "@type": [
      "http://www.w3.org/ns/dcat#Catalog"
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@id": "http://id.loc.gov/vocabulary/iso639-1/en"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@value": "ADRF Data Catalog"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@value": "ADRF Data Catalog"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ],
    "http://xmlns.com/foaf/0.1/homepage": [
      {
        "@id": "https://coleridgeinitiative.org/"
      }
    ],
    "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication340",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
    ],
    "http://prismstandard.org/namespaces/basic/2.0/publicationDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2009-11-11"
      }
    ],
    "http://purl.org/dc/terms/creator": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#virginia_w_chang"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dawn_e_alley"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.1093/gerona/glp177"
      }
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@value": "en"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@language": "en",
        "@value": "The Journals of Gerontology: Series A"
      }
    ],
    "http://purl.org/dc/terms/subject": [
      {
        "@language": "en",
        "@value": "Lipids"
      },
      {
        "@language": "en",
        "@value": "Body mass index"
      },
      {
        "@language": "en",
        "@value": "Weight history"
      },
      {
        "@language": "en",
        "@value": "Metabolic syndrome"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Metabolic syndrome and weight gain in adulthood"
      }
    ],
    "http://purl.org/spar/cito/citesAsDataSource": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dawn_e_alley",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Dawn E Alley"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=14041087200"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset"
    ],
    "http://purl.org/dc/terms/alternative": [
      {
        "@language": "en",
        "@value": "NHANES"
      }
    ],
    "http://purl.org/dc/terms/description": [
      {
        "@language": "en",
        "@value": "A program of studies designed to assess the health and nutritional status of adults and children in the United States"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.3886/ICPSR25501.v4"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#national_center_for_health_statistics"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "National Health and Nutrition Examination Survey"
      }
    ],
    "http://purl.org/pav/createdOn": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2012-02-22"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/NACDA/series/39"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#national_center_for_health_statistics",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "National Center for Health Statistics"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.cdc.gov/nchs/index.htm"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#virginia_w_chang",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Virginia W Chang"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.semanticscholar.org/author/Virginia-W-Chang/40382448"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication",
    "@type": [
      "http://www.w3.org/2002/07/owl#ObjectProperty",
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
    ],
    "http://www.w3.org/2000/01/rdf-schema#domain": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Corpus"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@language": "en",
        "@value": "publication"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#range": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#subPropertyOf": [
      {
        "@id": "http://purl.org/dc/terms/hasPart"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher",
    "@type": [
      "http://xmlns.com/foaf/0.1/Person"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh85089630"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Author"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "An author of a research publication that uses datasets for research"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "author"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#youfa_wang",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Youfa Wang"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://scholar.google.com/citations?user=cHpphu0AAAAJ&hl=en&oi=ao"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider",
    "@type": [
      "http://xmlns.com/foaf/0.1/Organization"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh85066157"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_publishing"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "An organizaiton that publishes and curates research datasets"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset provider"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication338",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication"
    ],
    "http://prismstandard.org/namespaces/basic/2.0/publicationDate": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2010-03-01"
      }
    ],
    "http://purl.org/dc/terms/creator": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#robert_lawrence"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#youfa_wang"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#may_a_beydoun"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#benjamin_caballero"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#tiffany_l_gary"
      }
    ],
    "http://purl.org/dc/terms/identifier": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://doi.org/10.1017/S1368980010000224"
      }
    ],
    "http://purl.org/dc/terms/language": [
      {
        "@value": "en"
      }
    ],
    "http://purl.org/dc/terms/publisher": [
      {
        "@language": "en",
        "@value": "Public Health Nutrition"
      }
    ],
    "http://purl.org/dc/terms/subject": [
      {
        "@language": "en",
        "@value": "Diet"
      },
      {
        "@language": "en",
        "@value": "Trend"
      },
      {
        "@language": "en",
        "@value": "Food intake"
      },
      {
        "@language": "en",
        "@value": "Meat consumption"
      },
      {
        "@language": "en",
        "@value": "United States"
      }
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@language": "en",
        "@value": "Trends and correlates in meat consumption patterns in the US adult population"
      }
    ],
    "http://purl.org/spar/cito/citesAsDataSource": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset481"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#dataset_x001"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Topic",
    "@type": [
      "http://www.loc.gov/mads/rdf/v1#Topic",
      "http://www.loc.gov/mads/rdf/v1#Authority"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#robert_lawrence",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Robert Lawrence"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=7201490909"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#may_a_beydoun",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "May A Beydoun"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.researchgate.net/profile/May_Beydoun"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#benjamin_caballero",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Researcher"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "Benjamin Caballero"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://orcid.org/0000-0003-4311-6321"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ResearchPublication",
    "@type": [
      "http://purl.org/spar/fabio/ResearchPaper"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2004003366"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Publication"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A research publication that uses datasets for research"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "research publication"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#usda",
    "@type": [
      "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#DatasetProvider"
    ],
    "http://xmlns.com/foaf/0.1/name": [
      {
        "@language": "en",
        "@value": "USDA"
      }
    ],
    "http://xmlns.com/foaf/0.1/page": [
      {
        "@type": "http://www.w3.org/2001/XMLSchema#anyURI",
        "@value": "https://www.usda.gov/"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Corpus",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#Collection"
    ],
    "http://purl.org/dc/terms/title": [
      {
        "@value": "ADRF Corpus of research publications"
      }
    ],
    "http://www.w3.org/2000/01/rdf-schema#label": [
      {
        "@value": "ADRF Corpus"
      }
    ],
    "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication340"
      },
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#publication338"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#Dataset",
    "@type": [
      "http://www.w3.org/ns/dcat#Dataset"
    ],
    "http://www.loc.gov/mads/rdf/v1#hasRelatedAuthority": [
      {
        "@id": "http://id.loc.gov/authorities/subjects/sh2018002256"
      }
    ],
    "http://www.w3.org/2002/07/owl#sameAs": [
      {
        "@id": "http://dbpedia.org/resource/Data_set"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A collection of tables and metadata used within the ADRF framework, managed by a responsible party"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "dataset"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#topConceptOf": [
      {
        "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology"
      }
    ]
  },
  {
    "@id": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#ADRF_Ontology",
    "@type": [
      "http://www.w3.org/2004/02/skos/core#ConceptScheme"
    ],
    "http://www.w3.org/2004/02/skos/core#definition": [
      {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      }
    ],
    "http://www.w3.org/2004/02/skos/core#prefLabel": [
      {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    ]
  }
]
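
As an illustration of how this expanded form can be queried without any RDF library, the Python sketch below looks up the titles of the datasets a publication cites. `GRAPH` is a two-node excerpt of the data above, and the helper name `cited_dataset_titles` is made up for the example.

```python
ADRF = "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#"
DCT_TITLE = "http://purl.org/dc/terms/title"
CITES = "http://purl.org/spar/cito/citesAsDataSource"

# Two nodes excerpted from the expanded JSON-LD above.
GRAPH = [
    {
        "@id": ADRF + "publication340",
        CITES: [{"@id": ADRF + "dataset481"}],
    },
    {
        "@id": ADRF + "dataset481",
        DCT_TITLE: [
            {
                "@language": "en",
                "@value": "National Health and Nutrition Examination Survey",
            }
        ],
    },
]


def cited_dataset_titles(graph, publication_id):
    """Return the titles of every dataset the publication cites as a data source."""
    by_id = {node["@id"]: node for node in graph}
    publication = by_id[publication_id]
    titles = []
    for ref in publication.get(CITES, []):
        dataset = by_id.get(ref["@id"], {})  # tolerate dangling references
        titles.extend(value["@value"] for value in dataset.get(DCT_TITLE, []))
    return titles


titles = cited_dataset_titles(GRAPH, ADRF + "publication340")
```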
bollwyvl commented 5 years ago

And the above, but compacted with this @context:

{
  "@context": {
    "adrf": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#",
    "cito": "http://purl.org/spar/cito/",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pav": "http://purl.org/pav/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "orcid": "https://orcid.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dbpedia": "http://dbpedia.org/resource/",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "authority": "http://id.loc.gov/authorities/subjects/",
    "fabio": "http://purl.org/spar/fabio/",
    "doi": "https://doi.org/",
    "iso639": "http://id.loc.gov/vocabulary/iso639-1/"
  }
}

{
  "@graph": [
    {
      "@id": "adrf:dataset_x001",
      "@type": "adrf:Dataset",
      "dct:alternative": {
        "@language": "en",
        "@value": "CSFII"
      },
      "dct:description": {
        "@language": "en",
        "@value": "1-day dietary intakes of men 19 to 50 years of age living in the United States in 1985"
      },
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.3886/ICPSR21960.v1"
      },
      "dct:publisher": {
        "@id": "adrf:usda"
      },
      "dct:title": {
        "@language": "en",
        "@value": "Continuing Survey of Food Intakes by Individuals"
      },
      "pav:createdOn": {
        "@type": "xsd:date",
        "@value": "2009-01-27"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/21960/version/1"
      }
    },
    {
      "@id": "adrf:tiffany_l_gary",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Tiffany L Gary"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://orcid.org/0000-0001-9843-1084"
      }
    },
    {
      "@id": "adrf:Catalog",
      "@type": "dcat:Catalog",
      "dct:language": {
        "@id": "iso639:en"
      },
      "dct:title": "ADRF Data Catalog",
      "rdfs:label": "ADRF Data Catalog",
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      },
      "foaf:homepage": {
        "@id": "https://coleridgeinitiative.org/"
      },
      "adrf:dataset": [
        {
          "@id": "adrf:dataset481"
        },
        {
          "@id": "adrf:dataset_x001"
        }
      ]
    },
    {
      "@id": "adrf:publication340",
      "@type": "adrf:ResearchPublication",
      "prism:publicationDate": {
        "@type": "xsd:date",
        "@value": "2009-11-11"
      },
      "dct:creator": [
        {
          "@id": "adrf:virginia_w_chang"
        },
        {
          "@id": "adrf:dawn_e_alley"
        }
      ],
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.1093/gerona/glp177"
      },
      "dct:language": "en",
      "dct:publisher": {
        "@language": "en",
        "@value": "The Journals of Gerontology: Series A"
      },
      "dct:subject": [
        {
          "@language": "en",
          "@value": "Lipids"
        },
        {
          "@language": "en",
          "@value": "Body mass index"
        },
        {
          "@language": "en",
          "@value": "Weight history"
        },
        {
          "@language": "en",
          "@value": "Metabolic syndrome"
        }
      ],
      "dct:title": {
        "@language": "en",
        "@value": "Metabolic syndrome and weight gain in adulthood"
      },
      "cito:citesAsDataSource": {
        "@id": "adrf:dataset481"
      }
    },
    {
      "@id": "adrf:dawn_e_alley",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Dawn E Alley"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=14041087200"
      }
    },
    {
      "@id": "adrf:dataset481",
      "@type": "adrf:Dataset",
      "dct:alternative": {
        "@language": "en",
        "@value": "NHANES"
      },
      "dct:description": {
        "@language": "en",
        "@value": "A program of studies designed to assess the health and nutritional status of adults and children in the United States"
      },
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.3886/ICPSR25501.v4"
      },
      "dct:publisher": {
        "@id": "adrf:national_center_for_health_statistics"
      },
      "dct:title": {
        "@language": "en",
        "@value": "National Health and Nutrition Examination Survey"
      },
      "pav:createdOn": {
        "@type": "xsd:date",
        "@value": "2012-02-22"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.icpsr.umich.edu/icpsrweb/NACDA/series/39"
      }
    },
    {
      "@id": "adrf:national_center_for_health_statistics",
      "@type": "adrf:DatasetProvider",
      "foaf:name": {
        "@language": "en",
        "@value": "National Center for Health Statistics"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.cdc.gov/nchs/index.htm"
      }
    },
    {
      "@id": "adrf:virginia_w_chang",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Virginia W Chang"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.semanticscholar.org/author/Virginia-W-Chang/40382448"
      }
    },
    {
      "@id": "adrf:publication",
      "@type": [
        "owl:ObjectProperty",
        "rdf:Property"
      ],
      "rdfs:domain": {
        "@id": "adrf:Corpus"
      },
      "rdfs:label": {
        "@language": "en",
        "@value": "publication"
      },
      "rdfs:range": {
        "@id": "adrf:ResearchPublication"
      },
      "rdfs:subPropertyOf": {
        "@id": "dct:hasPart"
      }
    },
    {
      "@id": "adrf:Researcher",
      "@type": "foaf:Person",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh85089630"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Author"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "An author of a research publication that uses datasets for research"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "author"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:youfa_wang",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Youfa Wang"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://scholar.google.com/citations?user=cHpphu0AAAAJ&hl=en&oi=ao"
      }
    },
    {
      "@id": "adrf:DatasetProvider",
      "@type": "foaf:Organization",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh85066157"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Data_publishing"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "An organizaiton that publishes and curates research datasets"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "dataset provider"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:publication338",
      "@type": "adrf:ResearchPublication",
      "prism:publicationDate": {
        "@type": "xsd:date",
        "@value": "2010-03-01"
      },
      "dct:creator": [
        {
          "@id": "adrf:robert_lawrence"
        },
        {
          "@id": "adrf:youfa_wang"
        },
        {
          "@id": "adrf:may_a_beydoun"
        },
        {
          "@id": "adrf:benjamin_caballero"
        },
        {
          "@id": "adrf:tiffany_l_gary"
        }
      ],
      "dct:identifier": {
        "@type": "xsd:anyURI",
        "@value": "https://doi.org/10.1017/S1368980010000224"
      },
      "dct:language": "en",
      "dct:publisher": {
        "@language": "en",
        "@value": "Public Health Nutrition"
      },
      "dct:subject": [
        {
          "@language": "en",
          "@value": "Diet"
        },
        {
          "@language": "en",
          "@value": "Trend"
        },
        {
          "@language": "en",
          "@value": "Food intake"
        },
        {
          "@language": "en",
          "@value": "Meat consumption"
        },
        {
          "@language": "en",
          "@value": "United States"
        }
      ],
      "dct:title": {
        "@language": "en",
        "@value": "Trends and correlates in meat consumption patterns in the US adult population"
      },
      "cito:citesAsDataSource": [
        {
          "@id": "adrf:dataset481"
        },
        {
          "@id": "adrf:dataset_x001"
        }
      ]
    },
    {
      "@id": "adrf:Topic",
      "@type": [
        "mads:Topic",
        "mads:Authority"
      ],
      "skos:definition": {
        "@language": "en",
        "@value": "Concepts tied to LOC upper ontology http://id.loc.gov/authorities/subjects.html"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:robert_lawrence",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Robert Lawrence"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.scopus.com/authid/detail.uri?authorId=7201490909"
      }
    },
    {
      "@id": "adrf:may_a_beydoun",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "May A Beydoun"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.researchgate.net/profile/May_Beydoun"
      }
    },
    {
      "@id": "adrf:benjamin_caballero",
      "@type": "adrf:Researcher",
      "foaf:name": {
        "@language": "en",
        "@value": "Benjamin Caballero"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://orcid.org/0000-0003-4311-6321"
      }
    },
    {
      "@id": "adrf:ResearchPublication",
      "@type": "fabio:ResearchPaper",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh2004003366"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Publication"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "A research publication that uses datasets for research"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "research publication"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:usda",
      "@type": "adrf:DatasetProvider",
      "foaf:name": {
        "@language": "en",
        "@value": "USDA"
      },
      "foaf:page": {
        "@type": "xsd:anyURI",
        "@value": "https://www.usda.gov/"
      }
    },
    {
      "@id": "adrf:Corpus",
      "@type": "skos:Collection",
      "dct:title": "ADRF Corpus of research publications",
      "rdfs:label": "ADRF Corpus",
      "adrf:publication": [
        {
          "@id": "adrf:publication340"
        },
        {
          "@id": "adrf:publication338"
        }
      ]
    },
    {
      "@id": "adrf:Dataset",
      "@type": "dcat:Dataset",
      "mads:hasRelatedAuthority": {
        "@id": "authority:sh2018002256"
      },
      "owl:sameAs": {
        "@id": "dbpedia:Data_set"
      },
      "skos:definition": {
        "@language": "en",
        "@value": "A collection of tables and metadata used within the ADRF framework, managed by a responsible party"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "dataset"
      },
      "skos:topConceptOf": {
        "@id": "adrf:ADRF_Ontology"
      }
    },
    {
      "@id": "adrf:ADRF_Ontology",
      "@type": "skos:ConceptScheme",
      "skos:definition": {
        "@language": "en",
        "@value": "A mid-level ontology for linked data within the ADRF framework use cases"
      },
      "skos:prefLabel": {
        "@language": "en",
        "@value": "ADRF Ontology"
      }
    }
  ]
}
ceteri commented 5 years ago

Great :) I'll see your compacted context and raise you a default vocabulary -- now in the code that generates the JSON-LD above ^^^ https://github.com/Coleridge-Initiative/adrf-onto/commit/04c4c235954cfb0cb8a39531bf545139a0baf181
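
For anyone following along, here is a rough sketch of what a default vocabulary buys you: bare terms with no prefix fall back to `@vocab` when expanded. This is a toy term expander, not a JSON-LD processor, and the `@vocab` IRI below is a made-up placeholder rather than the one used in the adrf-onto commit.

```python
# Illustrative only: a tiny term expander, not a full JSON-LD processor.
# The context is hypothetical; the real one lives in the adrf-onto repo.
CONTEXT = {
    "@vocab": "https://example.org/adrf#",  # assumed default vocabulary IRI
    "foaf": "http://xmlns.com/foaf/0.1/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand_term(term: str, context: dict) -> str:
    """Expand a compact term to a full IRI using prefixes or @vocab."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @id and @type stay as they are
    prefix, sep, suffix = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix  # prefixed name, e.g. foaf:name
    if sep:
        return term  # unknown prefix: assume it is already an absolute IRI
    vocab = context.get("@vocab")
    return vocab + term if vocab else term  # bare term falls back to @vocab


print(expand_term("foaf:name", CONTEXT))
print(expand_term("Researcher", CONTEXT))
```

With a default vocabulary in place, the project's own terms could be written without the `adrf:` prefix while external vocabularies keep theirs.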

ellisonbg commented 5 years ago

I think having data in this form will really help these explorations. Thanks for putting this together Paco and Nick!


-- Brian E. Granger

Principal Technical Program Manager, AWS AI Platform (brgrange@amazon.com) On Leave - Professor of Physics and Data Science, Cal Poly @ellisonbg on GitHub

saulshanabrook commented 5 years ago

Thanks for this! I updated my prototype with this data and we can walk around it:

[Screenshot: Screen Shot 2019-07-25 at 2 45 28 PM]

bollwyvl commented 5 years ago

A further point I'd like to raise (I was a bit rushed doing that initial extension of @ceteri's stuff, as I was trying to get something out the door before traveling): I think it's pretty unlikely that most front-end tools will be building their own ontologies on the fly. Most of them will either be anchored in the Big N (haven't sussed out what those would be) like DC, OWL, PROV, etc., or would be a) semi-officially adopted by a Jupyter metadata service contract (e.g. a conformance suite), b) extended by an existing community of practice (e.g. NASA/ESA would probably have to collaborate on something), or c) defined on a per-application basis. At any rate, we'd probably end up with a well-known location, e.g. $prefix/share/jupyter/ontologies, for a reference, file-based implementation, but a larger-scale organization might have a schema/ontology store.
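
A minimal sketch of what the file-based reference lookup could look like, assuming `$prefix` maps to Python's `sys.prefix`. The function name and the file extensions are my own guesses for illustration; nothing here is an existing Jupyter API.

```python
import sys
from pathlib import Path


def find_ontology_files(prefix: str = sys.prefix) -> list:
    """List candidate ontology files under <prefix>/share/jupyter/ontologies."""
    root = Path(prefix) / "share" / "jupyter" / "ontologies"
    if not root.is_dir():
        return []  # nothing installed at the well-known location
    # The extensions are a guess at what such a store might hold.
    patterns = ("*.jsonld", "*.ttl")
    return sorted(p for pattern in patterns for p in root.rglob(pattern))


print(find_ontology_files())
```

An organization with its own schema/ontology store would swap this lookup for a call to that service.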

Anyhow, keep in mind that ontologies frequently just describe what Can Be Known, not what Must Be Known. For that, we would likely have to consider a further constraint language, e.g. SPIN, SWRL or the present new hotness, SHACL.
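
To make the Can Be Known / Must Be Known distinction concrete, here is a hand-rolled stand-in for a constraint check. It is not SHACL, SPIN or SWRL, just plain Python over JSON-LD-shaped dicts; the `REQUIRED` rules and the `adrf:someone` node are invented for illustration.

```python
# Hypothetical "must be known" rules: node type -> properties that must be present.
REQUIRED = {
    "adrf:Researcher": ["foaf:name", "foaf:page"],
    "adrf:DatasetProvider": ["foaf:name"],
}


def missing_properties(node: dict, required: dict) -> list:
    """Return the required properties that a JSON-LD node lacks."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]  # @type may be a single string or a list
    missing = []
    for node_type in types:
        for prop in required.get(node_type, []):
            if prop not in node and prop not in missing:
                missing.append(prop)
    return missing


complete = {
    "@id": "adrf:usda",
    "@type": "adrf:DatasetProvider",
    "foaf:name": {"@language": "en", "@value": "USDA"},
}
incomplete = {
    "@id": "adrf:someone",  # made-up node, not part of the ADRF data
    "@type": "adrf:Researcher",
    "foaf:name": {"@language": "en", "@value": "Someone"},
}

print(missing_properties(complete, REQUIRED))
print(missing_properties(incomplete, REQUIRED))
```

The ontology alone accepts both nodes; only the extra rule set flags the second one as incomplete, which is the job a real constraint language would take over.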

tonyfast commented 5 years ago

What if we could know everything? If we start to know everything, then a complex graph forms that may be difficult to manage. Each schema describes facets of the problem. dcat is designed for interoperability between open datasets, and void is a schema to describe how to reuse data. It will be necessary to apply MANY schemas; dcat is a good choice because it references several Big N schemas.

Users will traverse small cycles in a larger graph of knowledge. If we knew the schemas ahead of time, could we assist authors in providing better markup?

Ultimately, any semantic information is opaque to the compute. How is an author incentivized to include more knowledge in their computational documents?

This notebook treats many schemas as data in a dataframe (available on Binder).

The dataframe represents a graph with over 10000 edges including some information in multiple languages.
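
As a rough idea of how JSON-LD nodes turn into edge rows of the kind a dataframe would hold, here is a sketch in plain Python. It is not the notebook's code; the `iter_edges` helper is made up for illustration, and the sample node is copied from the ADRF graph earlier in this thread.

```python
def iter_edges(graph: list):
    """Yield (subject, predicate, object) rows from a list of JSON-LD nodes."""
    for node in graph:
        subject = node.get("@id")
        for predicate, value in node.items():
            if predicate == "@id":
                continue
            values = value if isinstance(value, list) else [value]
            for item in values:
                if isinstance(item, dict):
                    # Node reference ({"@id": ...}) or value object ({"@value": ...})
                    obj = item.get("@id", item.get("@value"))
                else:
                    obj = item  # plain string such as a "@type" or a title
                yield (subject, predicate, obj)


graph = [
    {
        "@id": "adrf:Corpus",
        "@type": "skos:Collection",
        "rdfs:label": "ADRF Corpus",
        "adrf:publication": [
            {"@id": "adrf:publication340"},
            {"@id": "adrf:publication338"},
        ],
    }
]

edges = list(iter_edges(graph))
for edge in edges:
    print(edge)
```

Each tuple is one row; loading the rows into a dataframe with subject, predicate and object columns gives the edge table.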

With this information, how could we help scientists?

saulshanabrook commented 4 years ago

I am going to close this for now, since we seem to have settled on displaying arbitrary JSON-LD.