
Suggestions, questions, and brainstorming

scope of Schema.org for research data #114

Open vsoch opened 6 years ago

vsoch commented 6 years ago

hey schema.org team!

We are putting together an organizational and data movement strategy for research computing and the library at Stanford, and I wanted to ask how the schemas fit into the domain of research data. I will describe the ideas we are discussing first to give you some context.

  • a researcher will start a new study and create a definition of data. Let's say some set of images and annotations. We will want them to be matched to a particular data format (e.g., DICOM images) that has a set of metadata (e.g., some subset of header fields, or Radlex terms).
  • Ideally, this image format will have a particular organization and metadata (something schema.org can represent?) and this will drive tools / software to move it around, and perhaps first put it where it will be used by the researcher (Google Cloud).
  • on Google Cloud, you can imagine it will be in Object Storage, with object-level metadata and, if needed, something like BigQuery to handle scaled queries.
  • Then the data will be moved to the library archive, more of a filesystem setup, where it will be accessible by URL.

As it moves around, the organizational schema will help to guide interaction. It will help with validation, query, and integration with tools built around it. Many of these organizations will come from the research domains themselves. For example, the brain imaging data structure BIDS (http://bids.neuroimaging.io) is already widely used across the neuroimaging community and software.

For the definition of the organization and data, I'm wondering how schema.org can fit in. I saw that Natasha (previously at Stanford!) at Google for Google Datasets (see this article: https://www.blog.google/products/search/making-it-easier-discover-datasets/) mentioned schema.org, and it definitely seems relevant for web page content and making it searchable. Since we want our strategy to stay in sync with what the larger community is doing, I wanted to ask about research data. How can we work together and leverage the resources here so that our datasets can eventually integrate into tools provided by schema.org and Google Datasets, and remain searchable for our researchers after archiving? How can we contribute templates and other tooling here to help toward this? Thanks for your help!

@cmh2166 @hannahfrost @rmarinshaw
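For concreteness, here is a rough sketch of the kind of schema.org Dataset markup this discussion is about: the JSON-LD that Google Dataset Search harvests from web pages. The dataset name, URLs, and field values below are invented placeholders, not a real Stanford dataset.

```python
import json

# A minimal schema.org/Dataset description. All names and URLs are
# illustrative placeholders, not a real dataset.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example neuroimaging study",
    "description": "DICOM images and annotations from an example study.",
    "url": "https://example.org/datasets/example-study",
    "keywords": ["neuroimaging", "DICOM", "BIDS"],
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "application/dicom",
        "contentUrl": "https://example.org/datasets/example-study/archive.tar.gz",
    },
}

# This JSON would be embedded in the dataset's landing page inside a
# <script type="application/ld+json"> tag so crawlers can find it.
markup = json.dumps(dataset, indent=2)
print(markup)
```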

rvguha commented 6 years ago

Hi Vanessa,

Would love to talk to you about this. Can you send me an email at guha@google.com?

guha


thadguidry commented 6 years ago

damn it @rvguha can't we at least TRY to keep the conversation public, until it doesn't have to be??? If you cannot, please at least summarize the public bits of the conversation you have back into this issue for the benefit of all. Thanks man! :)

akuckartz commented 6 years ago

@vsoch The "larger community" does not only consist of Google.

vsoch commented 6 years ago

That’s not what I meant :*(

vsoch commented 6 years ago

@akuckartz would you care to support that statement with a description of what the larger community is doing, per your thoughts?

rvguha commented 6 years ago

Thad, Andreas,

I have been wanting to get hold of someone from the team that Vanessa is part of, for many months, in the context of a different project. I do sincerely apologize for not making that clear.

Please do go ahead and continue your discussion here about the original topic.

guha


vsoch commented 6 years ago

yeah! @thadguidry and @akuckartz, I hope we can continue our conversation here. Can I tell you how excited I am to be starting work on this project? I understand your concern, so let's keep the discussion going here! One small note: I'm about to be hit by a hurricane (east coast), so if there is a bit of delay in my response, I'm probably just away from power or internet. :ship: :boat:

rvguha commented 6 years ago

Stay safe!


danbri commented 6 years ago

@thadguidry - I appreciate your enthusiasm to collaborate but there's nothing wrong with @rvguha (or anyone else) expressing an interest in directly meeting up with other members of this community, especially given that he's in the Stanford area etc etc.

Saying "damn it" in GitHub comes across way more snarkily than it might in real life amongst people who know each other better. While you might mean it with a smile, it's not a great example to set.

From https://schema.org/docs/howwework.html

> Participants in community group and Github discussions are expected to respect the W3C code of ethics and professional conduct, as well as each other.

While I don't take use of "damn it" as breaking those rules, it needlessly nudges things towards being a more hostile and critical environment. Anyway, FWIW I'd also be happy to see more discussion of the original ideas here, but since the original post also had several Google mentions, maybe those Google-specific aspects are better explored elsewhere.

(As an aside -- I've been working with Guha on this RDF stuff since before Google even existed as a company, I'd hope our commitment to the bigger picture might be clear by now...)

akuckartz commented 6 years ago

@vsoch In addition to schema.org there exist several other parallel activities with overlapping (but certainly not identical) requirements and stakeholders. One of them is the W3C Dataset Exchange Working Group (DXWG), which is creating a revised version of DCAT. See https://w3c.github.io/dxwg/

This is not an "either/or", but I suggest that you also look at what DXWG is doing. In Europe, DCAT is used frequently by public administrations, and in Germany there is a new legal requirement for all branches of the public administration to describe Open Data using a DCAT profile. I suppose that will also have some influence on research communities.

And yes, stay safe!

danbri commented 6 years ago

@akuckartz yes, we have the same "backbone structure" in Schema.org as DCAT is based on. This is thanks to Jim Hendler and his group's advocacy for us to adopt that design some years ago.

The basic structure is

(my DataCatalog) --dataset (inverse: includedInDataCatalog)--> (my Dataset) --distribution--> (my DataDownload)

... which shares strengths and weaknesses with DCAT; e.g., there is scope for documenting patterns for time-series-based collections, etc. The similarity means that if you have basic DCAT it is pretty easy to generate Schema.org Dataset markup, and vice versa.
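Expressed as JSON-LD (built here as a Python dict), that backbone might look like the following; the names and URLs are placeholders:

```python
import json

# The DataCatalog -> Dataset -> DataDownload backbone as JSON-LD.
# All names/URLs below are placeholders.
catalog = {
    "@context": "https://schema.org",
    "@type": "DataCatalog",
    "name": "My DataCatalog",
    "dataset": {  # inverse property: includedInDataCatalog
        "@type": "Dataset",
        "name": "My Dataset",
        "distribution": {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/my-dataset.csv",
            "encodingFormat": "text/csv",
        },
    },
}
print(json.dumps(catalog, indent=2))
```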

I am a member of the W3C DXWG representing Google, and liaising with Schema.org. There are some notes I made from the last f2f meeting, towards using JSON-LD's @context feature to integrate DCAT / Linked Data approaches and schema.org, here: https://docs.google.com/document/d/16c_STDu8Dzj-ioRNuGS2tlIFJamlx0-vRKBaPA5Wzfc/edit

Beyond this high level DCAT / Schema.org/Dataset approach (which is barely a change from classical 1990s Dublin Core), there are lots of other aspects to dataset description opening up, and lots of questions about how different standards plug together in practice, even just looking at W3C stuff like CSVW and Data Cube. I've recently been spending a lot of time around Fact Checking initiatives, in the context of misinformation and Schema.org's Claim and ClaimReview markup. In that context there are some DXWG discussions on representing caveats and footnotes from statistical data at https://lists.w3.org/Archives/Public/public-dxwg-wg/2017Jul/0041.html which may be of interest here. There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies.
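As a rough illustration of how mechanical the DCAT-to-schema.org hop can be given the shared backbone, here is a toy converter. The property correspondences below are an informal reading of the two vocabularies, not an official alignment:

```python
# Because the DCAT and schema.org dataset models share the same backbone,
# a basic conversion is mostly property renaming. This mapping is an
# informal sketch of common correspondences, not an official alignment.
DCAT_TO_SCHEMA = {
    "dcat:Dataset": "Dataset",
    "dcat:Distribution": "DataDownload",
    "dcat:distribution": "distribution",
    "dcat:downloadURL": "contentUrl",
    "dcat:mediaType": "encodingFormat",
    "dct:title": "name",
    "dct:description": "description",
}

def dcat_to_schema(node):
    """Recursively rename DCAT keys (and @type values) to schema.org terms."""
    if isinstance(node, dict):
        return {DCAT_TO_SCHEMA.get(k, k): dcat_to_schema(v) for k, v in node.items()}
    if isinstance(node, list):
        return [dcat_to_schema(v) for v in node]
    if isinstance(node, str):
        return DCAT_TO_SCHEMA.get(node, node)
    return node

record = {
    "@type": "dcat:Dataset",
    "dct:title": "Example dataset",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:downloadURL": "https://example.org/data.csv",
    },
}
converted = dcat_to_schema(record)
print(converted)
```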

DmPo commented 6 years ago

Hi Vanessa,

Good question!

My (Stanford-CDDRL-backed) counter-disinfo project is using an extended version of schema.org/ClaimReview on the basic level and it would be great to have Stanford library resources easily integrated.

So please keep me in the loop.

Thank you!

Best regards, Dmytro Potekhin Founder & CEO FakesRadar.org https://fakesradar.org/


thadguidry commented 6 years ago

@danbri Noted about use of "damn it". Sorry, bad day at Ericsson. My apologies. Thanks for noting you also would like to see as much discussion as can be had in this issue as well.

vsoch commented 6 years ago

hey everyone this information is really fantastic! What I'm doing is starting my exploration from the point of version control - any schema that we use, and then tooling to move the data it describes, is going to start with GitHub (and, I'm realizing, git-annex). So my plan is the following:

> which may be of interest here. There are also efforts like Bioschemas who are starting to crawl this data and elaborate a few schema.org additions that help bridge the cross-domain dataset descriptions with domain-specific identifiers and ontologies.

Having templates and entrypoints for say, a biologist to easily plug into the right schema, and then use it to move data from a local place to Google Cloud (I don't mean to preference a vendor but the Stanford hospitals are heavily invested and this will be a first use case!) and then to the library archive when "live work" is done is my desired goal.

I'm also really glad to hear that the various technologies are related - it makes life much easier, but also says good things about the people and communities involved.

Early Goals

So here is what I'm setting out to do first - and of course this will change as I learn more! I think in my first "dummy test" I am going to try and see how far I can get doing the following:

Use Case Driving Goal

  1. Start with dataset locally, some file organization and metadata
  2. Choose an organization from a nice repository / web interface
  3. Use provided validators to put data into chosen file organization, and add files
  4. Use datalad (git-annex) to move files to a Google Cloud Storage. Do magic.
  5. Use datalad again to move to some (fake) archive, basically another server
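Step 3 above could start as small as a filename-pattern check. The sketch below is a toy validator using invented, BIDS-flavored patterns; a real validator (e.g. bids-validator for BIDS) is far more thorough:

```python
import fnmatch

# Toy sketch of step 3: check that a set of filenames conforms to a
# chosen file organization. The patterns here are invented, loosely
# BIDS-flavored examples, not a real specification.
ORGANIZATION = {
    "required": ["dataset_description.json"],
    "allowed": ["sub-*/anat/sub-*_T1w.nii.gz", "sub-*/func/sub-*_bold.nii.gz"],
}

def validate(filenames, org):
    """Return a list of human-readable errors (empty list means valid)."""
    errors = []
    for required in org["required"]:
        if required not in filenames:
            errors.append(f"missing required file: {required}")
    patterns = org["allowed"] + org["required"]
    for name in filenames:
        if not any(fnmatch.fnmatch(name, p) for p in patterns):
            errors.append(f"unrecognized path: {name}")
    return errors

files = ["dataset_description.json", "sub-01/anat/sub-01_T1w.nii.gz", "notes.txt"]
print(validate(files, ORGANIZATION))  # flags notes.txt as unrecognized
```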

Additional Tooling will Mean

I will link this issue in my notes so I remember to post updates along the way! I really like Github issues for this kind of discussion, and to be honest get a little lost in Google Doc comment chains. If it's okay with you, we might keep this issue open to have further discussion in the coming weeks.

danbri commented 6 years ago

@DmPo interesting- do you have a pointer to the details? Is the newish Claim type of any use for your approach?

DmPo commented 6 years ago

Hi @danbri, I will have a pointer later this week - launching a new site right now. Yes, sameAs is a good idea - it can be used for linking reposts of a fake to the "original" fake. This makes it easier to debunk reposts of the same fake.

vsoch commented 6 years ago

hey everyone! A quick update after some weekend work. I will start with the use case and walk through the steps I've taken so far (and where noted, where I have a question or two). Just as a note I sent this out to a few of you via old school email too :)

  1. I started with a dataset. It describes Container recipes. This is my dummy case - I want to use them on Google Cloud and then archive them somewhere.
  2. I realized that there is no description in schema.org for a Container, so I created tooling that:
     a. starts with Google Drive files, asking the user to write the specification and export it to .tsv;
     b. forks a template repository, where the user just adds their files to a folder! Once connected to continuous integration, it's ready to generate the specification using a Docker container. The version-controlled finished bundles are sent back to GitHub Pages (they could be rendered prettily here, but I haven't made a template yet);
     c. finally, the user does a PR with the generated files to wherever schema.org specifications are submitted. **I would want some guidance here about what files I need to submit, and where/how, for schema.org.** From what I can tell I need to generate some JSON-LD and examples, and provide a web interface, and I can do this (just need a bit of time!)

For all of the above, I'm incapable of moving forward without a nice web interface to describe the process (beyond the GitHub READMEs), and I also want a "specifications" repo that users can submit specification output to (one cohesive specifications repo), so I'm going to work on this next. Once I have this web interface and the Container specification list (the discussion I hope to start here), I'd like to submit it properly to schema.org (as an extension?) and then go back to describing my Dockerfile dataset! I am thinking of using Datalad to move things around, notably for the git-annex functionality.
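The .tsv-to-specification step described above could be sketched roughly like this. The column names and output shape are hypothetical, not the exact format the tooling uses:

```python
import csv
import io
import json

# Sketch: a domain expert fills in a spreadsheet of properties, exports
# it as .tsv, and tooling turns it into a machine-readable draft. The
# columns and the output structure are hypothetical examples.
tsv = (
    "Property\tExpected Type\tDescription\n"
    "name\tText\tThe name of the container\n"
    "version\tText\tThe container version tag\n"
)

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t")
spec = {
    "@type": "rdfs:Class",
    "rdfs:label": "ContainerRecipe",  # example class name from this thread
    "properties": [
        {
            "rdfs:label": row["Property"],
            "rangeIncludes": row["Expected Type"],
            "rdfs:comment": row["Description"],
        }
        for row in reader
    ],
}
print(json.dumps(spec, indent=2))
```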

And I'd really like to bring those interested here into working on this tooling under openschemas! Just let me know you are interested and I can add you. I realize that some of the tech / other open-science-related things don't fit well under bioschemas (where I was originally contributing), so I made this organization, and it falls nicely alongside openbases, which provides templates for doing reproducible things for open science generally.

Full circle, cue Lion King music :) That's the plan for now! I don't think I need any help or have questions beyond the bolded item above. Whatever information you relay to me I can also write clearly into the web interface I'm making, so it will be a good use of time. I'm going to work on that later today - likely taking a quick break to go for a run :running_man:

vsoch commented 6 years ago

Quick update! @ricardoaat has taught me the updated specification format to feed into the web interface (and schema.org), so I'm updating map2model to generate that (see the quick links in this issue), and I've prepared the web interface I was talking about to serve them (a sibling of the beautiful bioschemas.org!). I'll thus:

Note that the repos / site are pretty bare bones, I'll be working heavily on them this week.

vsoch commented 6 years ago

Quick update:

I'm working on testing for the submissions to specifications next (in map2spec) and then I'll finish up the Container* family of new specifications (hopefully this weekend?) and then (finally) try using the definitions to describe the dinosaur Dockerfile dataset (ContainerRecipe)

vsoch commented 6 years ago

Small distraction - I converted the spec-container fully into a spec-template that is now added as an openbases template. This means adding a new badge for specification builders ("spec") that uses the same red from schema.org (the darker one) as a fun easter egg :) The full template falls within openbases because the user just needs to fork, add some file content, and then build on circle and they get artifacts / ghpages artifacts, and a web interface for their drafts.

Same plan mentioned before for next steps (testing then Container definitions!) @rajido also reached out to me today and we are going to talk about labeling the biocontainers, which will be wicked!

vsoch commented 6 years ago

<update> I started the ContainerRecipe draft. The delay is because I set up the entire validation library in openschemas-python, and then integrated it into the specifications repo, so now the specifications repo is ready to have files (like ContainerRecipe.html) submitted and tested properly with a PR (see the second-to-last tab here). I'll be doing the Image/Distribution and others soon, and then submitting to the specifications repository for more feedback (and testing of course), along with more robust docs for openschemas-python </update>

vsoch commented 6 years ago

Another update. We now have ContainerImage! I'm hoping for discussion / feedback from OCI, and then to generate some kind of official submission for a schema.org extension. Is there a documentation base for how to do that? Discussion with OCI (I hope) will happen here --> https://github.com/opencontainers/image-spec/issues/751#issuecomment-424657623 and that's also where you can find links, if interested.

My next steps would be to:

And in there somewhere I'll clean up the docs and write some "hey you can do it too!" material to help researchers with weird datatypes that warrant a specification to contribute.

RichardWallis commented 6 years ago

Developing / testing / proposing updates, additions, extensions, etc. to Schema.org - background reading:

vsoch commented 6 years ago

okay so I'm trying to follow some of the documentation and from what I can tell:

@satra advised me that I shouldn't add new properties, so I just removed any new ones (with parent "Container") from the proposal I made, but I'm a bit confused about this, because the example above definitely has a bunch that are categorized under "Legislation" (and that is the proposal).

I had started one here mimicking Bioschemas, but it seems like the suggested approach is to use the app provided at this repo. Should I blow away what I've done and start again? As a newcomer, I find this entire process and the discussion really confusing, for what it's worth. There is no clear checklist or set of steps for going from scratch to a submission, only long, verbose pages across many places, and I'm doing my best but struggling quite a bit. Guidance (specific, stepwise things) would be appreciated! Thank you!

vsoch commented 6 years ago

And just a quick comment, this is a huge issue, from the view of a developer:

> We expect collaboration will focus more on schemas and examples than on our supporting software.

If an expert with a schema or domain does not find it easy to contribute, that is a failure state I think. I want to suggest that we can achieve both good software and specifications, and they can live together in harmony.

rvguha commented 6 years ago

I am not sure why you shouldn't suggest new properties and classes. After all that is the point, right?

And yes, we do need to evolve so that we can start collecting and redistributing software. Part of the problem is that this GitHub repo is really the tracker for the App Engine code that runs the schema.org site, and having to bring up a new one is probably not the path of least resistance for making a contribution.

My personal preference (and I am sure this is just a reflection of my cruftiness) is a simple English description of the terms and classes you propose to add. On this list.

The most important part of the proposal is who you think will use it and why the community will benefit if everyone uses it. Too many proposals are from 'professional ontologists' (like I used to be), who would like to see a schema for their favourite topic.

guha


vsoch commented 6 years ago

Thank you for this feedback @rvguha ! I had just cloned this repository and was going to take a crack at creating a Dockerized local template that could be run to produce a local generation of the example site. But it sounds like this isn't a priority, at least as long as there is some method to share a specification suggestion? In this light, does the template that I created, perhaps with better descriptions of the classes, suffice for discussion? I was aiming to make contribution easier - so far the workflow is:

The template generates the static files of the specifications in yaml, and I would want this template to also generate whatever file format is needed for a "real / final / etc." contribution to schema.org (json-ld?).

The goal of the steps above is that any domain expert can generate version-controlled files that are easy to discuss over a web interface, plus the "final submission" files, without knowing anything about software engineering / version control, etc. This is a very easy path to contribution, but if the templates I've made thus far aren't producing the kind of content that could be discussed, I need to step back and re-work them. If they are in the right direction, then I would suggest that I:

  1. put more time into the human elements (describing the properties / examples) and
  2. then add the data structure format that would be needed for a final submission to schema.org

And although it's not a priority, is there interest in generating a Dockerized local version of the App Engine deployment as a (possibly better) alternative for previewing / discussing a specification contribution? As the Legislation group did, it looks really sharp. I notice that the template here has a nice switcher at the bottom for looking through the different data structures of the specifications, and minimally I'd like to add that to the template I am working on, because I'm guessing one or more of those structures is the final file(s) to be submitted.

As for rationale - holy cow! It's so badly needed it almost feels silly to restate, but I can briefly comment now. Containers are hugely important for anything and everything! A container is the currency of reproducible deployments, and with technology like Singularity, even of scientific compute (because we can run on shared cluster resources). Without a specification that describes the components of images, recipes, and distributions (registries), we are living in a messy universe where the best I can do to find a "tensorflow" container is to search Docker Hub and pray that the random container I choose (based on tags or a name) might be what I'm looking for. The way we solve this problem is with Google's help, meaning having containers join Dataset search. Since schema.org specifications are the driver of this, we simply must have these definitions. They must follow the OCI specifications, and move in parallel with them, so that we aren't inventing new things and making it harder. This (in my mind) seems like such an easy problem to solve; the slow and hard parts are just putting these pieces together. This is what I'm trying to do - and since I've found this challenging, I've been trying to make a template and a more programmatic way for the next person who wants to contribute to do so more easily.

Some quick links:

In a nutshell, we need a scaled way to provide not only a list of containers, but better information about them that can be searched. I think this goes beyond the abilities of each tiny maintainer, but a massive search engine like Google can do it fairly easily. It would be trivial for the maintainers of:

and other registries to be able to add metadata tags to the container registry pages, and then have a beautiful way for a scientist to find not just a tensorflow container, but the tensorflow container for the purpose that he/she needs! Without this, the universe of containers is just a mess. So much awesome development and tooling will come, along with an ability to better compare methods and software, when we have this labeling.
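To make the registry idea concrete, here is a sketch of what integration could look like: a registry page embeds JSON-LD describing each image, and a crawler extracts it. Note that ContainerImage is the type proposed in this thread, not (yet) a schema.org term, and the page snippet and regex are simplified stand-ins for a real crawler:

```python
import json
import re

# Sketch: a registry page embeds JSON-LD metadata for a container image,
# and a crawler extracts it. "ContainerImage" is the *proposed* type
# from this thread, not (yet) part of schema.org; values are invented.
page = """
<html><body>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@type": "ContainerImage",
 "name": "tensorflow",
 "softwareVersion": "1.11.0",
 "description": "TensorFlow with GPU support"}
</script>
</body></html>
"""

# A real crawler would use an HTML parser; a regex keeps the sketch short.
match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', page, re.DOTALL
)
metadata = json.loads(match.group(1))
print(metadata["name"], metadata["@type"])
```

With markup like this on registry pages, "the tensorflow container for the purpose I need" becomes a structured query over names, versions, and descriptions rather than a guess.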

Please let me know your feedback! I don't want to just talk about these things, I want to make them happen.

danbri commented 6 years ago

These investigations sound very interesting but I'm having trouble keeping the various levels of abstraction clear in my head, to be honest. Am I right that we're talking both about the specific tooling for Schema.org collaboration ("I had just cloned this repository, and was going to give a crack at creating a Dockerized local template that could be run") as well as working towards new schema.org schemas to describe things around containerized software, services and datasets?

If that's correct I'd suggest breaking out the Schema.org project tooling aspects into another issue. @RichardWallis has been working to minimize some of our dependency on AppEngine; today's commits bring us closer to a static file-based generation and serving model. For those looking at dockerizing schema.org's current tooling, the Travis-CI config might be useful.

On the vocabulary front there is a grey area in between http://schema.org/Dataset and the more software-oriented types, where container technology is increasingly central. This came up a bunch when I was talking to folk like bioschemas. It would be useful to understand how container descriptions could be integrated into both schema.org-based Dataset description, and also into W3C DXWG WG's DCAT efforts. The structures are very similar. Perhaps @agbeltran can offer some advice, as she has a link in both those efforts.

vsoch commented 6 years ago

> Am I right that we're talking both about the specific tooling for Schema.org collaboration ... as well as working towards new schema.org schemas to describe things around containerized software, services and datasets?

Correct! I had only intended to do the second, but I found the first so hard that I decided to try and help as I was doing the second. I'm mostly done with the goals I had outlined for the first, and am very happy to discontinue working on it in favor of the second thing (my original goal).

> If that's correct I'd suggest breaking out the Schema.org project tooling aspects into another issue. @RichardWallis has been working to minimize some of our dependency on AppEngine; today's commits bring us closer to a static file-based generation and serving model. For those looking at dockerizing schema.org's current tooling, the Travis-CI config might be useful.

This sounds great! I am a big fan of continuous integration :) @RichardWallis I won't throw any more wrenches into the mix, I'll stick with the openschemas templates I'm using now because I'm mostly done, but please reach out to me if I can be of help.

On the vocabulary front there is a grey area in between http://schema.org/Dataset and the more software-oriented types, where container technology is increasingly central.

Exactly. This is why I created openschemas. Take a look there at the link in the first lines of the opening paragraph, which I wrote some time ago now. It makes this exact point. :)

This came up a bunch when I was talking to folk like bioschemas. It would be useful to understand how container descriptions could be integrated into both schema.org-based Dataset description, and also into W3C DXWG WG's DCAT efforts.

I don't think containers belong with Datasets, or with Biology related things. A container is definitely not the right place for a dataset (although it interacts with them) and there is definitely no exclusive tie to bioschemas / biology or even a scientific domain. It's an open source technology that is useful for many things, and deserves its own sort of bucket (or minimally shouldn't be forced artificially into a bucket it doesn't belong in).

The structures are very similar. Perhaps @agbeltran can offer some advice, as she has a link in both those efforts.

That would be great! Here are the full set of specifications, for recipes (e.g., a Dockerfile), images (e.g., the actual container binary, which might have some data but really is more likely to be software that interacts with data), and then the base of those things, the more abstract Container.

https://openschemas.github.io/specifications/

Happy Hacktoberfest everyone!! :jack_o_lantern:

thadguidry commented 6 years ago

Hi @vsoch

I just noticed that OCI already has good overlap, in its Container Image image-spec annotations, with our Schema.org/CreativeWork - this is a great starting point.

For datasets that are sometimes provided within Container Images... how is the metadata typically captured for those datasets? What metadata standards are commonly used in scientific realms besides those listed here, if you know? (Myself, Dan, and others are aware that there are other standards used to capture metadata about datasets, such as https://frictionlessdata.io/specs/data-package/ )
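To make the overlap Thad mentions concrete, here is a minimal sketch of translating standard OCI image-spec annotation keys into a schema.org/CreativeWork-shaped JSON-LD stub. The annotation keys are the real ones from the OCI image-spec annotations document; the choice of schema.org properties on the right-hand side is an assumption for discussion, not an agreed mapping.

```python
import json

# Assumed mapping from OCI image-spec annotation keys to schema.org
# properties; the property choices are illustrative, not standardized.
OCI_TO_SCHEMA = {
    "org.opencontainers.image.title": "name",
    "org.opencontainers.image.description": "description",
    "org.opencontainers.image.created": "dateCreated",
    "org.opencontainers.image.authors": "author",
    "org.opencontainers.image.url": "url",
    "org.opencontainers.image.licenses": "license",
    "org.opencontainers.image.version": "softwareVersion",
}

def annotations_to_jsonld(annotations):
    """Translate OCI annotations into a minimal schema.org JSON-LD stub."""
    doc = {"@context": "https://schema.org", "@type": "CreativeWork"}
    for key, value in annotations.items():
        prop = OCI_TO_SCHEMA.get(key)
        if prop:
            doc[prop] = value
    return doc

example = {
    "org.opencontainers.image.title": "my-analysis",
    "org.opencontainers.image.version": "1.0.0",
}
print(json.dumps(annotations_to_jsonld(example), indent=2))
```

Unmapped annotation keys are simply dropped here; a fuller mapping could fall back to a generic key/value property instead of discarding them.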

danbri commented 6 years ago

@vsoch - thanks, I have a clearer picture now. Just replying quickly on one point.

I don't think containers belong with Datasets, or with Biology related things. A container is definitely not the right place for a dataset (although it interacts with them) and there is definitely no exclusive tie to bioschemas / biology or even a scientific domain.

I entirely agree. It's rather that the usecase arises there (and elsewhere). Organizations, events, scholarly articles etc also have identifiable and nameable relationships with data and datasets, but are fundamentally different. My thinking was just that we may want to seek out sanity checks from those working under the "dataset metadata" and lifescience/bio banners, to help work out which schemas meet which needs.

vsoch commented 6 years ago

@thadguidry oh interesting! I think an annotation for a container coincides with a LABEL (it's called LABEL in a Dockerfile, or %labels for a Singularity container), so the field that I have for annotations would perhaps be a new property classified as a kind of creative work?

For containers, it's typically bad practice to try and use them to provide datasets, at least any significant ones in terms of size. For container metadata, however, the standard is to provide it via an inspect command (e.g., docker inspect <container>, or the similar inspect I implemented back in the day for Singularity: singularity inspect <container>). The metadata usually consists of:

I'm not super experienced with metadata for datasets, but I'd expect to see minimal things like versions and software versions. Another important bit for containers (not listed above) are build time needs / host dependencies (e.g., nvidia-docker or similar). You can see an example of fields that a user wants to tag for a database of dockerfiles here --> https://github.com/vsoch/dockerfiles/issues/4 it's primarily tags based on "what software is inside here?" and "what vendors are relevant?"
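For reference, the inspect-style metadata discussed above can be pulled out programmatically. This is a minimal sketch of extracting Dockerfile LABEL metadata from the JSON that `docker inspect` prints (a JSON array, with labels under `Config.Labels`); the sample labels here are made up for illustration.

```python
import json

def labels_from_inspect(inspect_json):
    """Extract Dockerfile LABEL metadata from `docker inspect` output.

    `docker inspect <image>` prints a JSON array of image descriptions;
    LABEL instructions end up under Config.Labels in each entry.
    """
    data = json.loads(inspect_json)
    return data[0].get("Config", {}).get("Labels") or {}

# Trimmed-down stand-in for real `docker inspect` output (most fields
# omitted; the label names below are hypothetical):
sample = json.dumps([{"Config": {"Labels": {
    "org.label-schema.name": "example",
    "maintainer": "vsoch",
}}}])
print(labels_from_inspect(sample))
```

In practice you would feed this the output of `subprocess.check_output(["docker", "inspect", image])`, which requires a running Docker daemon.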

thadguidry commented 6 years ago

@vsoch Just so you know... We have an equivalent Tags convention in Schema.org with our https://schema.org/keywords property.

You might also already be aware of this, but... Images and Containers are 2 different things... I was previously talking about Images, where you would get metadata about what an image (and its possible data) might contain from the output of docker image inspect MyImageName

vsoch commented 6 years ago

ah this is very good! I'll see if I can integrate these things (tags and annotations discussed above) into the container specification(s), and also provide a "Where does it fit?" simple diagram to share here. Likely tomorrow - need to eat some dinner. :plate_with_cutlery:

thadguidry commented 6 years ago

@vsoch no problem. Also we have a "generic" Key:Value system that can be used when there is no Schema.org Property already created yet ... https://schema.org/PropertyValue

So right now, in my opinion, I would say Schema.org has 100% of what you need to describe structured data about Containers and Images and their metadata. (The existing properties we have might not be the best fitting, but they can fit and be understandable for most search engines and structured data parsers.)

If you don't find a property within Schema.org to hold your structured data about Containers and Images and Datasets... then let's talk about those, and we can gladly point to possible candidates. (Incidentally, a few weeks ago, I did look at your use case with BIDS data, and specifically around how PyBids handles the metadata within its functions here: https://github.com/bids-standard/pybids/blob/master/bids/variables/variables.py - I didn't see anything that Schema.org couldn't handle currently in some fashion. Again, perhaps not always the best fitting, but it could be expressed with Schema.org's current Types and Properties.) So when metadata doesn't seem to fit well, we can talk about those.
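The generic key/value fallback mentioned above can be sketched as follows: arbitrary container metadata wrapped as schema.org PropertyValue entries. Note the hedges: the metadata field names are invented for illustration, and attaching them via additionalProperty to a SoftwareApplication is itself an assumption (schema.org defines additionalProperty on types like Product and Place, not on CreativeWork subtypes).

```python
import json

def property_value(name, value):
    """Wrap an arbitrary key/value pair as a schema.org PropertyValue,
    the generic fallback for metadata with no dedicated property yet."""
    return {"@type": "PropertyValue", "name": name, "value": value}

# Hypothetical container metadata; the field names are illustrative,
# not an agreed vocabulary, and additionalProperty on a
# SoftwareApplication is an assumption for discussion.
record = {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "name": "my-container-image",
    "additionalProperty": [
        property_value("buildDependency", "nvidia-docker"),
        property_value("containerRecipe", "Dockerfile"),
    ],
}
print(json.dumps(record, indent=2))
```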

ptsefton commented 6 years ago

Hi all, I'd like to point you to an effort we have been working on that uses Schema.org for packaging data (which is not containers or images). This uses almost 100% pure schema.org to describe data, and seems to be compatible with Google's dataset search. https://github.com/UTS-eResearch/datacrate/tree/master/spec/1.0.

See the parts about file provenance using schema:CreateAction.

Does any of this help?

charlesvardeman commented 6 years ago

Not certain if you will find this useful. As part of the DASPOS project, we developed a "Computational Environment" ontology design pattern (http://ceur-ws.org/Vol-2043/paper-03.pdf) in collaboration with CERN to capture the provenance of the environment where HEP calculations are performed. As part of the process, we looked at both the VMWare and Dockerfile vocabularies to inform the pattern, so that it captures a broad set of the vocabulary. The pattern can be populated via a script with instances from Wikipedia (and Wikidata). The OWL for the pattern is in https://github.com/Vocamp/computationalEnvironmentODP. There is a matching "Computational Activity" pattern (https://github.com/Vocamp/ComputationalActivity) that captures the provenance around a computational execution and links to the Computational Environment pattern. See the concept map: https://github.com/Vocamp/ComputationalActivity/blob/master/concept-map/computationalActivity.pdf. We didn't get to the (general) patterns to describe the underlying datasets used in a particular computational activity.

I had a student work on a proof-of-concept "smart containers" (http://linkedscience.org/wp-content/uploads/2015/04/paper2.pdf) tool that wrapped the docker command line tool to capture the provenance of Docker operations using the ODPs and attach it as a label to the docker image. The code for this somewhat functional prototype is in the smartcontainers repo: https://github.com/crcresearch/smartcontainers

vsoch commented 6 years ago

hey everyone! I did a reverse of plan - I realized that I needed to bring up discussion for "Where does it fit" before updating the specification, so let's start with that :)

I'll start with (texty) discussion here about where I think each component fits into the currently existing schema.org. After discussion here, I'll update the specifications files to reflect what we discuss. I apologize in advance for probably not using terms / descriptors / properties correctly - please feel free to correct where I'm off.

Previous Art

There is an ontology that describes virtualization but I think the detail might be too much for the goals of schema.org. However, this led me to step back and approach this by asking the simple question of how much do we need to represent to achieve the current goals?

What are your goals, dinosaur?

This will organize our container universe, and be essential not just for academia, but for industry and all domains it touches.

What level of abstraction is ideal?

Representation of containers that is too detailed is actually just as bad (I think) as not having enough representation, period. It might be useful to know the kernel version if I want to know whether I can use a Singularity container there (for example, CentOS 6, no overlayfs, ruhroh), but there are a lot of intricate details that might be useful in only 1% of cases. And having all the extra support for that 1% actually makes the specifications really complex and confusing. So for this first go, I would suggest we try to hit the core needs of the top 80%, and favor simplicity with the mindset that if additional need is there, the community will step in and express it.

Where do these Container specifications fit in?

The original specifications I had in mind were:

but now I realize there's a bit more to it than that. Let's talk about this, and I'll address them one by one via questions, explaining my thought process.

Question 1: Where does container fit in?

I started with a very simple question.

Is a linux container a kind of software?

Meaning SoftwareApplication. In that it's a binary, I think that we could fit it under SoftwareApplication. That would look like this:

Thing > CreativeWork > SoftwareApplication > Container

That is the easy answer, because it fits into existing specifications in schema.org. But more correctly, if we are to also eventually model virtual machines, then we really need hypervisors too. And hmm, I don't think a virtual machine is a kind of SoftwareApplication per se; it's something else. It has software applications! So let's move it up - and I don't even think it belongs under CreativeWork, grouped with poetry and books and whatnot. It doesn't have a good parent in the base. We would need something like:

Thing > Virtualization > Container
Thing > Virtualization > Hypervisor

But at some point, we are going to care about operating systems, hardware, and hosts. The hosts are the machines with hardware, and the operating systems are what the virtualization deploys. So we need something like:

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
        - runsOn Hardware
        - supports Virtualization
        - has OperatingSystem

And our virtualization then relates to those things:

Thing > Virtualization
        - isSupportedBy SoftwareHost
        - has OperatingSystem

and now adding containers and hypervisors, they can inherit through this graph

Thing > Virtualization > Hypervisor
Thing > Virtualization > Container
                         - has OperatingSystem
                         - buildFrom ContainerRecipe (sometimes)
                         - has SoftwareApplication (sometimes)
                         - has annotations

So while I don't think we want a super detailed organization of hardware and virtualization, I think it should be represented on a high level because it's going to be the case that these are important parts of describing containers. For the above - I am modeling the level of Container instead of just ContainerImage because I'm not sure if we can have a Container that isn't associated with an image. It could be that the properties above should just belong with a ContainerImage that is the child of Virtualization.

Question 2: Where does container recipe fit in?

A container recipe refers to a set of build steps for a container, which is a binary that has an operating system and associated libraries. Examples include Dockerfiles and Singularity recipes.

If we want a quick and dirty solution, in that it's a template or script, a ContainerRecipe (sort of) fits under the category of SoftwareSourceCode. But I'm not sure I would call a container itself software, and then we are walking the fine line of not properly distinguishing a Singularity or Docker container from, for example, the software that runs them (e.g., docker or singularity). But it does fit the spirit, so maybe it can be a kind of SoftwareSourceCode?

Thing > CreativeWork > SoftwareSourceCode > ContainerRecipe

If we consider a container to be a kind of software (it is a binary...) then that fits pretty cleanly. Does anyone else have thoughts about this?

Question 3: Where does container image fit in?

The container image isn't the running instance, but the binary (for Singularity the actual file) that generates the instance. It's weird, yeah. If we consider Container to be the more abstract thing, if it could be the case that there are kinds of containers that don't require images, then it would be a child:

Thing > Virtualization > Container > ContainerImage

And then the instance I think would coincide with the container runtime, so we would have this:

Thing > Virtualization > Container > ContainerImage > ContainerRuntime

The OCI has specifications for images and for runtimes, so this is logical to model both. The ContainerRuntime is the instance generated from the ContainerImage, which is a type of Container, a kind of Virtualization technology.

Question 4: Where does container distribution fit in?

A container distribution is a container registry (Docker Hub, Singularity Hub, Quay.io, Biocontainers, etc.) Would it be a subtype of a collection?

Thing > CreativeWork > Collection > ContainerDistribution
                                    has Container (many)

So basically, it's a collection of Container, or ContainerImage. The OCI also has a specification for registries (generally the information they serve and manifests) so we would represent that here.

@ptsefton data crate looks really cool and I definitely think it could be useful once we have these definitions (I've added an issue to give it a try along with datalad for my test dataset!). @charlesvardeman this is also very useful - have you thought about bringing up these descriptors with the OCI maintainers so they can be linked to a container runtime? I would say we would want to model them separately in schema.org (the idea of the Computational Environment) and then say something like ContainerRuntime needs ComputationalEnvironment. But since the strategy is to go by the properties of OCI (and not roll our own, to the best that we can), I think the most logical avenue is to try to integrate there first, OR contribute an independent specification of a Computational Environment here, and then convince OCI to embrace it too?

Wait... what about this computational environment?

Actually, this is a very good point, because a computational environment might describe one or more hosts, and this is the kind of thing you would call any sort of cluster. But again I would challenge us to reduce the complexity to a level that can be extended, but doesn't over-complicate based on the goals of having it. In my little framework I'm describing here, based on looking at your chart, it would seem that you are suggesting a ComputationalEnvironment is parent to all these things? Something like:

Thing > Hardware
Thing > OperatingSystem
Thing > ComputationalEnvironment
Thing > ComputationalEnvironment > SoftwareHost
                                   - runsOn Hardware
                                   - supports Virtualization
                                   - has OperatingSystem

Here is where it gets kind of cool! I find this interesting because a computational environment could just as well refer to a collection of hosts and hardware (e.g., Kubernetes, SLURM / SGE) - meaning multiple SoftwareHosts - OR, for the humans among us, just a single SoftwareHost. So instead of SoftwareHost being a kind of ComputationalEnvironment, it becomes a link / (properties?) instead:

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
Thing > ComputationalEnvironment 
          - has SoftwareHost (many) each of which...
            - runsOn Hardware
            - supports Virtualization
            - has OperatingSystem

I like that better :) Let's put this all together to look at in one place, and I'll leave it open for discussion! I want to suggest that I can take charge of creating specifications for review for the Thing > Virtualization hierarchy, and perhaps @charlesvardeman your group has developed the Computational Environment specifications, and could define a subset to fit into schema.org?

Thing > Hardware
Thing > OperatingSystem
Thing > SoftwareHost
Thing > ComputationalEnvironment 
          - has SoftwareHost (many) each of which...
            - runsOn Hardware
            - supports Virtualization
            - has OperatingSystem

Thing > Virtualization
        - isSupportedBy SoftwareHost

Thing > Virtualization > Hypervisor
Thing > Virtualization > Container
                         - has OperatingSystem
                         - buildFrom ContainerRecipe (sometimes)
                         - has SoftwareApplication (sometimes)
                         - has annotations

Thing > Virtualization > Container > ContainerImage > ContainerRuntime
Thing > CreativeWork > Collection > ContainerDistribution
                                    has Container (many)
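To make the proposed hierarchy above tangible, here is a sketch of what a single container image description might look like as JSON-LD under it. Everything flagged "hypothetical" below does not exist in schema.org today; the types, properties, and names are purely illustrative of the proposal.

```python
import json

# Purely hypothetical instance data: ContainerImage, ContainerRecipe,
# ContainerDistribution, and buildFrom are proposed, not real
# schema.org terms.
container_image = {
    "@context": "https://schema.org",
    "@type": "ContainerImage",             # hypothetical type
    "name": "my-analysis-image",           # made-up example name
    "operatingSystem": "Ubuntu 18.04",
    "buildFrom": {                         # hypothetical property
        "@type": "ContainerRecipe",        # hypothetical type
        "name": "Dockerfile",
    },
    "isPartOf": {
        "@type": "ContainerDistribution",  # hypothetical type
        "name": "Docker Hub",
    },
}
print(json.dumps(container_image, indent=2))
```

The point is only that the nesting mirrors the tree: an image builds from a recipe and lives in a distribution, while host/hardware relations would hang off a separate ComputationalEnvironment description.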

Let's circle back to our original points - first the goals. The above would allow for nice labeling of containers with software and data, for containers served in registries, that then could be indexed by Google and the properties exposed for not just discovery, but for "grid type" analyses to answer questions like "What is the optimal computational environment to run Container X?"

Second, a mindset of simplicity - for many of these, we can leave them very simple / general (like shells) and have the community come in and make contributions for the details. I think it's our job to set up the skeleton / framework, and not to try to get the entire detailed thing perfect.

That was a lot more than 0.02, so I'll say there is my 2 dollar pancake. :pancakes: :)

charlesvardeman commented 6 years ago

@vsoch I would be happy to help. I’m traveling over the next couple of days and need some time to digest your suggestions and write some comments. One other comment that may give food for thought: we started developing a pattern called ComputationalObservation that was akin to an O&M or SOSA observation, for a computational result. Here is a brief talk that I gave to Ontolog: http://ontolog.cim3.net/file/work/OntologySummit2015/2015-03-05_OntologySummit2015_Beyond-Semantic-Sensor-Network-Ontologies-2/Track-B_OntologySummit2015_CharlesVardemann_2015-03-05.pdf. Slide 18 shows how computational model, algorithm, SoftwareAsCode, Library, and Execution tie together for a computational observation.

vsoch commented 6 years ago

Have a safe trip, and looking forward to your thoughts! :airplane:

If the container is the box of pancake mix, the cabinet and then house are the computational environment and hardware, respectively, this is one level deeper - the computational observation is actually everything that goes into creation of the pancake mix (the algorithm to grind the flour, the kitchen it was done in, the amounts, etc.). I think this is important too, and probably should be represented independently even from the small hierarchy we are discussing. For example, you can easily have a component of a ComputationalObservation without any Hardware or a Container. For an algorithm, well couldn't that even be in my head? :thinking:

Anyhoo, have a safe trip and let's talk about all of the above when you have some time! I want to also look more closely at how schema.org is generating the final specifications, because given the need for json-ld and similar, I'm now not totally happy with just having yaml. But I don't think there is any rush for this development because, as @rvguha pointed out, the important first thing to do is have discussion (and we can do that right here without any special tools :) )

HughP commented 6 years ago

@vsoch

I'm just getting caught up on this thread, and I was reading:

Thing > ComputationalEnvironment > SoftwareHost

  • runsOn Hardware
  • supports Virtualization
  • has OperatingSystem

And that sounds a bit like some of the features in the DOAP ontology. Are you aware of that ontology, and is it useful for what you are doing here? I'm suggesting that it might be, and there is already some use of it on PyPI and several other software repositories. Here is the GitHub link for the project: https://github.com/ewilderj/doap . Some years ago there was a paper about DOAP presented at the DCMI meeting: DCMI-Tools: Ontologies for Digital Application Description.

vsoch commented 6 years ago

Thanks @HughP.

To all in the discussion - to be clear, my goals right now aren’t to delve into describing software, repos, or projects; some of the extended discussion above was just a suggestion for how additional (more detailed) descriptions of software or projects could fit into what I’m thinking of. My primary goal is to describe the levels needed up to having a container. Indeed, in containers there is software that might be described by these additional ontologies, and down the line I would definitely enjoy helping add these to schema.org, but right now they are out of scope for this discussion.

thadguidry commented 6 years ago

@vsoch The world moves quickly in regards to Container standardization (and around metadata). I don't want to fill up a book in this comment to you... but I'll suggest a few other resources for further study, which I did myself last weekend: https://docs.ansible.com/ansible-container/ and https://galaxy.ansible.com/docs/contributing/creating_role.html#role-metadata

For us in Ericsson, we are fully embracing Ansible Container along with OCI image-spec. For us in Ericsson, the sharing of metadata of Containers will be handled in Container registries themselves. If search engines want/need to crawl public Container registries then that's fine. But my personal opinion upon quite a bit of reflection over the past weekend, is that I think OCI should be the primary movers in this space, rather than Schema.org. And you should contribute to that effort directly HERE -> https://github.com/opencontainers/image-spec/blob/master/schema/content-descriptor.json

vsoch commented 6 years ago

I definitely agree that OCI is leading in the space, and this is why I am mirroring the specs. I can't comment on Ericsson; I don't know anything about that company other than maybe it's a phone company?

I am definitely in support of the idea to develop where the community is moving and thriving. The missing component of going directly to that effort THERE is that (and please correct me if I'm wrong) there is no connection between that initiative and plugging into a super-power search engine like Google, for having containers indexed by Google Datasets. You can make the greatest of standards, but when push comes to shove, if you don't set it up with the right plumbing, it's not going to be useful to the graduate student sitting in a dorm room trying to do an efficient search for something. My understanding is that there is no direct feed from the work of OCI into such a global tool, beyond a small strategy of having individual registries provide the result of the effort through their individual APIs. Is this incorrect?

I'm rather marginal / not opinionated about which standard to embrace, so I don't have opinion on "the standard that is best" but I do want to choose the deployment infrastructure that can best and most efficiently distribute the search. That (in my mind) looks more like Google than Docker Hub or similar. This is the reason I've taken this route - schema.org is the gateway to that amazing resource.

vsoch commented 6 years ago

Anyway - @charlesvardeman when you get back, I started with a very basic template for Hardware --> https://openschemas.github.io/spec-hardware/. This is the skeleton level of representation that I would want to have for Hardware, and Virtualization, and then I can develop container (and others with expertise in the previously mentioned would work on those).

The idea is that you can easily make changes, open a PR to test, and then merge will update the web interface. Then you can ask others for review discussion, and "publication" at https://openschemas.github.io/specifications is just moving a file and then another pull request. And everyone here - this is where I'd want to ask for advice, help, etc. on converting my front matter (yml) to json-ld and "whatever format is needed for schemaorg submission." But as it was pointed out, we should have discussion about the specifications here before that.

Back to @charlesvardeman (and others interested!): my understanding is that schema.org can embrace other ontology definitions, so if you would be interested to contribute to these templates, what I have done so far is download the tsv files from the Google Sheet mentioned in the README (though there isn't any reason you can't open them, tab-separated, on your computer). Right now, there isn't anything special there - it's just inheriting properties of "Thing." You can make changes to your heart's content, then PR and bring in others for discussion.

Anyway - I'm still looking forward to feedback on the above. In the meantime I'll start skeleton templates for the hierarchy I defined above, and can try to figure out how to make json-ld if nobody has a script or example.
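On the yaml-to-json-ld question, a minimal sketch of the kind of conversion I have in mind follows. To stay dependency-free, the parsed front matter is given as a Python dict here (in practice you would load the YAML with something like PyYAML); the field names (`name`, `description`, `parent_type`) are assumptions about the template layout, not an established openschemas format.

```python
import json

def frontmatter_to_jsonld(meta):
    """Turn a specification's front matter fields into a JSON-LD stub.

    The field names ('name', 'description', 'parent_type') are assumed;
    the output shape loosely follows how schema.org terms are described
    with rdfs:label / rdfs:comment / rdfs:subClassOf.
    """
    return {
        "@context": "https://schema.org",
        "@id": "schema:" + meta.get("name", "Thing"),
        "rdfs:label": meta.get("name"),
        "rdfs:comment": meta.get("description"),
        "rdfs:subClassOf": {"@id": "schema:" + meta.get("parent_type", "Thing")},
    }

meta = {
    "name": "Hardware",
    "description": "Physical computing hardware.",
    "parent_type": "Thing",
}
print(json.dumps(frontmatter_to_jsonld(meta), indent=2))
```

If something like this looks reasonable, the same loop could walk every spec's front matter and emit one combined JSON-LD graph for submission.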

vsoch commented 6 years ago

and I want to ask @rvguha how can the work that @thadguidry linked at opencontainers be linked to Google Dataset search? If we can do that, and the development / community is more active there, it might be best to pursue that connection instead.

vsoch commented 6 years ago

And @thadguidry I think we need to figure out how to work together. The expertise for containers indeed comes from opencontainers, but the expertise for everything else that might be modeled (e.g., look at the list of things in schema.org --> https://schema.org/docs/full.html) comes from there.

We must be able to have the work that is being done by opencontainers represented in that larger graph, otherwise it's a limited view of a small domain with no understanding of how it fits into a big picture.

ptsefton commented 6 years ago

I am wondering if Hardware and Container are too broad- this is likely to overlap with more general uses of the terms. Maybe ComputationalContainer and ComputationalHardware?


thadguidry commented 6 years ago

@vsoch If data is not in Container Images...then I don't see why Google Datasets would bother indexing them in some fashion? But if Google Datasets is open to the idea that Datasets could be found in ANY format or package as so happens A LOT in Science domains (including Container Images as a package format)... then that would be @rvguha to talk to...not me or Ericsson my employer. I can only speak to what technologies we at Ericsson internally use to see what data/software might be lurking inside Container Images. You are on the right track Vanessa to solving your discoverability problems.

NOTE: Careful with just the casual use of the term "Container" (which are ephemeral) versus the more appropriate term for your use case of "Container Image".

vsoch commented 6 years ago

Definitely something I am aware of! Albeit there is a lot more I'm not aware of; I'm doing my best :) I'll let @rvguha chime in on how the two worlds can work together - it would be perfect if the experts in a domain could easily "plug in" their work to the larger graph.