schemaorg / suggestions-questions-brainstorming

Suggestions, questions, and brainstorming

Scientific Workflow Provenance #110

Open vsoch opened 5 years ago

vsoch commented 5 years ago

hey @schemaorg!

Thanks for your discussion so far in schemaorg/schemaorg#2059, we are making progress and I hope the meeting with @rvguha in January can solidify some of the work and review discussion with OCI and the toy examples I've created. This is a related issue that I'm working on in parallel - and it hits many more users (including those that don't use containers). The question at hand is:

How do we use schema.org to represent a scientific workflow?

In the context of containers, if we do have a representation of A Container in schema.org, we would simply define the relationships between them using some kind of Workflow model. Outside of containers, you can have other kinds of SoftwareApplication working together.

Does schema.org have a model?

From my basic search here, I can't find something that cleanly expresses what I'm thinking of. You could say different interactions are kinds of Actions, or that a workflow is (sort of) SoftwareSourceCode with some kind of Accessibility defined... but that really isn't right. I don't see a clear "scientific workflow" representation that would be desired, to go along with Dataset. Is there an initiative working on this?
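To make the "kinds of Actions" approximation concrete, here is a rough sketch (not a proposal) that builds JSON-LD in Python using only types and properties that already exist in schema.org (Action, SoftwareApplication, Dataset, `instrument`, `object`, `result`, `softwareVersion`). The tool names, versions, and `#...` identifiers are made up for illustration.

```python
import json


def step_action(name, tool, version, input_id, output_id):
    """Describe one workflow step as a generic schema.org Action.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "@type": "Action",
        "name": name,
        # The software that performs the step.
        "instrument": {
            "@type": "SoftwareApplication",
            "name": tool,
            "softwareVersion": version,
        },
        # What the step consumes and what it produces.
        "object": {"@type": "Dataset", "@id": input_id},
        "result": {"@type": "Dataset", "@id": output_id},
    }


steps = [
    step_action("align reads", "example-aligner", "1.0",
                "#raw-reads", "#aligned-reads"),
    step_action("call peaks", "example-peak-caller", "2.3",
                "#aligned-reads", "#peaks"),
]

doc = {"@context": "https://schema.org", "@graph": steps}
print(json.dumps(doc, indent=2))
```

The two steps are only connected because the `result` of the first shares an `@id` with the `object` of the second. Nothing in the markup says "this is a workflow" or gives the step order explicitly, which is the gap I'm describing.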

Does bioschemas have a model?

I next went back to bioschemas, and I thought I'd find simple concepts for workflows and experiments here. As far as I can tell, bioschemas is working on biological entities (akin to genes, proteins, etc.) and less so on something like an experiment, a workflow, a step, an input, etc.

What about ontology?

@satra pointed me to what the neuroimaging data model (NIDM) community is embracing to describe neuroimaging workflows: ProvONE, a nicely defined ontology. Has there been discussion about representing these models in schema.org?

My Use Case

The Encode Data Portal (Encode-DCC) I believe has on the order of hundreds of thousands (a million?) of datasets with complicated (many non-container-based) workflows that need to be described. Right now they have a fairly non-standardized description and need something better. Given the sheer number of labs / institutions with this need, the use of schema.org (again) to drive search, and the need to link Datasets with workflows, how can we best go about this?

Here are some links to share from master @satra!

I'll leave it at that to start our discussion! Thanks everyone.

vsoch commented 5 years ago

And quickly, some more background on the Encode-DCC use case:

These are the (non-standardized, or N=1 it's their own standard) definitions:

This does a good job to show the kinds of things we want to model. A huge chunk of them (the SoftwareApplication and SoftwareVersion) I know we already have in schema.org, awesome! It's the workflow bits (step, run, pipeline) that need to somehow be tied together.
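As a rough sketch of what "tied together" could mean, the pieces might relate like this. This is my own simplification in Python, not the Encode-DCC schema: every class and field name below is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+
from typing import List


@dataclass
class SoftwareVersion:
    software: str
    version: str


@dataclass
class Step:
    name: str
    software_versions: List[SoftwareVersion]
    # Names of steps that must finish before this one (hypothetical field).
    parents: List[str] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    steps: List[Step]

    def ordered_step_names(self) -> List[str]:
        """Return step names so that every step follows its parents."""
        graph = {step.name: set(step.parents) for step in self.steps}
        return list(TopologicalSorter(graph).static_order())


@dataclass
class Run:
    """One execution of a step, linking it to concrete files."""
    step: Step
    input_files: List[str]
    output_files: List[str]


align = Step("align", [SoftwareVersion("example-aligner", "1.0")])
peaks = Step("peaks", [SoftwareVersion("example-peak-caller", "2.3")],
             parents=["align"])
pipeline = Pipeline("example-pipeline", [peaks, align])

run = Run(align, ["reads.fastq"], ["aligned.bam"])
print(pipeline.ordered_step_names())
```

The software pieces map onto things schema.org already has; the `Step`, `Pipeline`, and `Run` relationships are the part with no obvious home.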

The other interesting thing (these are notes from Seth in my group) is that there are associated Quality Control definitions. This is obviously huge outside of just workflows, it's for any kind of Software or thing that can be tested.

And also from Seth, an example page and json that is rendered from the metadata above:

So while a group like ours could go and come up with some modified niche thing for our N=1 use case, we would really go a lot farther by engaging with others on coordinated efforts (already underway) that are birds of a feather. This is what I'd like to do for Encode-DCC, and I'd like this to have much broader impact than just one or two groups at Stanford, or even one domain. This is what I'd like to talk about.

vsoch commented 5 years ago

A little more background and things I'm learning

Research Objects

ResearchObjects is a related ontology that seems to be about components involved in publications, or more specifically, for workflows, the provenance of the files / folders they produce. For example, CWL uses it via "CWLProv" to do this. There is a workflow description that (not surprisingly!) also goes back to the ProvONE I mentioned earlier.

Workflow Focused Initiatives

Holy cow there are a lot! Looking forward to hearing your thoughts.

tetron commented 5 years ago

I think it depends on what you're trying to do. If you want something directly executable (for reproducibility) but still grounded in linked data and based on software containers, then you want something like Research Objects + CWL. If you want to answer higher level queries about what kinds of things a workflow does, you want semantic markup with domain-specific ontologies like EDAM. I don't know what the value is of workflow ontologies that are not specific enough to be directly executable.

mr-c commented 5 years ago

The Encode use cases may also be a good fit for BioCompute Objects which can interoperate with RO/CWLProv.

vsoch commented 5 years ago

@mr-c BioCompute looks fantastic! I'm especially happy to see a very active GitHub repo and community too :) I'll run this by my group and return with any questions. Thanks for leaving the issue open as we discuss! Here are some GitHub links for others interested:

RichardWallis commented 4 years ago

See issue #7 for the context of the move from the main Schema.org issue tracker to this repository.