sfb1451 / crc-schema-draft

https://sfb1451.github.io/crc-schema-draft/

Case study: NCBITaxon #7

Open mslw opened 9 months ago

mslw commented 9 months ago

This issue is about controlled dictionaries and about creating tabby input enriched by ontology lookup.

Current state

The current sfb tabby requires sample[organism] to be expressed as an ID in the NCBI organismal taxonomy, formatted as, e.g., NCBITaxon:9606.

This can be translated (by string substitution) to http://purl.obolibrary.org/obo/NCBITaxon_9606, which can be looked up (also via an API), e.g. in OLS. The entry for NCBITaxon:9606 yields, among other things, a label (i.e. the Latin name) and an exact synonym (the GenBank common name, i.e. the English name).
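The string substitution described above can be sketched as follows. This is only an illustration, not part of datalad-tabby or the schema: the function names `curie_to_iri` and `iri_to_curie` are hypothetical, and the sketch assumes that only the `NCBITaxon:` prefix with a purely numeric ID needs to be handled.

```python
import re

# Base of the OBO PURLs, as used in the example IRI above.
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"

# Assumption: a valid identifier is the prefix "NCBITaxon:" followed by digits only.
_CURIE_PATTERN = re.compile(r"NCBITaxon:(\d+)")
_IRI_PATTERN = re.compile(re.escape(OBO_PURL_BASE) + r"NCBITaxon_(\d+)")


def curie_to_iri(curie: str) -> str:
    """Turn e.g. 'NCBITaxon:9606' into the corresponding OBO PURL."""
    match = _CURIE_PATTERN.fullmatch(curie)
    if match is None:
        raise ValueError(f"Not an NCBITaxon identifier: {curie!r}")
    return f"{OBO_PURL_BASE}NCBITaxon_{match.group(1)}"


def iri_to_curie(iri: str) -> str:
    """Inverse of curie_to_iri (hypothetical helper for round-tripping)."""
    match = _IRI_PATTERN.fullmatch(iri)
    if match is None:
        raise ValueError(f"Not an NCBITaxon PURL: {iri!r}")
    return f"NCBITaxon:{match.group(1)}"


if __name__ == "__main__":
    print(curie_to_iri("NCBITaxon:9606"))
```

Note that this step is purely syntactic: it needs no network access, and it does not check whether the ID actually exists in the taxonomy.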

For feeding this information to the catalog, I like using the openMINDS controlled term for Species, because it has fields such as name (required), preferredOntologyIdentifier, and synonym. These map nicely onto Latin name, IRI, and English name, and make it easy to create a catalog template for displaying this information.

Consequently, the dataset attribute is currently modelled as follows (note that using Species as the attribute IRI is probably not a good idea): https://github.com/sfb1451/crc-schema-draft/blob/3a233f58fa57f7593b81781761363cf38ad8f9d0/src/sfb1451_schema.yaml#L89-L101

There is currently no range (or string pattern) defined, and there is no custom Species object definition.

Note: the same applies to sample[organismPart] / openminds:UBERONParcellation.

Questions

Thoughts

The problem is that a datalad-tabby convention could convert NCBITaxon:1234 into a full IRI, but it couldn't (and probably shouldn't) perform an ontology lookup. This ties into the question of which stage of our processing we are modelling. Without a lookup, the convention cannot produce a valid openMINDS Species object (the required name would be missing).

I really like an openMINDS-like representation for feeding data based on a controlled dictionary into the catalog.

I am tempted to define my own Species object, in which only the IRI / preferred ontology identifier would be required and all other fields would be optional. These fields could then be filled in during preparation for the catalog, and our schema would sort of live in the middle of the tabby-to-catalog process.
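To make the idea of a custom Species object more concrete, here is a minimal Python sketch. It is not the schema definition itself (that would live in the schema YAML), and everything in it is an assumption for illustration: the class layout, the snake_case field names (loosely mirroring the openMINDS name, preferredOntologyIdentifier, and synonym), and the helpers `is_catalog_ready` and `enrich`.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Species:
    """Hypothetical relaxed Species object: only the identifier is required."""

    preferred_ontology_identifier: str  # full IRI, available right after tabby conversion
    name: Optional[str] = None          # Latin name, only known after an ontology lookup
    synonym: Tuple[str, ...] = ()       # e.g. English common name(s), also from the lookup

    def is_catalog_ready(self) -> bool:
        # openMINDS requires a name, so an object without one is only "mid-process".
        return self.name is not None


def enrich(species: Species, label: str, synonyms: Iterable[str] = ()) -> Species:
    """Return a copy with the optional fields filled in from a lookup result.

    The lookup itself (e.g. against OLS) is deliberately left out of this sketch.
    """
    return replace(species, name=label, synonym=tuple(synonyms))


if __name__ == "__main__":
    # Stage 1: what a tabby convention could produce without any lookup.
    bare = Species("http://purl.obolibrary.org/obo/NCBITaxon_9606")
    # Stage 2: the same object after enrichment during catalog preparation.
    full = enrich(bare, "Homo sapiens", ["human"])
    print(bare.is_catalog_ready(), full.is_catalog_ready())
```

The object stays valid at both stages: the bare form validates against the relaxed definition, and the enriched form additionally satisfies the openMINDS requirement of having a name.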

mslw commented 9 months ago

A snapshot from a whiteboard discussion, to refresh our memories:

(Image attachment: PXL_20231213_143224428)