linkml / prefixmaps

Semantic prefix map registry
https://linkml.io/prefixmaps/
Apache License 2.0

Add Bioportal prefixes #3

Closed: caufieldjh closed this issue 1 year ago

caufieldjh commented 1 year ago

These prefixes are curated from BioPortal ontologies. CURIE prefixes correspond to BioPortal entries (e.g., NCBITAXON).

cthoyt commented 1 year ago

Where does this come from?

caufieldjh commented 1 year ago

I extracted these from BioPortal database dumps. Because many of these prefixes only show up in the originally uploaded OWL, they aren't strictly "canonical" in the sense of being primary IRI prefixes, but each is used in at least one BioPortal ontology.

cthoyt commented 1 year ago

Can you give more insight into how you extracted them?

caufieldjh commented 1 year ago

Sure! The BioPortal backend stores all ontologies in an aging but still functional 4store RDF DB (though this will change quite soon). I have the dump of this DB as of July 20, 2022. For each ontology, the dump contains the full set of triples for its most recent submission, as RDF N-Triples.
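For readers who want a feel for what such a dump offers, here is a minimal sketch of counting candidate namespaces in N-Triples lines. It is not the pipeline described in this thread: the function names are hypothetical, the sample triples are illustrative, and the regex is a shortcut that would also pick up angle-bracketed text inside literals, so a real run should use a proper N-Triples parser.

```python
import re
from collections import Counter

# Matches IRIs in N-Triples angle-bracket form, e.g. <http://example.org/a>.
# Assumption: good enough for a sketch; it does not distinguish IRIs from
# angle-bracketed text that happens to appear inside a literal.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def namespace_of(iri):
    """Return the IRI up to and including its last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def count_namespaces(lines):
    """Count how often each candidate namespace appears across the lines."""
    counts = Counter()
    for line in lines:
        if line.lstrip().startswith("#"):  # skip N-Triples comment lines
            continue
        for iri in IRI_PATTERN.findall(line):
            counts[namespace_of(iri)] += 1
    return counts


# Illustrative sample, not taken from the actual dump.
sample = [
    '<http://purl.bioontology.org/ontology/NCBITAXON/9606> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Homo sapiens" .',
    '<http://purl.bioontology.org/ontology/NCBITAXON/10090> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Mus musculus" .',
]
namespace_counts = count_namespaces(sample)
for namespace, n in namespace_counts.most_common():
    print(n, namespace)
```

Frequently occurring namespaces are then candidates for a prefix, subject to the manual curation mentioned later in this comment.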

From there, I use Bioportal-to-KGX to transform all ontologies to KGX TSV nodes/edges, converting node IDs to CURIEs wherever possible. I then check for any remaining IRIs with this script.
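The general idea of "convert to CURIEs where possible, then look at what is left" can be sketched as follows. This is an illustration only, not the script linked above or the Bioportal-to-KGX code; `contract`, `split_converted`, and the small prefix map are hypothetical names and values chosen for the example.

```python
def contract(iri, prefix_map):
    """Return a CURIE for iri using the longest matching namespace, else None."""
    best_prefix, best_namespace = None, ""
    for prefix, namespace in prefix_map.items():
        if iri.startswith(namespace) and len(namespace) > len(best_namespace):
            best_prefix, best_namespace = prefix, namespace
    if best_prefix is None:
        return None
    return f"{best_prefix}:{iri[len(best_namespace):]}"


def split_converted(iris, prefix_map):
    """Return (curies, remaining), where remaining holds IRIs with no known prefix."""
    curies, remaining = [], []
    for iri in iris:
        curie = contract(iri, prefix_map)
        if curie is None:
            remaining.append(iri)
        else:
            curies.append(curie)
    return curies, remaining


# Hypothetical prefix map and node IDs, for illustration only.
prefix_map = {
    "NCBITAXON": "http://purl.bioontology.org/ontology/NCBITAXON/",
    "RO": "http://purl.obolibrary.org/obo/RO_",
}
node_ids = [
    "http://purl.bioontology.org/ontology/NCBITAXON/9606",
    "http://purl.obolibrary.org/obo/RO_0002162",
    "http://example.org/unknown/123",
]
curies, remaining = split_converted(node_ids, prefix_map)
print(curies)
print(remaining)
```

Anything that ends up in `remaining` points to a namespace that still needs a prefix.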

So there are three caveats re: getting a full set of prefixes from BioPortal:

  1. I've been doing this for QC purposes on transforms, so many of the prefixes are already used to convert node IDs to CURIEs. This behavior can be disabled through the CLI.
  2. BioPortal IRIs are messy - some are just the autogenerated Protégé IDs. I've been handling this with manual curation for now.
  3. I'm assigning CURIEs based on their BP IDs, with some exceptions (OBOREL is still mapped back to RO for OBO compatibility).
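Caveats 2 and 3 could be expressed roughly like the sketch below. The names `CURIE_OVERRIDES`, `prefix_for`, and `looks_autogenerated` are hypothetical, the override table contains only the one exception mentioned here, and the "untitled-ontology-N" pattern is an assumption about what autogenerated Protégé IRIs tend to look like, not a rule taken from the actual curation.

```python
import re

# Exceptions to "use the BioPortal ID as the CURIE prefix".
# Only OBOREL -> RO is mentioned in this thread; any other entry would be a guess.
CURIE_OVERRIDES = {"OBOREL": "RO"}

# Assumption: autogenerated Protege namespaces often contain "untitled-ontology-<n>".
AUTOGENERATED_PATTERN = re.compile(r"untitled-ontology-\d+")


def prefix_for(bioportal_id, overrides=CURIE_OVERRIDES):
    """Return the CURIE prefix for a BioPortal ID, applying any override."""
    return overrides.get(bioportal_id, bioportal_id)


def looks_autogenerated(iri):
    """Flag IRIs that probably need manual curation rather than a prefix."""
    return AUTOGENERATED_PATTERN.search(iri) is not None


print(prefix_for("NCBITAXON"))
print(prefix_for("OBOREL"))
print(looks_autogenerated(
    "http://www.semanticweb.org/user/ontologies/2015/3/untitled-ontology-12#Thing"
))
```

Flagged IRIs would still go through manual review; the check only narrows down where to look.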

For purposes of automating ETL, many (perhaps all?) IRIs may be retrieved through the BP API, though that's not what I'm currently doing.

cmungall commented 1 year ago

See also #4