linkml / prefixmaps

Semantic prefix map registry
https://linkml.io/prefixmaps/
Apache License 2.0

bioregistry changes have broken the merged prefix map #70

Open cmungall opened 2 months ago

cmungall commented 2 months ago

The merged prefix map now uses uppercase RDF rather than the canonical rdf.

This is likely because the merged prefix map assumes that the correct canonical prefix for biological databases is the uppercase form (bioregistry inherited the incorrect decision to downcase prefixes for major databases). merged therefore brings in bioregistry.upper, which uses the uppercase form unless there is a preferred prefix (e.g. FlyBase). Note also that OBO is prioritized, giving double protection against incorrect prefixes like go and fbbt.

However, at some point bioregistry went from being a registry of biological databases to also including prefixes like rdf. There is no preferred form recorded for rdf, and there is no way to tell these kinds of prefixes from database prefixes. We also don't include rdf in our curated linked_data prefixmap, so it just slips through as uppercase...
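To make the failure mode concrete, here is a minimal sketch of a priority-ordered merge in which the first source to claim a prefix (compared case-insensitively) wins. This is not the actual prefixmaps implementation; the function name, the source ordering, and the toy namespaces (including the example.org URL) are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# A source is an ordered list of (prefix, namespace) pairs.
PrefixSource = List[Tuple[str, str]]


def merge_by_priority(sources: List[PrefixSource]) -> Dict[str, str]:
    """Merge sources in priority order; the first claim on a prefix wins.

    Prefixes are compared case-insensitively, so an earlier "rdf"
    blocks a later "RDF" (hypothetical behaviour for this sketch).
    """
    merged: Dict[str, str] = {}
    claimed = set()
    for source in sources:
        for prefix, namespace in source:
            key = prefix.lower()
            if key in claimed:
                continue  # a higher-priority source already claimed it
            claimed.add(key)
            merged[prefix] = namespace
    return merged


RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Toy stand-ins for the real contexts.
obo = [("GO", "http://purl.obolibrary.org/obo/GO_")]
bioregistry_upper = [("GO", "http://example.org/go/"), ("RDF", RDF_NS)]
linked_data_without_rdf = [("owl", "http://www.w3.org/2002/07/owl#")]
linked_data_with_rdf = linked_data_without_rdf + [("rdf", RDF_NS)]

# No curated rdf entry: the uppercase form from bioregistry.upper slips through.
before = merge_by_priority([obo, linked_data_without_rdf, bioregistry_upper])

# With a curated rdf entry: the lowercase form wins.
after = merge_by_priority([obo, linked_data_with_rdf, bioregistry_upper])
```

In this sketch, `before` contains the key RDF, while `after` contains rdf, which is the effect of adding the prefix to the curated linked_data map.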

https://github.com/biopragmatics/bioregistry/issues/1090

bgyori commented 2 months ago

In addition to the discussion on the Bioregistry issue, I wonder if the problem that "there is no way to tell these kinds of prefixes from database prefixes" could be fixed by filtering out prefixes in this collection https://bioregistry.io/collection/0000002 (semantic web context) for downstream tasks where this is not desirable.

cmungall commented 2 months ago

I think that makes sense. For now I'm going to just hardcode some more semweb prefixes in linked_data.curated.yaml to be extra defensive.

Do you know offhand how to tell if a Record belongs to a collection?

bgyori commented 2 months ago

Assuming bioregistry is available as a Python dependency, you can access the content of this collection as:

>>> import bioregistry
>>> semweb_prefixes = set(bioregistry.get_collection('0000002').resources)
>>> 'rdf' in semweb_prefixes
True
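Once that set is available, the filtering step suggested earlier in the thread could look roughly like the sketch below. The helper name `drop_semweb_prefixes` and the toy inputs are hypothetical; the real set would come from the bioregistry call shown above.

```python
from typing import Dict, Iterable


def drop_semweb_prefixes(
    prefix_map: Dict[str, str], semweb_prefixes: Iterable[str]
) -> Dict[str, str]:
    """Return a copy of prefix_map without semantic-web prefixes.

    Matching is case-insensitive, so an uppercased "RDF" is removed
    when "rdf" is in the collection.
    """
    semweb_lower = {p.lower() for p in semweb_prefixes}
    return {
        prefix: namespace
        for prefix, namespace in prefix_map.items()
        if prefix.lower() not in semweb_lower
    }


# Stand-in for set(bioregistry.get_collection('0000002').resources).
semweb_prefixes = {"rdf", "rdfs", "owl"}

# Toy uppercased map, standing in for bioregistry.upper.
upper_map = {
    "RDF": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}

filtered = drop_semweb_prefixes(upper_map, semweb_prefixes)
```

Here `filtered` keeps GO and drops RDF, leaving the semantic-web prefixes to be supplied by a curated source such as linked_data.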

If you don't have the Python package available in that environment, then loading the collections directly as a JSON data structure is also possible: https://raw.githubusercontent.com/biopragmatics/bioregistry/main/src/bioregistry/data/collections.json
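A rough sketch of the JSON route, using only the standard library. It assumes the file has a top-level "collections" list whose entries carry "identifier" and "resources" keys; that layout should be checked against the file itself. The function names and the inline sample are hypothetical.

```python
import json
from typing import Any, Dict, Set
from urllib.request import urlopen

COLLECTIONS_URL = (
    "https://raw.githubusercontent.com/biopragmatics/bioregistry/"
    "main/src/bioregistry/data/collections.json"
)


def fetch_collections(url: str = COLLECTIONS_URL) -> Dict[str, Any]:
    """Download and parse the collections file (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


def collection_resources(data: Dict[str, Any], identifier: str) -> Set[str]:
    """Return the prefixes listed in the collection with the given identifier.

    Assumes data looks like {"collections": [{"identifier": ..., "resources": [...]}]}.
    """
    for collection in data.get("collections", []):
        if collection.get("identifier") == identifier:
            return set(collection.get("resources", []))
    raise KeyError(f"no collection with identifier {identifier!r}")


# Tiny inline sample in the assumed layout, so the lookup can be tried offline.
sample = {
    "collections": [
        {"identifier": "0000001", "resources": ["exampledb"]},
        {"identifier": "0000002", "resources": ["rdf", "rdfs", "owl"]},
    ]
}

semweb_from_sample = collection_resources(sample, "0000002")
```

With network access, `collection_resources(fetch_collections(), "0000002")` would be the equivalent of the bioregistry call above, provided the assumed layout holds.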