Open cmungall opened 2 months ago
In addition to the discussion on the Bioregistry issue, I wonder if the problem that "there is no way to tell these kinds of prefixes from database prefixes" could be fixed by filtering out prefixes in this collection https://bioregistry.io/collection/0000002 (semantic web context) for downstream tasks where this is not desirable.
I think that makes sense. For now I'm going to just hardcode some more semweb prefixes in `linked_data.curated.yaml` to be extra defensive.
Do you know offhand how to tell if a `Record` belongs to a collection?
Assuming `bioregistry` is available as a Python dependency, you can access the content of this collection as:
>>> import bioregistry
>>> semweb_prefixes = set(bioregistry.get_collection('0000002').resources)
>>> 'rdf' in semweb_prefixes
True
If you don't have the Python package available in that environment, then loading the collections directly as a JSON data structure is also possible: https://raw.githubusercontent.com/biopragmatics/bioregistry/main/src/bioregistry/data/collections.json
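A minimal sketch of the downstream filtering suggested above, assuming the prefix map is a plain dict from prefix to URI prefix. The function name `drop_collection_prefixes`, the sample prefix map, and its URI prefixes are hypothetical and only illustrative. In practice the excluded set would come from `bioregistry.get_collection('0000002').resources`. Matching is case-insensitive so that an uppercased `RDF` is caught as well.

```python
from typing import Dict, Iterable


def drop_collection_prefixes(
    prefix_map: Dict[str, str], excluded: Iterable[str]
) -> Dict[str, str]:
    """Return a copy of prefix_map without the excluded prefixes (case-insensitive)."""
    excluded_folded = {prefix.casefold() for prefix in excluded}
    return {
        prefix: uri_prefix
        for prefix, uri_prefix in prefix_map.items()
        if prefix.casefold() not in excluded_folded
    }


# Hypothetical stand-in for set(bioregistry.get_collection('0000002').resources)
semweb_prefixes = {"rdf", "rdfs", "owl"}

# Hypothetical merged prefix map in which `rdf` slipped through as uppercase;
# the URI prefixes are illustrative only.
merged = {
    "RDF": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "FlyBase": "http://flybase.org/reports/",
}

database_only = drop_collection_prefixes(merged, semweb_prefixes)
print(sorted(database_only))
```

Returning a new dict rather than mutating the input keeps the full map available for tasks where the semantic web prefixes are wanted.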
The merged prefix map now uses uppercase `RDF` rather than the canonical `rdf`.
This is likely because the merged prefix map is built on the assumption that the correct canonical prefix for biological databases is the uppercase form (bioregistry inherited the incorrect decision to downcase prefixes for major databases). `merged` therefore brings in `bioregistry.upper`. This uses the uppercase form unless there is a preferred prefix (e.g. FlyBase). Note also that OBO is prioritized, giving double protection against incorrect prefixes like `go` and `fbbt`.
However, at some point bioregistry went from being a bio registry to also including prefixes like `rdf`. There is no preferred `rdf` form for this, and there is no way to tell these kinds of prefixes from database prefixes. We also don't include `rdf` in our curated linked_data prefixmap, so this just slips through as uppercase... https://github.com/biopragmatics/bioregistry/issues/1090
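To make the failure mode concrete, the casing rule described above can be sketched roughly as follows. This is not the actual prefixmaps or bioregistry code: `choose_prefix_form`, `preferred`, and `semweb` are hypothetical names, and the sample data is illustrative. The optional `semweb` argument shows one way collection membership could keep `rdf` lowercase.

```python
from typing import Dict, FrozenSet


def choose_prefix_form(
    prefix: str,
    preferred: Dict[str, str],
    semweb: FrozenSet[str] = frozenset(),
) -> str:
    """Pick the output form of a prefix under an 'uppercase unless preferred' rule."""
    canonical = prefix.lower()
    # A curated preferred form (e.g. FlyBase) always wins.
    if canonical in preferred:
        return preferred[canonical]
    # Hypothetical extension: leave semantic web prefixes in lowercase.
    if canonical in semweb:
        return canonical
    # Default assumption: database prefixes are uppercased.
    return canonical.upper()


# Illustrative data, not taken from the registry itself.
preferred = {"flybase": "FlyBase"}
semweb = frozenset({"rdf", "rdfs", "owl"})

without_collection = [choose_prefix_form(p, preferred) for p in ("go", "flybase", "rdf")]
with_collection = [choose_prefix_form(p, preferred, semweb) for p in ("go", "flybase", "rdf")]
print(without_collection)
print(with_collection)
```

Without the `semweb` set, `rdf` falls through to the uppercase default, which matches the behavior reported in this issue.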