linkml / prefixmaps

Semantic prefix map registry
https://linkml.io/prefixmaps/
Apache License 2.0
10 stars 3 forks source link

Contexts don't guarantee bi-directional mapping #11

Open cthoyt opened 1 year ago

cthoyt commented 1 year ago

In the BioPortal context file (https://github.com/linkml/prefixmaps/blob/main/src/prefixmaps/data/bioportal.csv#L199-L202), two prefixes share the same URI prefix. This seems to invalidate the claim on the README about bijectivity (https://github.com/linkml/prefixmaps/blob/main/README.md?plain=1#L13).

Maybe this is a curation oversight, which I bet @caufieldjh has already found, since I know he recently put a lot of effort into looking through this content. However, this isn't resolved when loading content through the package, which makes me think that the package should be more careful about checking the integrity of content. The following code illustrates:

from prefixmaps import load_context
context = load_context("bioportal")
for e in context.prefix_expansions:
    if e.prefix in {"INVERSEROLES", "ISO-15926-2_2003"}:
         print(e)

# Output:
# PrefixExpansion(context='bioportal', prefix='INVERSEROLES', namespace='http://rds.posccaesar.org/2008/02/OWL/ISO-15926-2_2003#', status=<StatusType.canonical: 'canonical'>)
# PrefixExpansion(context='bioportal', prefix='ISO-15926-2_2003', namespace='http://rds.posccaesar.org/2008/02/OWL/ISO-15926-2_2003#', status=<StatusType.canonical: 'canonical'>)

This means the assumptions in Context.as_dict() are also incorrect, since this naively iterates through the expansions and picks out the prefix/URI prefix (namespace) pairs.

I'm pretty stumped trying to understand the data structure used in this package, it seems like a lot of things that could be grouped are not. Have you considered using a JSON structure?

For example, the curies package has a lot of overlap in terms of needing to represent a group of related prefixes and URI prefixes while denoting which is the "canonical" prefix and "canonical" URI prefix. This data structure is described in the curies documentation and a more full example with the whole Bioregistry can be found here.

Background: I'm currently trying to implement a more principled import of a Context object from this package into a curies.Converter in https://github.com/cthoyt/curies/pull/22 and am stuck since there's no way to decide which of these two canonical records should be the actual canonical record.

To Do

caufieldjh commented 1 year ago

two prefixes share the same URI prefix

I'm going to chalk that up to curation error - one ontology is a variation on the other, so the easy solution would just be to assign ISO-15926-2_2003 as the canonical prefix. I'll update accordingly (and probably catch some other issues in the process).

But yes, that means the assumption is that the input maps are bijective, without explicit validation.

cthoyt commented 1 year ago

Great, thanks. I would say besides just implementing this fix, this issue requires adding some automated tests for all content that can be run on CI and will stop new data from being added that doesn't pass these requirements

caufieldjh commented 1 year ago

Fixes to those maps are in d58f19980449e1768b160c71fcf69eb4ebaac491

cthoyt commented 1 year ago

Following adding tests in #15, this issue can be closed if there’s branch protection on the main branch and a rule that tests have to pass

cmungall commented 1 year ago

@caufieldjh curated content should really go in a curated.yaml, the csvs should be entirely autogenerated. In fact we would eventually like to cede the curated maps upstream to bioregistry, but it is useful to be use a placeholder here for now.

cthoyt commented 1 year ago

I added #26, which is also a blocker for closing this issue.

caufieldjh commented 1 year ago

@cmungall do you mean something along the same lines of https://raw.githubusercontent.com/geneontology/go-site/master/metadata/db-xrefs.yaml ?