linkml / prefixmaps

Semantic prefix map registry
https://linkml.io/prefixmaps/
Apache License 2.0

Request for propagating CURIE prefix synonyms #47

Closed cthoyt closed 8 months ago

cthoyt commented 8 months ago

Would it be possible to propagate the CURIE prefix synonyms from the various sources in the merged context? I had data that was annotated with synonyms, but I wasn't able to use the converter that came out of prefixmaps to handle them.

Here's an example to illustrate. The Bioregistry lists pubmed, PubMed, pmid, PMID, and MEDLINE as synonyms for PubMed, but only the uppercased PUBMED appears to work.

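import prefixmaps
import pandas as pd  # DataFrame.to_markdown also needs the tabulate package


def main():
    # Load the converter built from the merged context
    converter = prefixmaps.load_converter("merged")
    # Each of these uses a prefix the Bioregistry lists as a synonym for PubMed
    curies = ["pmid:1234", "PMID:1234", "pubmed:1234", "PubMed:1234", "PUBMED:1234", "MEDLINE:1234"]
    # standardize_curie returns None when the prefix is not recognized
    rows = [(curie, converter.standardize_curie(curie)) for curie in curies]
    df = pd.DataFrame(rows, columns=["raw", "standardized"])
    print(df.to_markdown(index=False))


if __name__ == "__main__":
    main()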

| raw         | standardized |
|:------------|:-------------|
| pmid:1234   |              |
| PMID:1234   |              |
| pubmed:1234 |              |
| PubMed:1234 |              |
| PUBMED:1234 | PUBMED:1234  |
| MEDLINE:1234 |             |

sierra-moxon commented 8 months ago

Thanks for the issue @cthoyt - yes, this is what I am working on now, in light of the reluctance to add a "preferred" prefix in bioregistry. I think using the synonyms will most often work just as well. :) I can imagine use cases where having synonyms available in the prefixmap causes problems downstream in our KGs (e.g. we need validation to prevent a node identified as PMID:123456 from duplicating a node identified as Pubmed:123456).

cthoyt commented 8 months ago

@sierra-moxon in this case, you can simply use Converter.standardize_curie, which already implements the logic for making sure you don't get multiple equivalent CURIEs based on the synonyms inside the converter.
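
To illustrate the idea, here is a minimal pure-Python sketch of how synonym-aware standardization collapses equivalent CURIEs onto one node identifier. It is a stand-in for the concept, not the actual implementation behind Converter.standardize_curie. The synonym table, the choice of PUBMED as the canonical prefix, and the helper names build_lookup and standardize_curie are all assumptions made for this example.

```python
# Hypothetical synonym table: canonical prefix -> known synonyms.
# In practice this information would come from the merged context.
PREFIX_SYNONYMS = {
    "PUBMED": ["pmid", "PMID", "pubmed", "PubMed", "MEDLINE"],
}


def build_lookup(synonyms):
    """Map every prefix (canonical or synonym) to its canonical prefix."""
    lookup = {}
    for canonical, alternatives in synonyms.items():
        lookup[canonical] = canonical
        for alternative in alternatives:
            lookup[alternative] = canonical
    return lookup


def standardize_curie(curie, lookup):
    """Rewrite a CURIE to use the canonical prefix, or return None if unknown."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in lookup:
        return None
    return f"{lookup[prefix]}:{local_id}"


lookup = build_lookup(PREFIX_SYNONYMS)

# Three spellings of the same identifier, as might appear on KG nodes.
node_ids = ["PMID:123456", "PubMed:123456", "pubmed:123456"]

# Standardizing before building the node set leaves a single entry.
unique_nodes = {standardize_curie(node_id, lookup) for node_id in node_ids}
print(unique_nodes)
```

Standardizing every identifier at ingest time means the deduplication falls out of an ordinary set or dictionary key, so no separate validation pass is needed for prefixes the table knows about.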