Deconvolute non-standard prefixes

cthoyt commented 2 years ago

The following prefixes show up in various places in UBERON but they are not in the Bioregistry, based on the OQUAT analysis in https://biopragmatics.github.io/oquat/unknowns/source/uberon and https://biopragmatics.github.io/oquat/invalids/source/uberon:

prefix	count	example_node	example_val
OBOL	3401	http://purl.obolibrary.org/obo/UBERON_0000031	OBOL:automatic
GAID	814	http://purl.obolibrary.org/obo/UBERON_0000002	GAID:376
PHENOSCAPE	281	http://purl.obolibrary.org/obo/UBERON_4200008	PHENOSCAPE:wd
BM	246	http://purl.obolibrary.org/obo/UBERON_0000007	BM:Die-Hy-HY
FBC	121	http://purl.obolibrary.org/obo/UBERON_0000122	FBC:DOS
UBERONTEMP	105	http://purl.obolibrary.org/obo/UBERON_0016929	UBERONTEMP:0ea3066e-0c22-417b-8ac4-91c2aacba792
GOC	70	http://purl.obolibrary.org/obo/UBERON_0000017	GOC:GO
ABA	62	http://purl.obolibrary.org/obo/UBERON_0000955	ABA:Brain
UBERONREF	45	http://purl.obolibrary.org/obo/UBERON_0000075	UBERONREF:0000003
MURDOCH	38	http://purl.obolibrary.org/obo/UBERON_0011472	MURDOCH:2183
WikipediaVersioned	31	http://purl.obolibrary.org/obo/UBERON_8410000	WikipediaVersioned:Duodenojejunal_flexure&oldid=937307798
BSA	26	http://purl.obolibrary.org/obo/UBERON_0000020	BSA:0000121
FEED	20	http://purl.obolibrary.org/obo/UBERON_0001572	FEED:rd
Dorlands_Medical_Dictionary	16	http://purl.obolibrary.org/obo/UBERON_0000313	Dorlands_Medical_Dictionary:MerckSource
ANISEED	13	http://purl.obolibrary.org/obo/UBERON_0000160	ANISEED:1235303
OGES	13	http://purl.obolibrary.org/obo/UBERON_0000068	OGES:000022
NominaAnatomicaVeterinaria	12	http://purl.obolibrary.org/obo/UBERON_0001451	NominaAnatomicaVeterinaria:2005
LG	11	http://purl.obolibrary.org/obo/UBERON_0004889	LG:0012616
OldNeuroNames	9	http://purl.obolibrary.org/obo/UBERON_0002575	OldNeuroNames:-1761421113
BILS	9	http://purl.obolibrary.org/obo/UBERON_0000105	BILS:0000105
BilaDO	9	http://purl.obolibrary.org/obo/UBERON_0000066	BilaDO:0000004
BRAINSPAN	8	http://purl.obolibrary.org/obo/UBERON_0014736	BRAINSPAN:BRAINSPAN
NIFSTD_RETIRED	8	http://purl.obolibrary.org/obo/UBERON_0000966	NIFSTD_RETIRED:birnlex_1156
Geisha	7	http://purl.obolibrary.org/obo/UBERON_0003052	Geisha:syn
WikipediaCategory	7	http://purl.obolibrary.org/obo/UBERON_0000474	WikipediaCategory:Female_reproductive_system
XtroDO	7	http://purl.obolibrary.org/obo/UBERON_0000066	XtroDO:0000084
Bgee	5	http://purl.obolibrary.org/obo/UBERON_0018241	Bgee:AN
XB	5	http://purl.obolibrary.org/obo/UBERON_0003056	XB:curator
NeuroNamesCNID	5	http://purl.obolibrary.org/obo/UBERON_0015510	NeuroNamesCNID:177
BrainInfo	4	http://purl.obolibrary.org/obo/UBERON_8440010	BrainInfo:2102
NIF	4	http://purl.obolibrary.org/obo/UBERON_0009630	NIF:NIF
DHB	3	http://purl.obolibrary.org/obo/UBERON_0002739	DHB:MD
J	3	http://purl.obolibrary.org/obo/UBERON_0002233	J:77634
PhenoscapeRCN	3	http://purl.obolibrary.org/obo/UBERON_0012260	PhenoscapeRCN:Oct2012
CUMBO	2	http://purl.obolibrary.org/obo/UBERON_0001020	CUMBO:CUMBO
INCF	2	http://purl.obolibrary.org/obo/UBERON_0001880	INCF:Seattle_mtg_2010
MorphoBank	2	http://purl.obolibrary.org/obo/UBERON_0013614	MorphoBank:177
NominaAnatomica	2	http://purl.obolibrary.org/obo/UBERON_0010356	NominaAnatomica:NA
Obol	2	http://purl.obolibrary.org/obo/UBERON_0003281	Obol:obol
PAPUB	2	http://purl.obolibrary.org/obo/UBERON_2001162	PAPUB:0000142
Phenoscape	2	http://purl.obolibrary.org/obo/UBERON_4000164	Phenoscape:PM
Swanson	2	http://purl.obolibrary.org/obo/UBERON_0001893	Swanson:2004
NIF_Organism	2	http://purl.obolibrary.org/obo/UBERON_0007221	NIF_Organism:birnlex_695
NOID	2	http://purl.obolibrary.org/obo/UBERON_0018367	NOID:1
OGEM	2	http://purl.obolibrary.org/obo/UBERON_0000307	OGEM:000006
BioMart	1	http://purl.obolibrary.org/obo/UBERON_0000363	BioMart:BioMart
CHECKME	1	http://purl.obolibrary.org/obo/UBERON_0003997	CHECKME:CHECKME
Giesha	1	http://purl.obolibrary.org/obo/UBERON_0005421	Giesha:syn
Hymans	1	http://purl.obolibrary.org/obo/UBERON_0010260	Hymans:Hymans
MTB	1	http://purl.obolibrary.org/obo/UBERON_0002145	MTB:379
AOO	1	http://purl.obolibrary.org/obo/UBERON_3000406	AOO:LAP
ASD	1	http://purl.obolibrary.org/obo/UBERON_3010449	ASD:BJB
Fast_Health_Medical_Dictionary	1	http://purl.obolibrary.org/obo/UBERON_0008230	Fast_Health_Medical_Dictionary:http://www.fasthealth.com/dictionary/
NCBI	1	http://purl.obolibrary.org/obo/UBERON_0001471	NCBI:matt
OMD	1	http://purl.obolibrary.org/obo/UBERON_0003075	OMD:neural+plate
PATOC	1	http://purl.obolibrary.org/obo/UBERON_0005160	PATOC:MAH
PLB	1	http://purl.obolibrary.org/obo/UBERON_0013730	PLB:plb
Renal_Physiology	1	http://purl.obolibrary.org/obo/UBERON_0008404	Renal_Physiology:Section_7
WA	1	http://purl.obolibrary.org/obo/UBERON_0003049	WA:dh
Wiktionary	1	http://purl.obolibrary.org/obo/UBERON_7500117	Wiktionary:opisthocranion
bgee	1	http://purl.obolibrary.org/obo/UBERON_0036219	bgee:ANN
ref	1	http://purl.obolibrary.org/obo/UBERON_0004870	ref:Stedmans
DrerDO	1	http://purl.obolibrary.org/obo/UBERON_0004707	DrerDO:0000052
MAP	1	http://purl.obolibrary.org/obo/UBERON_0001155	MAP:0000001
TA2	1	http://purl.obolibrary.org/obo/UBERON_8410000	TA2:2952
Talairach	1	http://purl.obolibrary.org/obo/UBERON_0035933	Talairach:1047

Generated by the following code:

from tabulate import tabulate
from collections import Counter

import requests

def main():
    url = "https://raw.githubusercontent.com/biopragmatics/oquat/main/results/uberon.json"
    data = requests.get(url).json()

    counter = Counter()
    examples = {}
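    # Use a separate loop variable so the top-level `data` is not shadowed
    for result in data["results"].values():
        for key in ["synonym_pack", "prov_pack", "xref_pack"]:
            for prefix, uri_to_value_dict in result[key]["unknown_prefixes"].items():
                counter[prefix] += len(uri_to_value_dict)
                # Keep one (node, value) pair per prefix as an example
                examples[prefix] = list(uri_to_value_dict.items())[0]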

    rows = [(prefix, count, *examples[prefix]) for prefix, count in counter.most_common()]

    print(
        tabulate(
            rows, headers=["prefix", "count", "example_node", "example_val"], tablefmt="github"
        )
    )

if __name__ == "__main__":
    main()

Any help figuring out what these are and how they're used would be appreciated!

patrick-lloyd-ray commented 2 years ago

I think I might know some of these, as they look like Allen Institute-related prefixes:

DHBA = developing human brain atlas (https://github.com/obophenotype/uberon/blob/master/source-ontologies/allen-dhba.obo)
HBA = human brain atlas (https://github.com/obophenotype/uberon/blob/master/source-ontologies/allen-hba.obo)
DMBA = developing mouse brain atlas (https://github.com/obophenotype/uberon/blob/master/source-ontologies/allen-dmba.obo)
MBA = mouse brain atlas (https://github.com/obophenotype/uberon/blob/master/source-ontologies/allen-mba.obo)
ABA = Allen Brain Atlas -- this is the name for all of the brain atlases at Allen (https://en.wikipedia.org/wiki/Allen_Brain_Atlas), see also https://github.com/obophenotype/ABA_Uberon
PBA = primate brain atlas (non-human) (https://github.com/obophenotype/uberon/blob/master/source-ontologies/allen-pba.obo)

I don't know the history of their usage very well (these prefixes pre-date my working at the Allen Institute), but I know they were used in some mapping projects and we still have a need for them. I think I also recognize these two:

NLX = neurolex
nlx_subcell = neurolex subcellular structure

I don't know how these were used, but @tgbugs would probably know.

cthoyt commented 2 years ago

@patrick-lloyd-ray that's an excellent start, thanks so much! For anyone who might be able to provide more context: I'm also looking for web references describing what these things are and, if possible, links to a list of the terms in each or to ontology files, if they exist.

tgbugs commented 2 years ago

Here are the mappings that I have.

@prefix MBA: <http://api.brain-map.org/api/v2/data/Structure/> .
@prefix HBA: <http://api.brain-map.org/api/v2/data/Structure/> .
@prefix DHBA: <http://api.brain-map.org/api/v2/data/Structure/> .
@prefix DMBA: <http://api.brain-map.org/api/v2/data/Structure/> .
@prefix NLX: <http://uri.neuinfo.org/nif/nifstd/nlx_> .
@prefix NLXSUB: <http://uri.neuinfo.org/nif/nifstd/nlx_subcell_> .

ABA refers to the very first version of the terminology, which was modelled in OWL using subClassOf instead of partOf (https://bioportal.bioontology.org/ontologies/ABA-AMB). I think the mapping is

@prefix ABA: <http://mouse.brain-map.org/atlas/index.html#> .
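
As a minimal sketch of how these mappings would be applied, the following Python expands a CURIE against the prefix map listed above. The helper name `expand_curie` and the local identifiers in the usage lines are illustrative assumptions, not part of any existing tool.

```python
# Prefix map copied from the @prefix declarations above.
PREFIX_MAP = {
    "MBA": "http://api.brain-map.org/api/v2/data/Structure/",
    "HBA": "http://api.brain-map.org/api/v2/data/Structure/",
    "DHBA": "http://api.brain-map.org/api/v2/data/Structure/",
    "DMBA": "http://api.brain-map.org/api/v2/data/Structure/",
    "NLX": "http://uri.neuinfo.org/nif/nifstd/nlx_",
    "NLXSUB": "http://uri.neuinfo.org/nif/nifstd/nlx_subcell_",
    "ABA": "http://mouse.brain-map.org/atlas/index.html#",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Return the URI for a CURIE, or None if the prefix is not in the map.

    Hypothetical helper for illustration only.
    """
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + local_id


# Illustrative local identifiers (assumptions, not taken from UBERON).
mba_uri = expand_curie("MBA:997")
nlxsub_uri = expand_curie("NLXSUB:100315")
unknown = expand_curie("GAID:376")  # GAID is not in the map above
```

Note that MBA, HBA, DHBA and DMBA all share one URI prefix in this map, so a URI cannot be compressed back to a unique CURIE without extra information.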

cmungall commented 2 years ago

Thanks @tgbugs and @patrick-lloyd-ray, this is correct.

I think we can replace all ABA xrefs and use MBA instead

As far as I know we don't have official prefixes for the allen atlases yet, @dosumis? As soon as we get these we should add to bioregistry. But until then these xrefs are vital.

cmungall commented 2 years ago

@tgbugs we of course wouldn't have NLXSUB in uberon but GO uses xrefs like NIF_Subcellular:nlx_subcell_100315 - if NLXSUB is your preferred prefix let's register it and use it in GO!

shawntanzk commented 2 years ago

Plan:

1) Start with creating a table

prefix	source

2) remove those we don't want

3) register those that stay in with bioregistry (or, where one already exists, use the bioregistry prefix)
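
One way to seed the table in step 1 is to first collapse prefixes that differ only in letter case, since the list in the issue description contains pairs such as OBOL/Obol, PHENOSCAPE/Phenoscape and Bgee/bgee. A minimal sketch, assuming a hypothetical helper `group_case_variants` and using counts copied from that list:

```python
from collections import defaultdict


def group_case_variants(counts):
    """Group prefixes that differ only by letter case.

    Hypothetical helper: returns {lowercased prefix: {variant: count}}
    for every prefix that appears in more than one spelling.
    """
    groups = defaultdict(dict)
    for prefix, count in counts.items():
        groups[prefix.lower()][prefix] = count
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Counts copied from the table in the issue description.
counts = {
    "OBOL": 3401,
    "Obol": 2,
    "PHENOSCAPE": 281,
    "Phenoscape": 2,
    "Bgee": 5,
    "bgee": 1,
    "GAID": 814,
}

variants = group_case_variants(counts)
for key, spellings in variants.items():
    print(key, spellings)
```

This only catches case differences; misspellings such as Geisha/Giesha would still need a manual pass.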

cthoyt commented 2 years ago

I added several Allen Brain Atlas prefixes suggested by @patrick-lloyd-ray in https://github.com/biopragmatics/bioregistry/commit/1cfd0ff4788940974d20548228c2c877d0d7df55 (though not ABA since Chris said these should be upgraded)

They will also have a collection at https://bioregistry.io/collection/0000005 that will go live with the nightly update tonight

cthoyt commented 2 years ago

> @tgbugs we of course wouldn't have NLXSUB in uberon but GO uses xrefs like NIF_Subcellular:nlx_subcell_100315 - if NLXSUB is your preferred prefix let's register it and use it in GO!

I am very worried about stuff like this because of the amount of redundant prefix usage. Why isn't this just NIF_Subcellular:00315?

shawntanzk commented 2 years ago

Thanks heaps @cthoyt - sorry I haven't managed to get around to doing this, really appreciate the help! :)

dosumis commented 2 years ago

> As far as I know we don't have official prefixes for the allen atlases yet, @dosumis? As soon as we get these we should add to bioregistry. But until then these xrefs are vital.

I guess that means trying to request obolibrary status for the ontologised versions of ABA structuregraphs. As these are unlikely to fulfill the QC required (e.g. they will never have text defs), isn't this unlikely?

cmungall commented 2 years ago

I think we can add to bioregistry independently (I think the URLs resolve..?)

but ideally they could be regularly deposited on something like OLS/BP/Ontobee too

cthoyt commented 2 years ago

@cmungall yes already done

patrick-lloyd-ray commented 2 years ago

> I guess that means trying to request obolibrary status for the ontologised versions of ABA structuregraphs. As these are unlikely to fulfill the QC required (e.g. they will never have text defs), isn't this unlikely?

I'm open to getting these to meet QC and in obolibrary, if there is community interest.

matentzn commented 2 years ago

You can secure a prefix in bioregistry w/o being an ontology and/or in OBO!

tgbugs commented 2 years ago

> I am very worried about stuff like this because of the amount of redundant prefix usage. Why isn't this just NIF_Subcellular:00315?

Because the expansion is completely different.

@prefix NIFSUB: <http://ontology.neuinfo.org/NIF/BiomaterialEntities/NIF-Subcellular.owl#> .

The NIF_Subcellular prefix expands to ancient fragment-based identifiers that cannot be resolved by the server (they make bad assumptions about the design of the document and of the system that hosts the ontology ids) and which redirect via a bit of javascript to a proper resolver.
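
To make the difference concrete, here is a small sketch that expands the same term both ways. It assumes that the GO-style xref `NIF_Subcellular:nlx_subcell_100315` uses the NIFSUB expansion given above and that the equivalent NLXSUB local identifier is `100315`; both are assumptions made for illustration.

```python
# URI prefixes copied from the @prefix declarations in this thread.
NIFSUB_BASE = "http://ontology.neuinfo.org/NIF/BiomaterialEntities/NIF-Subcellular.owl#"
NLXSUB_BASE = "http://uri.neuinfo.org/nif/nifstd/nlx_subcell_"


def expand(base, local_id):
    """Concatenate a URI prefix and a local identifier."""
    return base + local_id


# Old style: the local id carries the full "nlx_subcell_..." string and
# lands in a URI fragment (after "#"), which is never sent to the server.
old_style = expand(NIFSUB_BASE, "nlx_subcell_100315")

# New style: "nlx_subcell_" is part of the URI prefix, so the local id is
# just the number and the whole identifier is in the URI path.
new_style = expand(NLXSUB_BASE, "100315")

print(old_style)
print(new_style)
```

Because the fragment part of a URI stays on the client, the old-style identifier can only be resolved with client-side logic, which matches the javascript redirect described above.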

github-actions[bot] commented 1 year ago

This issue has not seen any activity in the past 6 months; it will be closed automatically one year from now if no action is taken.

cthoyt commented 1 year ago

Note that the ontology quality assessment toolkit site is now auto-generated weekly. The most up-to-date version for UBERON is at https://biopragmatics.github.io/oquat/unknowns/source/uberon

matentzn commented 1 year ago

@anitacaron my advice: the next time you do a push on Uberon, just drop all the references that oquat lists with a count of 5 or fewer. This will clean up the situation significantly. UBERONREF is silly as well.
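
A minimal sketch of that cut, assuming the per-prefix counts are available as a `Counter` like the one built by the script in the issue description. The helper name `split_by_count` is hypothetical, and the counts are a small sample copied from the table:

```python
from collections import Counter


def split_by_count(counter, threshold=5):
    """Split prefixes into those above the threshold and those at or below it.

    Hypothetical helper: prefixes with count <= threshold are candidates
    for removal.
    """
    keep = {prefix: n for prefix, n in counter.items() if n > threshold}
    drop = {prefix: n for prefix, n in counter.items() if n <= threshold}
    return keep, drop


# Sample of counts copied from the table in the issue description.
counts = Counter(
    {"OBOL": 3401, "GAID": 814, "Geisha": 7, "Bgee": 5, "BrainInfo": 4, "TA2": 1}
)

keep, drop = split_by_count(counts)
print("keep:", sorted(keep))
print("drop:", sorted(drop))
```

A count-based cut is blunt: it says nothing about whether a rarely used prefix is meaningful, so the drop list is best treated as a review queue.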

anitacaron commented 9 months ago

@cthoyt can we please get an updated version of the table in the description?

cthoyt commented 9 months ago

@anitacaron yes, I updated the OQUAT website, added the code that generates the table, and updated the table at the top of the issue. FYI, the latest available JSON version of the ontology is from the end of October.

Ref:

from tabulate import tabulate
from collections import Counter

import requests

def main():
    url = "https://raw.githubusercontent.com/biopragmatics/oquat/main/results/uberon.json"
    data = requests.get(url).json()

    counter = Counter()
    examples = {}
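    # Use a separate loop variable so the top-level `data` is not shadowed
    for result in data["results"].values():
        for key in ["synonym_pack", "prov_pack", "xref_pack"]:
            for prefix, uri_to_value_dict in result[key]["unknown_prefixes"].items():
                counter[prefix] += len(uri_to_value_dict)
                # Keep one (node, value) pair per prefix as an example
                examples[prefix] = list(uri_to_value_dict.items())[0]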

    rows = [(prefix, count, *examples[prefix]) for prefix, count in counter.most_common()]

    print(
        tabulate(
            rows, headers=["prefix", "count", "example_node", "example_val"], tablefmt="github"
        )
    )

if __name__ == "__main__":
    main()

obophenotype / uberon

Deconvolute non-standard prefixes #2205