monarch-initiative / mondo

Mondo Disease Ontology
http://obofoundry.org/ontology/mondo
Creative Commons Attribution 4.0 International

Medgen fails currently because of a small number of wrong ids #2514

Open matentzn opened 3 years ago

matentzn commented 3 years ago
grep 'id: UMLS:.*[^A-Z0-9].*' sources/medgen/medgen.obo > medgen_ids.txt

Results in ids like:

id: UMLS:(D-Ala(2))-deltorphin-I
id: UMLS:(DL)-3,7-dihydro-1,8-dimethyl-3-(2-methylbutyl)-1H-purine-2,6-dione
id: UMLS:(VIP-neurotensin) hybrid antagonist
id: UMLS:1-ethyl-2-((1,4-dimethyl-2-phenyl-6-pyrimidinylidene)methyl)quinolinium chloride
id: UMLS:1-ethynylpyrene
id: UMLS:1-hexacosanol

There are about 320K correct ids in Medgen and about 2000 such cases. Not sure where the fault lies, maybe with us?
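
To get exact counts instead of "about 320K" and "about 2000", the grep above could be mirrored in a few lines of Python. This is only a sketch: it assumes a well-formed UMLS concept id is `C` followed by seven digits, and the helper name `classify_umls_ids` is made up for illustration. The sample lines are copied from this thread.

```python
import re

# Assumption: a well-formed UMLS concept id (CUI) is "C" followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")
ID_PREFIX = "id: UMLS:"


def classify_umls_ids(lines):
    """Split 'id: UMLS:...' lines into well-formed and malformed local ids.

    `classify_umls_ids` is a hypothetical helper, not part of the Mondo build.
    """
    good, bad = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith(ID_PREFIX):
            continue  # only look at term id lines, like the grep does
        local_id = line[len(ID_PREFIX):]
        if CUI_PATTERN.fullmatch(local_id):
            good.append(local_id)
        else:
            bad.append(local_id)
    return good, bad


# Sample lines taken from the output quoted in this issue.
sample = [
    "[Term]",
    "id: UMLS:C0080483",
    "name: 1-hexacosanol",
    "[Term]",
    "id: UMLS:1-hexacosanol",
    "id: UMLS:(VIP-neurotensin) hybrid antagonist",
]
good, bad = classify_umls_ids(sample)
print(len(good), "well-formed;", len(bad), "malformed")
# prints: 1 well-formed; 2 malformed
```

On the real file, passing an open file handle for sources/medgen/medgen.obo in place of `sample` should work the same way, since the function only iterates over lines.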

cmungall commented 3 years ago

Hmm. Taking one of these at random: 1-hexacosanol.

With my older slurp from the FTP site, I get:

[Term]
id: UMLS:C0080483
name: 1-hexacosanol
xref: MEDGEN:42607
xref: MESH:C051942
subset: Organic-Chemical
subset: Pharmacologic-Substance
synonym: "1-hexacosanol" RELATED [MSH:C051942]
synonym: "n-hexacosanol" RELATED [MSH:C051942]
synonym: "hexacosyl alcohol" RELATED [MSH:C051942]
relationship: RB UMLS:C1563649 {source="MSH"} ! 1-hexacosanol, aluminum (1:3) salt

matentzn commented 3 years ago

Any strategy to debug? I have never touched the Medgen slurp code.

matentzn commented 3 years ago

I agree this term is there. But check this:

matentzn@mbp:~/ws/mondo/src/ontology (master) $ grep 'C0080483' sources/medgen/medgen.obo
xref: MEDGEN:C0080483
id: UMLS:C0080483
relationship: RN UMLS:C0080483 {source="MSH"} ! 1-hexacosanol
matentzn@mbp:~/ws/mondo/src/ontology (master) $ grep '1-hexacosanol' sources/medgen/medgen.obo
id: UMLS:1-hexacosanol
name: 1-hexacosanol
synonym: "1-hexacosanol" RELATED [MSH:C051942]
relationship: RB UMLS:C1563649 {source="MSH"} ! 1-hexacosanol, aluminum (1:3) salt
name: 1-hexacosanol, aluminum (1:3) salt
synonym: "1-hexacosanol, aluminum (1:3) salt" RELATED [MSH:C051942]
relationship: RN UMLS:C0080483 {source="MSH"} ! 1-hexacosanol

matentzn commented 3 years ago

Both seem to be there!
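
Since grep only shows the matching lines, one possible way to debug is to print the whole stanza around each hit and compare the two terms side by side. The sketch below is a rough illustration, not the Medgen slurp code: `iter_stanzas` and `stanzas_mentioning` are hypothetical helpers, and the sample's stanza boundaries are guessed for illustration, because the grep output in this thread does not show which lines belong to which term.

```python
def iter_stanzas(lines):
    """Yield each OBO stanza as a list of stripped, non-empty lines.

    Lines before the first '[...]' stanza header (the file header) are skipped.
    """
    stanza = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            if stanza:
                yield stanza
            stanza = [line]
        elif line and stanza:
            stanza.append(line)
    if stanza:
        yield stanza


def stanzas_mentioning(lines, needle):
    """Return every stanza in which at least one line contains `needle`."""
    return [s for s in iter_stanzas(lines) if any(needle in l for l in s)]


# Illustrative sample: the lines come from the grep output in this thread,
# but how they group into stanzas is an assumption.
sample = """\
[Term]
id: UMLS:C0080483
xref: MEDGEN:C0080483

[Term]
id: UMLS:1-hexacosanol
name: 1-hexacosanol
synonym: "1-hexacosanol" RELATED [MSH:C051942]

[Term]
id: UMLS:C1563649
name: 1-hexacosanol, aluminum (1:3) salt
relationship: RN UMLS:C0080483 {source="MSH"} ! 1-hexacosanol
""".splitlines()

hits = stanzas_mentioning(sample, "C0080483")
for stanza in hits:
    print("\n".join(stanza), end="\n\n")
```

Running the same function over the real medgen.obo with both `"id: UMLS:C0080483"` and `"id: UMLS:1-hexacosanol"` as the needle would show which fields ended up in which term.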

maglott commented 3 years ago

What is the source of the slurp from medgen? The docsums or files on the ftp site? Just wondering what NCBI can do to help.

cmungall commented 3 years ago

The FTP files.