Closed (bschilder closed this issue 11 months ago)
Currently, the diseases in the HPOA database should all have either OMIM, Orphanet, or DECIPHER identifiers. Note that the entries of the HPO such as HP:0001281 do not refer to disease entities (I am not sure if I understand the above). Could you give some examples of what is not working?
Currently, the diseases in the HPOA database should all have either OMIM, Orphanet, or DECIPHER identifiers.
Correct, but I'm struggling to find a comprehensive resource that maps these onto definitions of each disease (beyond just the ID and name).
Note that the entries of the HPO such as HP:0001281 do not refer to disease entities (I am not sure if I understand the above). Could you give some examples of what is not working?
This was just an example of what I mean when I say definition. The code is only meant as an analogous example using phenotypes instead of diseases.
In addition, I've just noticed that there are 3,031 phenotypes that seem to be missing definitions in the latest HPO OBO object.
> hpo = HPOExplorer::get_hpo()
> sum(is.na(hpo$def))
[1] 3031
> hpo$def[is.na(hpo$def)]
I've saved them to a CSV attached here: missing_hpo_ids.csv
I see -- we have been working on this and have been steadily adding and improving definitions, but we have found so far at least that there is no accurate way to automate this. We welcome suggestions for definitions (ideally with a reference to a PMID and, if possible, synonyms)! But code should not at present assume that every term has a definition.
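For anyone writing code against the ontology, here is a minimal sketch of what "not assuming every term has a definition" could look like. It is written in Python for illustration, and the function names and the two plain dictionaries (term ID to name, term ID to definition) are hypothetical stand-ins rather than any HPO or HPOExplorer API.

```python
from typing import List, Mapping, Optional


def describe_term(
    term_id: str,
    names: Mapping[str, str],
    definitions: Mapping[str, Optional[str]],
) -> str:
    """Return the definition if one exists, otherwise fall back to the name."""
    definition = definitions.get(term_id)
    if definition is None or not definition.strip():
        # No usable definition: fall back to the term name (or the bare ID).
        name = names.get(term_id, term_id)
        return f"{name} (no definition available)"
    return definition.strip()


def missing_definitions(
    names: Mapping[str, str],
    definitions: Mapping[str, Optional[str]],
) -> List[str]:
    """List term IDs whose definition is absent, None, or blank."""
    return sorted(t for t in names if not (definitions.get(t) or "").strip())
```

The second helper is the counterpart of the `sum(is.na(hpo$def))` check above: it reports which terms lack a definition so they can be handled or suggested upstream.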
Cool, so I think you're referring specifically to the HPO phenotype definitions. Good to know this is expected.
Could you also comment on the original question on Disease definitions?
Disease definitions are not in the scope of the HPO. However, in effect, the HPOA annotations are computational definitions of 8500 diseases -- have you seen the phenotype.hpoa file? (https://hpo.jax.org/app/data/annotations) You may want to consult Mondo (http://www.ebi.ac.uk/ols4/ontologies/mondo). For genetic diseases, https://omim.org/ provides comprehensive information that in effect is a very detailed definition of each Mendelian disease; similar things could be said of https://orpha.net.
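To make the "computational definition" idea concrete, here is a rough Python sketch that collects the set of HPO terms annotated to each disease in phenotype.hpoa. It assumes the file is tab-separated, that metadata lines start with `#`, and that the header row uses the column names `database_id`, `qualifier` and `hpo_id`; please check these against the release you download, since older releases may differ. The function name is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Set


def phenotype_profiles(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each disease ID to the set of HPO term IDs annotated to it."""
    # Skip metadata lines; assumes the header row itself does not start with '#'.
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    profiles: Dict[str, Set[str]] = defaultdict(set)
    for row in reader:
        # Rows qualified with NOT record phenotypes that are explicitly absent.
        if (row.get("qualifier") or "").strip().upper() == "NOT":
            continue
        profiles[row["database_id"]].add(row["hpo_id"])
    return dict(profiles)
```

Calling `phenotype_profiles(open("phenotype.hpoa", encoding="utf-8"))` would then give one phenotype set per disease ID, for example the set stored under `"OMIM:154700"`.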
However, in effect, the HPOA annotations are computational definitions of 8500 diseases -- have you seen the phenotype.hpoa file? (https://hpo.jax.org/app/data/annotations)
Yes, this is one of the files I was referencing in my original post. To clarify, I am making a distinction between "name" and "definition". Let's use OMIM:154700 as an example:
"Marfan syndrome"
A disorder of the connective tissue. Connective tissue provides strength and flexibility to structures throughout the body such as bones, ligaments, muscles, walls of blood vessels, and heart valves. Marfan syndrome affects most organs and tissues, especially the skeleton, lungs, eyes, heart, and the large blood vessel that distributes blood from the heart to the rest of the body (the aorta). It is caused by mutations in the FBN1 gene, which provides instructions for making a protein called fibrillin-1. Marfan syndrome is inherited in an autosomal dominant pattern. At least 25% of cases are due to a new (de novo) mutation. Treatment is based on the signs and symptoms in each person.
This is what I was trying to get across with the analogous phenotype example above. Currently, the phenotype.hpoa only provides a disease "name".
> annot <- HPOExplorer::load_phenotype_to_genes(3)
Reading cached RDS file: phenotype.hpoa
+ Version: v2023-10-09
> str(annot)
Disease definitions are not in the scope of the HPO.
Ok, but in this case the info in question is already displayed on HPO: https://hpo.jax.org/app/browse/disease/OMIM:154700
Presumably, to create the website HPO must have this compiled definition data stored somewhere. I am asking for access to this data, or to simply include it in the phenotype.hpoa file (it is not currently).
You may want to consult Mondo (http://www.ebi.ac.uk/ols4/ontologies/mondo). For genetic diseases, https://omim.org/ provides comprehensive information that in effect is a very detailed definition of each Mendelian disease; similar things could be said of https://orpha.net.
I've tried doing exactly this for each database, but with limited success due to differing file formats, data access restrictions, and incomplete ID mappings.
I see. @matentzn and @iimpulse -- I think we are getting the definitions from Mondo. Can we supply Brian with the correct API call or other info to reproduce this?
This is the call we use on the HPO page to get the disease definition. In the near future (~2 months) we will ingest the Mondo data ourselves and provide that in the API. Hope this helps.
@bschilder MedGen may be able to help with your disease definitions issue and cross-referencing terms by ID. Our FTP https://ftp.ncbi.nlm.nih.gov/pub/medgen/ files include a file of definitions collected by MedGen- MGDEF.RRF.gz
File: MGDEF.RRF.gz
Summary data for definitions and sources of concepts.
- CUI: concept unique identifier
- DEF: concept definition. Please see NOTE below
- source: source of the definition
- SUPPRESS: suppressed by UMLS curators (no reason is reported)
NOTE: Some values in the DEF column contain internal line feeds. The line separator for RRF files is '|\n'. The line separator within the DEF column of MGDEF.RRF is '\r' (CR, carriage return, 0x0D, 13 in decimal). Unix/Linux and Windows tools sometimes behave differently on these formats. If this format is problematic for you, consider using the comma-separated value (csv) files in the csv subdirectory (https://ftp.ncbi.nlm.nih.gov/pub/medgen/csv/).
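As a hedged illustration of the separator issue described in the NOTE, the Python sketch below splits records on '|\n' and only afterwards deals with the '\r' characters inside DEF. It assumes the four columns appear in the order listed above (CUI, DEF, source, SUPPRESS) and that the header line starts with `#`; the function names are hypothetical.

```python
import gzip
from typing import Dict, List


def parse_mgdef(text: str) -> List[Dict[str, str]]:
    """Split MGDEF-style text into records using '|\\n' as the record separator."""
    records = []
    for raw in text.split("|\n"):
        if not raw.strip() or raw.startswith("#"):
            continue  # skip the header line and the empty tail after the last record
        cui, rest = raw.split("|", 1)
        # Split the last two fields from the right so a stray '|' in DEF survives.
        definition, source, suppress = rest.rsplit("|", 2)
        records.append({
            "CUI": cui,
            "DEF": " ".join(definition.split("\r")),  # flatten internal CR breaks
            "source": source,
            "SUPPRESS": suppress,
        })
    return records


def read_mgdef(path: str) -> List[Dict[str, str]]:
    """Read a gzipped file; newline='' stops Python from translating '\\r'."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return parse_mgdef(handle.read())
```

Passing `newline=""` matters here: with the default universal-newline handling, each '\r' inside DEF would be converted to '\n', and line-based readers would then see a single definition as several lines.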
We have a separate report that links the CUIs to the various database identifiers that are mapped in MedGen: "MedGenIDMappings.txt.gz"
File: MedGenIDMappings.txt.gz
Summary data for MedGen's assigned CUI, preferred name and other source database identifiers. MedGen has many data sources and aligns these ontologies and vocabularies to represent unified disease concepts. The various identifiers are mapped to a CUI and MedGen's preferred name for the concept.
- CUI: concept unique identifier
- Preferred name: MedGen's preferred name
- extrn_id: the identifier from the source
- extrn_src: abbreviated name for the source
Sources reported: -SNOMED CT -MeSH -Human Phenotype Ontology (HPO) -Mondo -Online Mendelian Inheritance in Man (OMIM) -OMIM disease records -OMIM phenotypes from specific alleles -OMIM phenotypic series -OMIM included diseases -Orphanet -MedGen UID
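Putting the two files together, a disease ID from an HPO annotation file could be resolved to definitions roughly as follows. This Python sketch assumes MedGenIDMappings.txt is pipe-delimited with the columns in the order listed above (CUI, preferred name, extrn_id, extrn_src) and a header starting with `#`. Whether extrn_id carries a prefix such as `OMIM:` is not stated here, so the lookup tries both forms. All function names are hypothetical.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def parse_id_mappings(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Build a (extrn_src, extrn_id) -> CUI lookup from mapping lines."""
    lookup: Dict[Tuple[str, str], str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split("|")
        if len(fields) < 4:
            continue  # ignore malformed rows
        cui, _name, extrn_id, extrn_src = fields[:4]
        lookup[(extrn_src, extrn_id)] = cui
    return lookup


def definitions_for(
    disease_id: str,
    lookup: Mapping[Tuple[str, str], str],
    defs_by_cui: Mapping[str, List[str]],
) -> List[str]:
    """Resolve an ID such as 'OMIM:154700' to its definitions, if any."""
    src, _, local_id = disease_id.partition(":")
    # Assumption: the identifier may be stored bare or with its prefix.
    cui = lookup.get((src, local_id)) or lookup.get((src, disease_id))
    return list(defs_by_cui.get(cui, [])) if cui else []
```

Here `defs_by_cui` is a plain dictionary from CUI to a list of definition strings, which could be built from MGDEF.RRF. An empty list means either the ID is unmapped or MedGen holds no definition for that concept, so code using this should still tolerate missing definitions.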
Amazing, thanks so much for the detailed reply @kanems! I've had a look through this data and found it quite helpful for ID mapping across ontologies.
@iimpulse this is also super helpful. I hadn't realized the website was rendering dynamically by pulling resources via the Monarch API, very cool!
And thank you, I think this will be an excellent resource to have in one easily used file! I'll continue integrating this into HPOExplorer to make sure they're as in-sync as possible.
Thanks, everybody, I will close this for now but please open a new issue as needed!
Hello!,
I can't seem to find any resources that comprehensively map the Disease IDs in the HPO-provided annotation files ("disease_id", or sometimes "Database ID") to:
I've made several attempts to aggregate this data from non-HPO resources, but so far my missing rate for gathering Disease definitions is still >90%.
Does HPO keep an internal record of all of the Disease IDs mapped onto Disease definitions? If so, could this resource be distributed:
Many thanks in advance, Brian