obophenotype / human-phenotype-ontology

Ontology for the description of human clinical features
http://obophenotype.github.io/human-phenotype-ontology/

Disease ID mappings #10232

Closed bschilder closed 11 months ago

bschilder commented 12 months ago

Hello!

I can't seem to find any resources that comprehensively map the Disease IDs in the HPO-provided annotation files ("disease_id", or sometimes "Database ID") to:

  1. Disease definitions: analogous to the phenotype definitions provided in the HPO OBO object, but for diseases.
> hpo = HPOExplorer::get_hpo()
> hpo$def["HP:0001281"]
HP:0001281
"\"A condition characterized by intermittent involuntary contraction of muscles (spasms) related to hypocalcemia or occasionally magnesium deficiency.\" []"
  2. Cross-database IDs: e.g. UPHENO / MONDO / MEDLINE

I've made several attempts to aggregate this data from non-HPO resources, but so far my missing rate for gathering Disease definition is still >90%.

Does HPO keep an internal record of all of the Disease IDs mapped onto Disease definitions? If so, could this resource be distributed?

Many thanks in advance, Brian

pnrobinson commented 12 months ago

Currently, the diseases in the HPOA database should all have either OMIM, Orphanet, or DECIPHER identifiers. Note that the entries of the HPO such as HP:0001281 do not refer to disease entities (I am not sure if I understand the above). Could you give some examples of what is not working?

bschilder commented 12 months ago

Currently, the diseases in the HPOA database should all have either OMIM, Orphanet, or DECIPHER identifiers.

Correct, but I'm struggling to find a comprehensive resource that maps these onto definitions of each disease (beyond just the ID and name).

Note that the entries of the HPO such as HP:0001281 do not refer to disease entities (I am not sure if I understand the above). Could you give some examples of what is not working?

This was just an example of what I mean when I say definition. The code is only meant as an analogous example using phenotypes instead of diseases.

bschilder commented 12 months ago

In addition, I've just noticed that there are 3,031 phenotypes that seem to be missing definitions in the latest HPO OBO object.

> hpo = HPOExplorer::get_hpo()
> sum(is.na(hpo$def))
[1] 3031
> hpo$def[is.na(hpo$def)]

I've saved them to a CSV attached here: missing_hpo_ids.csv

pnrobinson commented 12 months ago

I see -- we have been working on this and have been steadily adding and improving definitions, but we have found so far at least that there is no accurate way to automate this. We welcome suggestions for definitions (which we like to have a reference to a PMID and also to have synonyms if possible)! But code should not at present assume that every term has a definition.
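To illustrate the point that code should not assume every term has a definition, here is a minimal Python sketch that reads term stanzas from OBO-formatted text and records `None` when a `def:` line is absent. The sample text is hand-written for illustration: `HP:9999999` is a hypothetical ID, and the quote handling is a simplification that may not cover every OBO edge case.

```python
from typing import Dict, Optional

# Hand-written OBO excerpt. HP:9999999 is a hypothetical ID used only to
# show a term stanza with no "def:" line.
SAMPLE_OBO = """\
[Term]
id: HP:0001281
def: "A condition characterized by intermittent involuntary contraction of muscles (spasms) related to hypocalcemia or occasionally magnesium deficiency." []

[Term]
id: HP:9999999
"""


def parse_definitions(obo_text: str) -> Dict[str, Optional[str]]:
    """Map each term ID to its definition text, or None if it has none."""
    defs: Dict[str, Optional[str]] = {}
    current_id: Optional[str] = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest.
            in_term = line == "[Term]"
            current_id = None
        elif in_term and line.startswith("id: "):
            current_id = line[len("id: "):]
            defs[current_id] = None  # assume missing until a def: line appears
        elif in_term and current_id and line.startswith("def: "):
            body = line[len("def: "):]
            if body.startswith('"'):
                # Keep the quoted text and drop the trailing [xref] list.
                end = body.rfind('"')
                body = body[1:end] if end > 0 else body[1:]
            defs[current_id] = body
    return defs


defs = parse_definitions(SAMPLE_OBO)
missing = [term_id for term_id, text in defs.items() if text is None]
print(f"{len(missing)} of {len(defs)} terms lack a definition: {missing}")
```

Downstream code can then branch on `None` rather than failing when a definition is absent.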

bschilder commented 12 months ago

I see -- we have been working on this and have been steadily adding and improving definitions, but we have found so far at least that there is no accurate way to automate this. We welcome suggestions for definitions (which we like to have a reference to a PMID and also to have synonyms if possible)! But code should not at present assume that every term has a definition.

Cool, so I think you're referring specifically to the HPO phenotype definitions. Good to know this is expected.

Could you also comment on the original question on Disease definitions?

pnrobinson commented 12 months ago

Disease definitions are not in the scope of the HPO. However, in effect, the HPOA annotations are computational definitions of 8500 diseases -- have you seen the phenotype.hpoa file? (https://hpo.jax.org/app/data/annotations) You may want to consult Mondo (http://www.ebi.ac.uk/ols4/ontologies/mondo). For genetic diseases, https://omim.org/ provides comprehensive information that in effect is a very detailed definition of each Mendelian disease; similar things could be said of https://orpha.net.

bschilder commented 12 months ago

However, in effect, the HPOA annotations are computational definitions of 8500 diseases -- have you seen the phenotype.hpoa file? (https://hpo.jax.org/app/data/annotations)

Yes, this is one of the files I was referencing in my original post. To clarify, I am making a distinction between "name" and "definition". Let's use OMIM:154700 as an example:

This is what I was trying to get across with the analogous phenotype example above. Currently, the phenotype.hpoa only provides a disease "name".

> annot <- HPOExplorer::load_phenotype_to_genes(3)
Reading cached RDS file: phenotype.hpoa
+ Version: v2023-10-09
> str(annot)

[Image: Screenshot 2023-11-16 at 15 07 11]
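The distinction between a disease "name" and the set of annotated phenotypes can be sketched in Python as follows. This assumes phenotype.hpoa is tab-separated, has `#` comment lines before the header, and has columns named `database_id`, `disease_name` and `hpo_id`; check these against the downloaded file. The rows below are made up for illustration and are not copied from the real annotations.

```python
import csv
import io
from collections import defaultdict

# Illustrative excerpt only; the column names and rows are assumptions.
SAMPLE_HPOA = (
    "#description: illustrative excerpt, not real annotation data\n"
    "database_id\tdisease_name\tqualifier\thpo_id\n"
    "OMIM:154700\tMarfan syndrome\t\tHP:0001166\n"
    "OMIM:154700\tMarfan syndrome\t\tHP:0001083\n"
    "ORPHA:000000\tHypothetical disease\t\tHP:0001166\n"
)


def summarize_hpoa(text):
    """Return (disease ID -> name, disease ID -> set of HPO term IDs)."""
    # Drop '#' comment lines so the first remaining line is the header.
    data_lines = (ln for ln in io.StringIO(text) if not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    names = {}
    terms = defaultdict(set)
    for row in reader:
        names[row["database_id"]] = row["disease_name"]
        terms[row["database_id"]].add(row["hpo_id"])
    return names, dict(terms)


names, terms = summarize_hpoa(SAMPLE_HPOA)
for disease_id, name in names.items():
    print(disease_id, name, sorted(terms[disease_id]))
```

The file yields a name and a phenotype profile per disease ID, but no free-text definition, which is the gap being described here.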

Disease definitions are not in the scope of the HPO.

Ok, but in this case the info in question is already displayed on HPO: https://hpo.jax.org/app/browse/disease/OMIM:154700

Presumably, to create the website, HPO must have this compiled definition data stored somewhere. I am asking either for access to this data or for it to be included in the phenotype.hpoa file (it currently is not).

You may want to consult Mondo (http://www.ebi.ac.uk/ols4/ontologies/mondo). For genetic diseases, https://omim.org/ provides comprehensive information that in effect is a very detailed definition of each Mendelian disease; similar things could be said of https://orpha.net.

I've tried doing exactly this for each database, but with limited success due to differing file formats, data access restrictions, and incomplete ID mappings.

kanems commented 12 months ago

@bschilder MedGen may be able to help with your disease definitions issue and with cross-referencing terms by ID. Our FTP files (https://ftp.ncbi.nlm.nih.gov/pub/medgen/) include a file of definitions collected by MedGen: MGDEF.RRF.gz

File: MGDEF.RRF.gz

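Summary data for definitions and sources of concepts.

  • CUI: concept unique identifier
  • DEF: concept definition. Please see NOTE below
  • source: source of the definition
  • SUPPRESS: suppressed by UMLS curators (no reason is reported)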

NOTE: Some values in the DEF column contain internal line breaks. The line separator for RRF files is '|\n'. The line separator within the DEF column of MGDEF.RRF is CR (carriage return, '\r', 0x0D, 13 in decimal). Unix/Linux and Windows tools sometimes behave differently on these formats. If this format is problematic for you, consider using the comma-separated value (csv) files in the csv subdirectory. (https://ftp.ncbi.nlm.nih.gov/pub/medgen/csv/)
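The separator rules in the note can be handled along these lines in Python. This is a sketch under stated assumptions: the field order (CUI, DEF, source, SUPPRESS) follows the description of the file, the header line is assumed to start with `#`, and the sample records use made-up CUIs and text. Passing `newline=""` when opening the file keeps Python from translating the embedded '\r' characters before the records are split.

```python
import gzip

# Made-up records. The first DEF value contains an internal carriage return.
SAMPLE_MGDEF = (
    "#CUI|DEF|source|SUPPRESS|\n"
    "C0000001|First line of a definition.\rSecond line of the same definition.|EXAMPLE|N|\n"
    "C0000002|A single-line definition.|EXAMPLE|N|\n"
)


def parse_mgdef(text):
    """Split RRF text into records on '|\\n' and fields on '|'."""
    records = []
    for rec in text.split("|\n"):
        if not rec or rec.startswith("#"):
            continue  # skip the header and the empty tail after the last record
        parts = rec.split("|")
        records.append({
            "CUI": parts[0],
            # Re-join the middle in case a definition itself contains '|',
            # then turn the internal '\r' separators into ordinary newlines.
            "DEF": "|".join(parts[1:-2]).replace("\r", "\n"),
            "source": parts[-2],
            "SUPPRESS": parts[-1],
        })
    return records


def load_mgdef(path):
    """Read a gzipped MGDEF.RRF file without newline translation."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return parse_mgdef(handle.read())


records = parse_mgdef(SAMPLE_MGDEF)
for record in records:
    print(record["CUI"], repr(record["DEF"]))
```

Reading the whole file and splitting on the record separator avoids line-oriented readers, which may treat '\r' as a line ending and break a definition across rows.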

We have a separate report that links the CUIs to the various database identifiers that are mapped in MedGen: "MedGenIDMappings.txt.gz"

File: MedGenIDMappings.txt.gz

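Summary data for MedGen's assigned CUI, preferred name and other source database identifiers. MedGen has many data sources and aligns these ontologies and vocabularies to represent unified disease concepts. The various identifiers are mapped to a CUI and MedGen's preferred name for the concept.

  • CUI: concept unique identifier
  • Preferred name: MedGen's preferred name
  • extrn_id: the identifier from the source
  • extrn_src: abbreviated name for the source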

Sources reported:

  • SNOMED CT
  • MeSH
  • Human Phenotype Ontology (HPO)
  • Mondo
  • Online Mendelian Inheritance in Man (OMIM)
  • OMIM disease records
  • OMIM phenotypes from specific alleles
  • OMIM phenotypic series
  • OMIM included diseases
  • Orphanet
  • MedGen UID
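As a rough Python sketch of how the two files could be combined, the mapping file can be indexed by (source, identifier) to find a CUI, and the CUI can then be used to look up a definition. The pipe-delimited layout, the source abbreviations and the identifier formats below are assumptions, and every row is made up; check them against the downloaded files.

```python
from collections import defaultdict

# Made-up rows in an assumed CUI|preferred name|extrn_id|extrn_src| layout.
SAMPLE_MAPPINGS = (
    "#CUI|pref_name|source_id|source|\n"
    "C0000001|Example disease|000001|OMIM|\n"
    "C0000001|Example disease|MONDO:0000001|MONDO|\n"
    "C0000002|Another example disease|000002|Orphanet|\n"
)

# Stand-in for definitions keyed by CUI (for instance, parsed from MGDEF.RRF).
SAMPLE_DEFS = {"C0000001": "A made-up definition for the example disease."}


def build_lookup(mapping_text):
    """Index the mapping rows by (source, identifier) and by CUI."""
    by_source = {}
    by_cui = defaultdict(list)
    for line in mapping_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cui, _name, ext_id, ext_src = line.split("|")[:4]
        by_source[(ext_src, ext_id)] = cui
        by_cui[cui].append((ext_src, ext_id))
    return by_source, dict(by_cui)


def definition_for(ext_src, ext_id, by_source, defs_by_cui):
    """Return the definition for an external ID, or None if unmapped or undefined."""
    cui = by_source.get((ext_src, ext_id))
    return defs_by_cui.get(cui) if cui is not None else None


by_source, by_cui = build_lookup(SAMPLE_MAPPINGS)
print(definition_for("OMIM", "000001", by_source, SAMPLE_DEFS))
print(definition_for("Orphanet", "000002", by_source, SAMPLE_DEFS))
print(by_cui["C0000001"])
```

Returning `None` for unmapped or undefined IDs makes it easy to measure the missing rate for a given set of disease IDs.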

pnrobinson commented 12 months ago

I see. @matentzn and @iimpulse -- I think we are getting the definitions from Mondo. Can we supply Brian with the correct API call or other info to reproduce this?

iimpulse commented 12 months ago

https://api.monarchinitiative.org/api/bioentity/disease/OMIM:154700?fetch_objects=false&unselect_evidence=true&exclude_automatic_assertions=false&get_association_counts=false&rows=1

This is the call we use on the HPO page to get the disease definition. In the near future (~2 months) we will ingest the Mondo data ourselves and provide that in the API. Hope this helps.
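For reference, the same call can be built for any disease ID with a short Python sketch using only the standard library. The query parameters are copied from the URL above; the helper names are hypothetical, and which field of the JSON response holds the definition is not stated in this thread, so inspect the returned object rather than relying on a particular key. The endpoint may also change once the planned ingest is in place.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "https://api.monarchinitiative.org/api/bioentity/disease/"


def disease_url(disease_id, rows=1):
    """Build the request URL with the same query parameters as the call above."""
    params = {
        "fetch_objects": "false",
        "unselect_evidence": "true",
        "exclude_automatic_assertions": "false",
        "get_association_counts": "false",
        "rows": rows,
    }
    # Keep ':' unescaped so the ID appears in the path as in the original call.
    return BASE_URL + quote(disease_id, safe=":") + "?" + urlencode(params)


def fetch_disease(disease_id, timeout=30):
    """Fetch and decode the JSON response (needs network access; not run here)."""
    with urllib.request.urlopen(disease_url(disease_id), timeout=timeout) as resp:
        return json.load(resp)


print(disease_url("OMIM:154700"))
```

When looping over many disease IDs, it is worth caching responses and pausing between requests to avoid overloading the service.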

bschilder commented 11 months ago

@bschilder MedGen may be able to help with your disease definitions issue and with cross-referencing terms by ID. Our FTP files (https://ftp.ncbi.nlm.nih.gov/pub/medgen/) include a file of definitions collected by MedGen: MGDEF.RRF.gz

File: MGDEF.RRF.gz

Summary data for definitions and sources of concepts.

  • CUI: concept unique identifier
  • DEF: concept definition. Please see NOTE below
  • source: source of the definition
  • SUPPRESS: suppressed by UMLS curators (no reason is reported)

NOTE: Some values in the DEF column contain internal line breaks. The line separator for RRF files is '|\n'. The line separator within the DEF column of MGDEF.RRF is CR (carriage return, '\r', 0x0D, 13 in decimal). Unix/Linux and Windows tools sometimes behave differently on these formats. If this format is problematic for you, consider using the comma-separated value (csv) files in the csv subdirectory. (https://ftp.ncbi.nlm.nih.gov/pub/medgen/csv/)

We have a separate report that links the CUIs to the various database identifiers that are mapped in MedGen: "MedGenIDMappings.txt.gz"

File: MedGenIDMappings.txt.gz

Summary data for MedGen's assigned CUI, preferred name and other source database identifiers. MedGen has many data sources and aligns these ontologies and vocabularies to represent unified disease concepts. The various identifiers are mapped to a CUI and MedGen's preferred name for the concept.

  • CUI: concept unique identifier
  • Preferred name: MedGen's preferred name
  • extrn_id: the identifier from the source
  • extrn_src: abbreviated name for the source

Sources reported:

  • SNOMED CT
  • MeSH
  • Human Phenotype Ontology (HPO)
  • Mondo
  • Online Mendelian Inheritance in Man (OMIM)
  • OMIM disease records
  • OMIM phenotypes from specific alleles
  • OMIM phenotypic series
  • OMIM included diseases
  • Orphanet
  • MedGen UID

Amazing, thanks so much for the detailed reply @kanems ! I've had a look through this data and I found it quite helpful for ID mapping across ontologies.

https://api.monarchinitiative.org/api/bioentity/disease/OMIM:154700?fetch_objects=false&unselect_evidence=true&exclude_automatic_assertions=false&get_association_counts=false&rows=1

This is the call we use on the HPO page to get the disease definition. In the near future (~2 months) we will ingest the Mondo data ourselves and provide that in the API. Hope this helps.

@iimpulse this is also super helpful. I hadn't realized the website was rendering dynamically by pulling resources via the Monarch API, very cool! And thank you, I think this will be an excellent resource to have in one easily used file! I'll continue integrating this into HPOExplorer to make sure they're as in sync as possible.

pnrobinson commented 11 months ago

Thanks, everybody, I will close this for now but please open a new issue as needed!