opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

New phenotypes data from EFO #1212

Closed andrewhercules closed 3 years ago

andrewhercules commented 3 years ago

With the release of EFO 3.23.0, there will be new phenotype data that we can integrate into the Platform to enrich the disease index and display on the disease profile page.

d0choa commented 3 years ago

I'll extract some numbers from the efo_otar_profile.owl file, which is supposed to contain the disease-phenotype relationships from Monarch, using the following axiom:

?d skos:related ?p

As a side note, this profile contains all the HP elements that map to the diseases. This could potentially mean a lot more terms than the "official" EFO release. I would try to get some numbers on that as well.

@hammer has previously shown interest in these activities, in case you or @dhimmel have any input.

Upstream work here https://github.com/EBISPOT/efo/issues/794

d0choa commented 3 years ago

SPARQL query

Using efo_otar_profile.owl v3.23.0

# SPARQL query to get disease-phenotype pairs
# (the label prefix is named rdfs: because it points to the RDF Schema namespace)
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?d ?d_label ?p ?p_label
WHERE {
  ?d a owl:Class .
  ?d rdfs:label ?d_label .
  ?d skos:related ?p .
  ?p rdfs:label ?p_label
}

Output sample

d d_label p p_label
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0000577 Exotropia
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0001382 Joint hypermobility
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0001999 Abnormal facial shape
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0000023 Inguinal hernia
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0000486 Strabismus
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0000256 Macrocephaly
http://www.orpha.net/ORDO/Orphanet_96121 7q11.23 microduplication syndrome http://purl.obolibrary.org/obo/HP_0002119 Ventriculomegaly

Some stats

unique counts

diseases 3046
phenotypes 1118

most frequent phenotypes

p_label count
Seizure 679
Intellectual disability 677
Global developmental delay 568
Microcephaly 458
Muscular hypotonia 424
Hypertelorism 416
Micrognathia 376
Strabismus 362
Nystagmus 297
Cleft palate 294
Ataxia 284
Epicanthus 271
Failure to thrive 269
Sensorineural hearing impairment 242
Downslanted palpebral fissures 230
High palate 224
Clinodactyly of the 5th finger 215
Short neck 214
Low-set ears 212
Wide nasal bridge 211

diseases with the most phenotypes

d_label count
Williams syndrome 87
Distal monosomy 10q 63
22q11.2 deletion syndrome 61
Wiedemann-Rautenstrauch syndrome 60
1p36 deletion syndrome 59
7q11.23 microduplication syndrome 58
Oculocerebrorenal syndrome 57
Schwartz-Jampel syndrome 54
Acroosteolysis dominant type 53
Cornelia de Lange syndrome 53
2p15p16.1 microdeletion syndrome 52
Peters plus syndrome 52
PMM2-CDG 50
Wolf-Hirschhorn syndrome 49
Fanconi anemia 48
Smith-Magenis syndrome 48
ADNP-related multiple congenital anomalies-intellectual disability-autism spectrum disorder 47
Cardiac anomalies-developmental delay-facial dysmorphism syndrome 47
Intellectual disability-feeding difficulties-developmental delay-microcephaly syndrome 47
Smith-Lemli-Opitz syndrome 47
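
For anyone wanting to reproduce these numbers, here is a minimal sketch of how the counts above could be derived from the query result rows. It is not the code that produced the tables. The `summarize` helper and its `top` parameter are hypothetical names, and the input is assumed to be (d, d_label, p, p_label) tuples in the same order as the SELECT clause.

```python
from collections import Counter

# First rows of the output sample above: (d, d_label, p, p_label).
DISEASE = "http://www.orpha.net/ORDO/Orphanet_96121"
LABEL = "7q11.23 microduplication syndrome"
ROWS = [
    (DISEASE, LABEL, "http://purl.obolibrary.org/obo/HP_0000577", "Exotropia"),
    (DISEASE, LABEL, "http://purl.obolibrary.org/obo/HP_0001382", "Joint hypermobility"),
    (DISEASE, LABEL, "http://purl.obolibrary.org/obo/HP_0001999", "Abnormal facial shape"),
]


def summarize(rows, top=20):
    """Compute unique counts and the most frequent terms (hypothetical helper)."""
    d_labels, p_labels, links = {}, {}, set()
    for d, d_label, p, p_label in rows:
        d_labels.setdefault(d, d_label)  # keep the first label seen per IRI
        p_labels.setdefault(p, p_label)
        links.add((d, p))                # de-duplicate repeated disease-phenotype links
    per_phenotype = Counter(p for _, p in links)  # diseases per phenotype
    per_disease = Counter(d for d, _ in links)    # phenotypes per disease
    return {
        "diseases": len(d_labels),
        "phenotypes": len(p_labels),
        "most_frequent_phenotypes": [
            (p_labels[p], n) for p, n in per_phenotype.most_common(top)
        ],
        "diseases_with_most_phenotypes": [
            (d_labels[d], n) for d, n in per_disease.most_common(top)
        ],
    }


stats = summarize(ROWS)
print(stats["diseases"], stats["phenotypes"])
```

Counting by IRI rather than by label avoids inflating the numbers when a class carries more than one label.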

dhimmel commented 3 years ago

@d0choa thanks for tagging me. Cool to see the SPARQL query and the results.

Some general questions about the approach (I'm new here):

  1. In this context, can we think of phenotypes as signs or symptoms of the disease?
  2. Are diseases and phenotypes in the above queries distinct sets of terms? Or could there be ?disease skos:related ?phenotype_1 as well as ?phenotype_1 skos:related ?phenotype_2, such that ?phenotype_1 is both a phenotype of a disease, and a disease in and of itself?
  3. Will these relationships be included in the main efo.owl or just in efo_otar_profile.owl? What is efo_otar_profile.owl?

d0choa commented 3 years ago

Hi @dhimmel,

  1. As far as I know, MONDO collects this data from Orphanet and OMIM. They refer to them as clinical signs and symptoms.
  2. I believe all diseases have EFO/MONDO ids whereas all phenotypes have HP ids, which will prevent this type of chain. However, I haven't actually tested it.
  3. Not sure what the plans are for the official owl on adopting these changes. The Open Targets OWLs contain a slightly different structure in the high-level terms than the official EFO. They try to be more aligned with other clinical ontologies such as MedDRA. For example, they ignore organisational terms such as disease by anatomical system. At the moment, the phenotypes have only been implemented in the profile, but they will soon be propagated to the slim. Once this is done, the differences between the two will be merely technical (e.g. the profile is rooted whereas the slim is not).
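
Since point 2 has not been tested, here is a minimal sketch of how the check could look once the (disease, phenotype) pairs from the query are available. The function names are hypothetical, and the HP IRI prefix is taken from the output sample earlier in the thread.

```python
HP_PREFIX = "http://purl.obolibrary.org/obo/HP_"

# (disease IRI, phenotype IRI) pairs from the output sample above.
LINKS = [
    ("http://www.orpha.net/ORDO/Orphanet_96121", HP_PREFIX + "0000577"),
    ("http://www.orpha.net/ORDO/Orphanet_96121", HP_PREFIX + "0001382"),
]


def find_chained_terms(links):
    """Return terms used both as a disease (subject) and as a phenotype (object)."""
    subjects = {d for d, _ in links}
    objects = {p for _, p in links}
    return subjects & objects


def find_non_hp_phenotypes(links, hp_prefix=HP_PREFIX):
    """Return phenotype IRIs that do not come from the HP namespace."""
    return {p for _, p in links if not p.startswith(hp_prefix)}


print(find_chained_terms(LINKS))      # empty set means no d -> p -> p2 chains
print(find_non_hp_phenotypes(LINKS))  # empty set means every phenotype is an HP term
```

Both sets being empty on the full result would confirm that diseases and phenotypes are distinct sets of terms.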

zoependlington commented 3 years ago

Just for some (related) clarification on the nature of the d2p links:

  1. Most of the links come from Monarch, but the raw source for the Monarch dump is phenotype.hpoa, hosted as part of the HPO project, which is documented here (fyi, you can see how the d2p module is constructed in EFO here)
  2. Some of the links come directly from Mondo (a minority). Note that there are axiom annotations which have the metadata attached to determine the source. I did not test this query, but this way you can get the source.
    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    prefix owl: <http://www.w3.org/2002/07/owl#>
    prefix OBAN: <http://purl.org/oban/>
    prefix dc: <http://purl.org/dc/elements/1.1/>
    prefix RO: <http://purl.obolibrary.org/obo/RO_>
    SELECT ?disease ?d_label ?phenotype ?p_label ?source
    WHERE { 
    ?disease rdf:type owl:Class ;
    <http://www.w3.org/2004/02/skos/core#related> ?phenotype .
    [ rdf:type owl:Axiom ;
     owl:annotatedSource ?disease ;
     owl:annotatedProperty <http://www.w3.org/2004/02/skos/core#related> ;
     owl:annotatedTarget ?phenotype ;
     dc:source ?source  ] .
    OPTIONAL {
    ?disease rdfs:label ?d_label .
    }
    OPTIONAL {
    ?phenotype rdfs:label ?p_label .
    }
    }
  3. It's unlikely we will add the whole thing to efo.owl proper (just the otar profile), as this blows up EFO too much, at least in the foreseeable future.
  4. It's easy to import the remaining HPO terms into the otar profile and we can do this for the next release!

matentzn commented 3 years ago

One thing I would like you to be aware of is that this "related-to" kind of associative data is completely de-contextualised, i.e. noisy.

The Monarch representation of these mappings has richer contextual metadata, in particular:

Frequency qualifiers, qualification of onset, sex-specificity and evidence codes from ECO, for example:

MONARCH:b00571f419549ef8081e a OBAN:association ;
    RO:0002558 ECO:0000269 ;
    dc:source PMID:27158779 ;
    OBAN:association_has_object HP:0003762 ;
    OBAN:association_has_predicate RO:0002200 ;
    OBAN:association_has_subject OMIM:617925 ;
    :frequencyOfPhenotype "1/1" ;
    :has_sex_specificity PATO:0000383 ;
    :onset HP:0003577 .

This information may be very important for your analyses. If you need it in EFO, you should make a ticket to that end (these annotations are easy to add in, there are just many of them). You can also see in this example that Monarch gives more semantically meaningful links (i.e. RO:0002200), which may be worth considering as well.

Note that the original data also has negated information, i.e. disease2phenotype associations that explicitly do not hold. This information has been filtered out for the Monarch data dump. If you are interested in how the Monarch TTL representation of the HPOA files comes about, check here. I love this subject and I am happy to help @zoependlington to make this as useful as possible for you!
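
For anyone reading the raw phenotype.hpoa file directly, the following is a rough sketch of how the negated associations could be dropped while keeping the frequency, onset and sex columns. It is not the Monarch pipeline. The column order in `COLUMNS` and the use of `NOT` in the qualifier column are assumptions that should be checked against the header of the downloaded file, and the two sample rows are hypothetical, built from the identifiers in the example above.

```python
import csv
import io

# Assumed column order for phenotype.hpoa; verify against the file header.
COLUMNS = [
    "database_id", "disease_name", "qualifier", "hpo_id", "reference", "evidence",
    "onset", "frequency", "sex", "modifier", "aspect", "biocuration",
]

# Hypothetical excerpt: one positive and one negated annotation.
SAMPLE = "\n".join([
    "#description: hypothetical two-row excerpt",
    "\t".join(COLUMNS),
    "\t".join(["OMIM:617925", "Example disease", "", "HP:0003762",
               "PMID:27158779", "PCS", "HP:0003577", "1/1", "", "", "P", ""]),
    "\t".join(["OMIM:617925", "Example disease", "NOT", "HP:0000256",
               "PMID:27158779", "PCS", "", "", "", "", "P", ""]),
])


def read_hpoa(handle, keep_negated=False):
    """Yield one dict per annotation, skipping comments, the header and negated rows."""
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0] == COLUMNS[0]:  # blank line or header row
            continue
        row = dict(zip(COLUMNS, fields))
        if row.get("qualifier") == "NOT" and not keep_negated:
            continue  # association that explicitly does not hold
        yield row


for row in read_hpoa(io.StringIO(SAMPLE)):
    print(row["database_id"], row["hpo_id"], row["frequency"], row["onset"])
```

Passing `keep_negated=True` keeps the negated rows, which may be useful if the explicit "does not hold" statements are needed downstream.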

d0choa commented 3 years ago

As for our 20.11 release, we have used the 3.24 slim containing all the phenotype terms linked to diseases as described above. Users can search for the terms, but there is not yet any information on which phenotypes are linked to which diseases.

Given the comments raised above by @zoependlington and @matentzn, we are trying to import the enriched Monarch metadata for the relationships. @cmalangone is working directly with the Monarch data dump. Once we have a preliminary working version, we might ask for feedback.

Tagging for our 21.02 release

cmalangone commented 3 years ago

Some checks about these last indices:

Things to fix or to do:

matentzn commented 3 years ago

Hey @cmalangone Does this mean you don't need the links anymore from the EFO OTAR profile? You take them directly from Monarch?

d0choa commented 3 years ago

hi @matentzn, we are still prototyping it. At the moment, we are trying to get the relationships from Monarch directly together with all the rich metadata that you suggested. We'll let you know if this works for us. @cmalangone might have some questions for you, though.

matentzn commented 3 years ago

Always! :) Let me know how I can help.

cmalangone commented 3 years ago

hi @matentzn, can you please get in touch with me? I have a couple of good examples with different info. Thanks

matentzn commented 3 years ago

Sent you an email!

cmalangone commented 3 years ago

Phenotypes are integrated with Disease (efo-ontology). The resources involved are:

hpo_phenotypes info: uri: http://compbio.charite.de/jenkins/job/hpo.annotations.current/lastSuccessfulBuild/artifact/current/phenotype.hpoa
hp ontology: uri: https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.owl
mondo ontology: uri: https://github.com/monarch-initiative/mondo/releases/download/v2020-12-18/mondo.owl

PIS, ETL and GraphQL were updated accordingly with the new format/data.

We implemented a final test to check whether the skos:related info is available through the phenotype entries.

matentzn commented 3 years ago

One thing that has nothing to do with the ticket but I noticed: You should never use the raw GitHub URLs for referring to any of these ontologies and data sets. For example: the Charité link for hpoa is already stale, as the file moved elsewhere last month. hp.owl will soon grow beyond 100MB and then migrate away from GitHub to GitHub releases. Always use PURLs:

http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa
http://purl.obolibrary.org/obo/hp.owl
http://purl.obolibrary.org/obo/mondo.owl

HP and Mondo are both correctly versioned:
http://purl.obolibrary.org/obo/hp/releases/2021-02-08/hp.owl
http://purl.obolibrary.org/obo/mondo/releases/2021-01-29/mondo.owl

Talk soon!
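
The versioned PURLs above follow a regular pattern, so a config could build them from an ontology id and a release date. A minimal sketch, assuming the `<ontology>/releases/<date>/<file>` layout seen in the two examples holds for the ontology in question (it does not yet for every OBO ontology). The function name is hypothetical.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo"


def versioned_obo_purl(ontology, release, filename=None):
    """Build a versioned OBO PURL such as .../obo/hp/releases/2021-02-08/hp.owl."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", release):
        raise ValueError(f"release must look like YYYY-MM-DD, got {release!r}")
    filename = filename or f"{ontology}.owl"  # default to the main OWL file
    return f"{OBO_BASE}/{ontology}/releases/{release}/{filename}"


print(versioned_obo_purl("hp", "2021-02-08"))
print(versioned_obo_purl("mondo", "2021-01-29"))
```

Pinning the release date in one place makes it easier to see which ontology versions went into a given data release.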

dhimmel commented 3 years ago

> You should never use the raw github URLs for referring to any of these ontologies and data sets.

Noting that it is sometimes useful to use commit-hash-versioned GitHub URLs to datasets, which can be generated by pressing y. There is always a chance a repo could rewrite history, so agree with @matentzn to use a community-approved permalink whenever possible.

> HP and Mondo are both correctly versioned:

Didn't know about these versioned PURLs. Very useful. Looks like it also works with Gene Ontology like http://purl.obolibrary.org/obo/go/releases/2021-02-01/go-basic.json.gz.

IIUC EFO is not indexed by OBO Foundry, such that versioned links should go to the GitHub releases like https://github.com/EBISPOT/efo/releases/download/v3.27.0/efo_otar_slim.owl? Or is it preferred to do https://www.ebi.ac.uk/efo/releases/v3.27.0/efo_otar_slim.owl?

matentzn commented 3 years ago

Hey @dhimmel

Yeah, if you include the commit hash, I guess there are some use cases - however, as long as we are talking about ontology release files, using commit hash references should be equivalent to using the version IRI! :)

Versioned PURLs should really work for all OBO ontologies, but they do not quite yet :{ All the bigger ones have them, though.

Regarding EFO, good questions! Not sure whether the efo otar slim has a purl.. @zoependlington ?

zoependlington commented 3 years ago

The Open Targets slim and profile do have versioned purls. e.g. http://www.ebi.ac.uk/efo/releases/v3.14.0/efo_otar_slim.owl http://www.ebi.ac.uk/efo/releases/v3.14.0/efo_otar_profile.owl

d0choa commented 3 years ago

Thanks everyone for the useful feedback. We are definitely reviewing the use of permalinks in our codebase. #1395

dhimmel commented 3 years ago

> Phenotypes are integrated with Disease (efo-ontology)

@cmalangone if I understand your comment correctly, you were saying that:

  1. you created a dataset of disease-to-phenotype links, where the diseases are EFO terms and the phenotypes are HP terms.
  2. you used MONDO and HP directly rather than efo_otar_profile.owl to get more of the "contextual metadata"
  3. This extraction was done as part of Open Targets

Is that correct? If so, is the code and/or output dataset available? I'd like the same thing, but don't want to re-implement the extraction if you've already done it.

matentzn commented 3 years ago

I am also interested in seeing this pipeline if it's public!

cmalangone commented 3 years ago

@dhimmel: Sorry for the delay, I was off. platform-input-support retrieves the latest EFO (owl) file and transforms it into a json file. It also downloads the http://purl.obolibrary.org/obo/hp/hpoa/phenotype.hpoa file. https://github.com/opentargets/platform-input-support/blob/master/config.yaml#L89

The ETL step reads the EFO json file and it joins the field "dbXRefs" with the different IDs from MONDO (using the field 'id') and phenotype (using the field 'databaseId') resources.

Code here:
https://github.com/opentargets/platform-etl-backend/blob/master/src/main/scala/io/opentargets/etl/backend/Disease.scala#L121
https://github.com/opentargets/platform-etl-backend/blob/master/src/main/scala/io/opentargets/etl/backend/Disease.scala#L122
https://github.com/opentargets/platform-etl-backend/blob/master/src/main/scala/io/opentargets/etl/backend/Disease.scala#L123
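
To illustrate the join described above without reading the Scala code, here is a rough Python sketch of the phenotype side only. It is not the ETL implementation. The field names `dbXRefs`, `id` and `databaseId` come from the description, while the `phenotype` field, the `link_phenotypes` function and the sample records are hypothetical.

```python
# Hypothetical records shaped like the json produced by platform-input-support.
EFO_TERMS = [
    {"id": "EFO_EXAMPLE_1", "dbXRefs": ["OMIM:617925"]},
    {"id": "EFO_EXAMPLE_2", "dbXRefs": []},
]
HPO_ANNOTATIONS = [
    {"databaseId": "OMIM:617925", "phenotype": "HP:0003762"},
]


def link_phenotypes(efo_terms, hpo_annotations):
    """Map each EFO id to the HP ids reachable through its cross-references."""
    by_database_id = {}
    for annotation in hpo_annotations:
        by_database_id.setdefault(annotation["databaseId"], set()).add(
            annotation["phenotype"]
        )
    linked = {}
    for term in efo_terms:
        phenotypes = set()
        for xref in term.get("dbXRefs", []):
            phenotypes |= by_database_id.get(xref, set())
        if phenotypes:  # terms without any matching xref are left out
            linked[term["id"]] = sorted(phenotypes)
    return linked


print(link_phenotypes(EFO_TERMS, HPO_ANNOTATIONS))
```

The MONDO side would presumably follow the same pattern, matching `dbXRefs` against the MONDO `id` field instead of `databaseId`.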

cmalangone commented 3 years ago

@dhimmel Info about inputs/outputs

INPUTS:

EFO owl file: https://storage.googleapis.com/open-targets-data-releases/21.06/input/annotation-files/ontology/efo_otar_slim.owl

Mondo owl file: open-targets-data-releases/21.06/input/annotation-files/ontology/mondo.owl

EFO json after platform-input-support: https://storage.googleapis.com/open-targets-data-releases/21.06/input/annotation-files/ontology/efo_json/ontology-efo-v3.31.0.jsonl

MONDO after platform-input-support: https://storage.googleapis.com/open-targets-data-releases/21.06/input/annotation-files/ontology/efo_json/ontology-mondo.jsonl

phenotype after platform-input-support: https://storage.googleapis.com/open-targets-data-releases/21.06/input/annotation-files/ontology/efo_json/hpo-phenotypes-2021-06-18.jsonl

OUTPUT: https://console.cloud.google.com/storage/browser/open-targets-data-releases/21.06/output/etl/json/diseases or gs://open-targets-data-releases/21.06/output/etl/json/diseases/*