opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Resolve data issues with new `target` ETL and API endpoint #1641

Closed andrewhercules closed 2 years ago

andrewhercules commented 3 years ago

This ticket captures preliminary findings after reviewing the development API with the new target ETL data.

19 July 2021: ticket still a work-in-progress as review ongoing

Mouse Phenotypes

Issues with MP should be compiled on a specific MP ticket.

Pathways

Comparative Genomics

Protein Information

Target Safety

TEPs

Cancer Hallmarks

Target tractability

Cancer Biomarkers

[Three screenshots, taken 2021-07-19 at 10:25:16, 10:24:51 and 10:23:03]

Gene Ontology

Chemical Probes

Baseline expression

Target classes

JarrodBaker commented 3 years ago

Thanks @andrewhercules, I've moved the mouse phenotype section to a new ticket as a) it's a separate index (internally), and b) those issues existed on the current dataset. I think that @d0choa had mentioned that we need to rewrite MP, so that can maybe be a starting point for those.

Please keep adding issues here and I'll triage as they come in. :)

JarrodBaker commented 3 years ago

@andrewhercules pathway was removed from the schema and is no longer available. There was a comment that we could potentially get this from 'reactome pathways' if necessary (comment on target refactor spreadsheet iteration1), but it hasn't been followed up yet. I think the broad plan was to introduce a new 'geneSets' index of some sort using Reactome data as a base. @d0choa will have a better idea.

d0choa commented 3 years ago

my memory might fail but I think the Reactome pathways (R-HSA-XXX) in the new implementation were coming as xrefs directly from Ensembl? Could you check that field?

Resolving the ids into labels was done by @mkarmona for the facets.

andrewhercules commented 3 years ago

> my memory might fail but I think the Reactome pathways (R-HSA-XXX) in the new implementation were coming as xrefs directly from Ensembl? Could you check that field?
>
> Resolving the ids into labels was done by @mkarmona for the facets.

I have checked the dbXrefs field and it only returns the Pathway ID and source. Currently, we display the pathway name and top-level parent pathway.

JarrodBaker commented 3 years ago

Regarding comparative genomics:

I think we're still missing a couple of entries that we should have though, so I'll keep digging.

JarrodBaker commented 3 years ago

Ticking off the missing 'rat' entry on TNF as the second entry is low confidence so we're excluding it intentionally:

+---------------+-----------------+-----------------------+------------------+
| gene_stable_id| homology_species|homology_gene_stable_id|is_high_confidence|
+---------------+-----------------+-----------------------+------------------+
|ENSG00000232810|rattus_norvegicus|     ENSRNOG00000055156|                 0|
|ENSG00000232810|rattus_norvegicus|     ENSRNOG00000000837|                 1|
+---------------+-----------------+-----------------------+------------------+
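The exclusion described above can be sketched in a few lines. This is only an illustration in plain Python, not the ETL's actual code: the function name `high_confidence_homologues` is hypothetical, and the rows are simply the two rat entries for TNF copied from the table.

```python
from typing import Dict, List

Row = Dict[str, object]


def high_confidence_homologues(rows: List[Row]) -> List[Row]:
    """Keep only homology rows flagged as high confidence (is_high_confidence == 1)."""
    return [row for row in rows if row["is_high_confidence"] == 1]


# The two rat rows for TNF (ENSG00000232810) from the table above.
tnf_rat_rows: List[Row] = [
    {
        "gene_stable_id": "ENSG00000232810",
        "homology_species": "rattus_norvegicus",
        "homology_gene_stable_id": "ENSRNOG00000055156",
        "is_high_confidence": 0,
    },
    {
        "gene_stable_id": "ENSG00000232810",
        "homology_species": "rattus_norvegicus",
        "homology_gene_stable_id": "ENSRNOG00000000837",
        "is_high_confidence": 1,
    },
]

kept = high_confidence_homologues(tnf_rat_rows)
```

With this filter only the ENSRNOG00000000837 row survives, which is why a single rat homologue is expected for TNF.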

Regarding ESR1:

JarrodBaker commented 3 years ago

Regarding protein information (subcellular location), taking the example of APP (ENSG00000142192)

+---------------+---------------------------------------------------+
|id             |location                                           |
+---------------+---------------------------------------------------+
|ENSG00000142192|Cell membrane ; Single-pass type I membrane protein|
|ENSG00000142192|Cell projection, growth cone                       |
|ENSG00000142192|Cytoplasm                                          |
|ENSG00000142192|Cytoplasmic vesicle                                |
|ENSG00000142192|Early endosome                                     |
|ENSG00000142192|Golgi apparatus                                    |
|ENSG00000142192|Membrane ; Single-pass type I membrane protein     |
|ENSG00000142192|Membrane, clathrin-coated pit                      |
|ENSG00000142192|Perikaryon                                         |
|ENSG00000142192|Vesicles                                           |
|ENSG00000142192|[Amyloid-beta protein 42]: Cell surface            |
|ENSG00000142192|[Gamma-secretase C-terminal fragment 59]: Nucleus  |
|ENSG00000142192|[Soluble APP-beta]: Secreted                       |
+---------------+---------------------------------------------------+

All of these entries come from UniProt flat files which we parse in the ETL. The documentation says that:

This is formally defined as:

The format of SUBCELLULAR LOCATION is:

           CC   -!- SUBCELLULAR LOCATION:(( Molecule:)?( Location\.)+)?( Note=Free_text( Flag)?\.)?

Where:

    Molecule: Isoform, chain or peptide name
    Location = Subcellular_location( Flag)?(; Topology( Flag)?)?(; Orientation( Flag)?)?
        Subcellular_location: SL-line of subcell.txt ID-record
        Topology: SL-line of subcell.txt IT-record
        Orientation: SL-line of subcell.txt IO-record
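To make the format above concrete, here is a minimal sketch that splits one already-flattened location string (as in the APP table above) into its molecule, subcellular location, topology and orientation parts. It is an assumption-laden illustration, not the ETL's parser: the names `SubcellularLocation` and `parse_location` are hypothetical, the molecule is assumed to appear as a `[Name]:` prefix as in the table rows, and flags and `Note=` text from the raw flat file are not handled.

```python
import re
from typing import NamedTuple, Optional


class SubcellularLocation(NamedTuple):
    molecule: Optional[str]     # isoform, chain or peptide name, if present
    location: str               # subcellular location term
    topology: Optional[str]     # e.g. "Single-pass type I membrane protein"
    orientation: Optional[str]


# Assumed shape of the optional molecule prefix: "[Name]: rest of the entry".
_MOLECULE_PREFIX = re.compile(r"^\[(?P<molecule>[^\]]+)\]:\s*(?P<rest>.*)$")


def parse_location(raw: str) -> SubcellularLocation:
    """Split 'Location ; Topology ; Orientation' with an optional '[Molecule]:' prefix."""
    text = raw.strip()
    molecule = None
    match = _MOLECULE_PREFIX.match(text)
    if match:
        molecule = match.group("molecule")
        text = match.group("rest")

    # Location, topology and orientation are separated by ';' in that order.
    parts = [part.strip() or None for part in text.split(";")]
    parts += [None] * (3 - len(parts))  # pad missing topology/orientation
    location, topology, orientation = parts[:3]
    return SubcellularLocation(molecule, location or "", topology, orientation)


membrane = parse_location("Cell membrane ; Single-pass type I membrane protein")
peptide = parse_location("[Amyloid-beta protein 42]: Cell surface")
```

Splitting the fields like this would let the molecule-specific entries and the topology be shown or hidden independently, whichever display convention is chosen.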

In the current Platform:

UniProt:

In short, we've got a bit of a mix of the current Platform's format and UniProt's. Any thoughts @andrewhercules and @d0choa on which way we want to go?

JarrodBaker commented 3 years ago

ChemicalProbes is yet to be implemented, see (opentargets/platform#1389). It's with @ireneisdoomed at present. Once she's back from leave she can provide an update.

JarrodBaker commented 3 years ago

Regarding Protein Information: the schema has changed so that evidence is a simple string. I've tried adding in ECO codes as described in opentargets/platform#1037, but they are quite sparse. I'll revisit it again later. Unless I hear otherwise, I'll assume that they are still a low priority (as mentioned in the linked ticket).

If nothing else the endpoint should be working!

JarrodBaker commented 3 years ago

I've temporarily removed targetSafety from the API as it's broken and unstable. This way other errors that are found will be more likely to be genuine errors.

tskir commented 3 years ago

Changes required to accommodate mousePhenotypes schema updates (https://github.com/opentargets/platform/issues/1471)

Comparison between the old and new schema is available in this spreadsheet (first four columns). Summary:

Also CCing @ireneisdoomed @DSuveges to keep them in the loop.

ireneisdoomed commented 3 years ago

Cancer Biomarkers Updates

This dataset describes associations between a target and a disease when a cancer biomarker is found.

Because the association depends on the presence of the biomarker, it was initially scoped to be part of the target annotations and not part of the evidence.

We have modelled and parsed the table so that the cancer biomarkers are a new data source of evidence for the upcoming release (PR #89)

The schema that this source follows is tracked here: https://docs.google.com/spreadsheets/d/1Mowq7KsGTMtEg3wZpJBNK_UbawHKJeM9d0syT9F9AMc/edit?usp=sharing

Therefore:

Back-end actions (CC @JarrodBaker)

Front-end actions (CC @andrewhercules)

andrewhercules commented 3 years ago

@ireneisdoomed, I have updated #1645 and have asked the front-end team to remove the Cancer Biomarkers summary widget and detail view from the target profile page.

When evidence details are ready, including the relevant fields and datasource name (e.g. Cancer Genome Interpreter) please let me know and I can create design specifications for a summary widget and detail view for the evidence page. The front-end team will also make sure the filters on the association page are updated. And we can adjust the documentation and include the new source on the evidence page.

I'm also CCing @HelenaCornu as we will want to explain this in our release comms if the change is ready for 21.09 :-)

ireneisdoomed commented 3 years ago

Chemical Probes Updates

Until now, chemical probes were generated by parsing the spreadsheet that the data team maintained and curated manually. From 21.09 on, Chemical Probes and information about the ProbeMiner score will come from probes-drugs.org; the integration of this resource is described in #1536.

The proposed model with all the different endpoints can be seen here (current version is iter3): https://docs.google.com/spreadsheets/d/1AqC6aqKgyf_s-R1LculocodpjcHRMw6t-pUoVsvryxs/edit?usp=sharing

The latest version of the output dataset is uploaded to the path we were using for ProbeMiner data (gs://otar001-core/ProbeMiner/annotation). I propose renaming the parent directory to accommodate

Actions:

JarrodBaker commented 3 years ago

@ireneisdoomed just confirming, regarding cancer biomarkers and your comment on 9 August: you don't want any cancer biomarker information to be available via the API?

andrewhercules commented 2 years ago

Ticket closed as target is part of the 21.09 release.