opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Generate facet search data #3268

Open jdhayhurst opened 3 months ago

jdhayhurst commented 3 months ago

We need to generate an OpenSearch index for facet search. Its purpose is to resolve targets/diseases from their facets, e.g. find all the targets whose facet matches a given search term.

There are two indices to create:

  1. target facet search where the "rows" or documents are targets
  2. disease facet search where the "rows" or documents are diseases

The schema is the same for both.

- id: string, identifier of the facet
- category: string, parent category that the facet belongs to
- label: string, facet label that users will search 
- targetIds/diseaseIds: array, ids of the targets/diseases (depending on whether it is the target or disease facet index) matching that facet
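
To make the schema above concrete, here is a minimal sketch of one facet document. The class name `FacetDocument`, the method `to_index_doc` and the example values are illustrative assumptions, not part of the ETL; only the four field names come from the list above.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class FacetDocument:
    """One row of a facet search index (hypothetical helper class)."""
    id: str          # identifier of the facet
    category: str    # parent category the facet belongs to
    label: str       # text that users will search
    entity_ids: List[str] = field(default_factory=list)

    def to_index_doc(self, ids_key: str) -> dict:
        # ids_key is "targetIds" for the target index, "diseaseIds" for the disease index
        return {
            "id": self.id,
            "category": self.category,
            "label": self.label,
            ids_key: list(self.entity_ids),
        }


# Illustrative values only
doc = FacetDocument(id="GO:0005634", category="C", label="nucleus",
                    entity_ids=["ENSG00000141510"])
print(json.dumps(doc.to_index_doc("targetIds")))
```

The same class serves both indices; only the name of the id array changes.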

1. target facet search

The elements from the "target" schema that we want to include:

target
      ├───id : string
      ├───approvedSymbol : string
      ├───approvedName : string
      ├───go: array 
      │   ├───element: struct 
      │   │   ├───id : string -> resolve "Target - gene ontology" go.name (label)
      │   │   ├───aspect : string (category) 
      ├───subcellularLocations: array 
      │   ├───element: struct 
      │   │   ├───location : string
      ├───targetClass: array (category: chembl target class)
      │   ├───element: struct 
      │   │   ├───label : string
      ├───pathways: array (category: reactome)
      │   ├───element: struct 
      │   │   ├───pathway : string 

Unless stated otherwise, the field names from the schema above become the category and the field values become the label. The targetIds will be the array of all target ids aggregated on the label value. For instance, aggregate on approvedSymbol and set targetIds to the array of target.id, then repeat for approvedName and so on. GO labels will need to be resolved from the "Target - gene ontology" data. There should also be a category for the target id itself, whose label is the target id and whose targetIds array holds just that one id.
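
The aggregation described above could be sketched as follows in plain Python. This is only an illustration under assumptions: the function name `build_target_facets`, the `go_names` lookup (standing in for the "Target - gene ontology" data), the fallback to the GO id when a name is missing, and the sample records are all hypothetical. The real ETL would do this with dataframe group-bys.

```python
from collections import defaultdict


def build_target_facets(targets, go_names):
    """Group target ids by (category, label). Hypothetical helper."""
    grouped = defaultdict(set)
    for t in targets:
        tid = t["id"]
        grouped[("id", tid)].add(tid)  # target id facet: array of one
        grouped[("approvedSymbol", t["approvedSymbol"])].add(tid)
        grouped[("approvedName", t["approvedName"])].add(tid)
        for go in t.get("go", []):
            # label resolved from the GO lookup; falling back to the id is an assumption
            label = go_names.get(go["id"], go["id"])
            grouped[(go["aspect"], label)].add(tid)
        for loc in t.get("subcellularLocations", []):
            grouped[("subcellularLocations", loc["location"])].add(tid)
        for tc in t.get("targetClass", []):
            grouped[("chembl target class", tc["label"])].add(tid)
        for pw in t.get("pathways", []):
            grouped[("reactome", pw["pathway"])].add(tid)
    return [
        {"category": category, "label": label, "targetIds": sorted(ids)}
        for (category, label), ids in sorted(grouped.items())
    ]


# Made-up records for illustration
targets = [
    {"id": "T1", "approvedSymbol": "AAA", "approvedName": "gene a",
     "go": [{"id": "GO:0005634", "aspect": "C"}],
     "targetClass": [{"label": "Transporter"}]},
    {"id": "T2", "approvedSymbol": "BBB", "approvedName": "gene b",
     "targetClass": [{"label": "Transporter"}]},
]
facets = build_target_facets(targets, {"GO:0005634": "nucleus"})
for facet in facets:
    print(facet)
```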

2. disease facet search

As above in terms of implementation:

disease
      ├───id : string
      ├───name : string
      ├───therapeuticAreas: array 

Background

See here for more details: https://github.com/opentargets/issues/issues/3239

Tasks

Acceptance tests

How do we know the task is complete?

  1. When I run the ETL, the facet search data are generated.

jdhayhurst commented 2 months ago

I've created some dummy data that may be useful here. The keys are not set in stone, but this is what I'm using for prototyping the API.

target facet search data:

{"label":"Advanced Clinical","category":"Tractability Antibody","entityIds":["ENSG00000066468", "ENSG00000113721"]}
{"label":"Human Protein Atlas loc","category":"Tractability Antibody","entityIds":["ENSG00000081923", "ENSG00000168036"]}
{"label":"GO CC high conf","category":"Tractability Antibody","entityIds":["ENSG00000073734", "ENSG00000081923", "ENSG00000005471"]}
{"label":"GO CC med conf","category":"Tractability Antibody","entityIds":["ENSG00000044446"]}
{"label":"Adhesion","category":"Target Classes","entityIds":["ENSG00000168036"]}
{"label":"Transcription factor","category":"Target Classes","entityIds":["ENSG00000141510", "ENSG00000012504"]}
{"label":"Transporter","category":"Target Classes","entityIds":["ENSG00000073734", "ENSG00000005471"]}
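
For prototyping against dummy data in this shape, a small in-memory lookup may be enough. The sketch below is a stand-in and not the OpenSearch query itself: `load_facets` and `resolve` are hypothetical names, and the case-insensitive substring match is an assumption about how a search term is compared with a label.

```python
import json

# Two of the dummy target facet rows above, as JSON lines
JSONL = """
{"label":"Advanced Clinical","category":"Tractability Antibody","entityIds":["ENSG00000066468", "ENSG00000113721"]}
{"label":"Transcription factor","category":"Target Classes","entityIds":["ENSG00000141510", "ENSG00000012504"]}
"""


def load_facets(text):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def resolve(facets, term, category=None):
    """Return sorted entity ids of facets whose label contains the term."""
    needle = term.casefold()
    ids = set()
    for facet in facets:
        if category is not None and facet["category"] != category:
            continue
        if needle in facet["label"].casefold():
            ids.update(facet["entityIds"])
    return sorted(ids)


facets = load_facets(JSONL)
print(resolve(facets, "transcription"))
print(resolve(facets, "clinical", category="Tractability Antibody"))
```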

disease facet search:

{"label":"Biological_process","category":"Therapeutic Areas","entityIds":["EFO_0000544", "EFO_0002950"]}
{"label":"Cardiovascular disease","category":"Therapeutic Areas","entityIds":["MONDO_0005277", "EFO_0001645"]}
{"label":"Hematologic disease","category":"Therapeutic Areas","entityIds":["EFO_0004246", "MONDO_0011382", "EFO_0007444"]}
{"label":"Infectious disease","category":"Therapeutic Areas","entityIds":["EFO_0007328"]}
{"label":"Measurement","category":"Therapeutic Areas","entityIds":["EFO_0005208"]}
{"label":"Phenotype","category":"Therapeutic Areas","entityIds":["HP_0002315", "HP_0100607"]}
{"label":"Psychiatric disorder","category":"Therapeutic Areas","entityIds":["EFO_0005611", "MONDO_0004975"]}

Juanmaria-rr commented 2 months ago

After some iterations I have created two dataframes containing the data discussed above by @jdhayhurst and sent them to him for review.

1. target facet search.

TargetId and categoryId are collected for each combination of categoryType and categoryLabel. This dataframe contains the information for approvedSymbol, go_terms, subcellular location, target class and pathway, extracted from the target file of the public platform (version 24.03). The schema is the following:

 |-- categoryType: string (nullable = true)
 |-- categoryLabel: string (nullable = true)
 |-- targetId: array (nullable = false)
 |    |-- element: string (containsNull = false)
 |-- categoryId: array (nullable = false)
 |    |-- element: string (containsNull = false)

2. disease facet search.

DiseaseId and therapeuticAreasId are the data included, extracted from the disease file from the public platform (version 24.03), with the following schema:


 |-- categoryType: string (nullable = false)
 |-- categoryLabel: string (nullable = true)
 |-- diseaseId: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- categoryId: array (nullable = false)
 |    |-- element: string (containsNull = true)

In the disease subset, each disease has its name in categoryLabel and its identifier in both the diseaseId and categoryId columns. In the therapeuticAreas subset, diseases are aggregated by their therapeutic area, so the diseaseId column lists all diseases belonging to the given therapeutic area, while the therapeutic area name is shown in categoryLabel and its identifier in the categoryId column.
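
The two subsets described above can be illustrated with a short plain-Python sketch. It is not the code that produced the dataframe: the function name `build_disease_facets` and the sample records are hypothetical, and it assumes that therapeuticAreas holds disease ids whose names can be looked up in the same disease file.

```python
from collections import defaultdict


def build_disease_facets(diseases):
    """Build 'disease' rows and 'therapeutic area' rows. Hypothetical helper."""
    names = {d["id"]: d["name"] for d in diseases}
    rows = []

    # Disease subset: one row per disease, id repeated in both id columns
    for d in diseases:
        rows.append({
            "categoryType": "disease",
            "categoryLabel": d["name"],
            "diseaseId": [d["id"]],
            "categoryId": [d["id"]],
        })

    # Therapeutic area subset: diseases aggregated by area id
    by_area = defaultdict(set)
    for d in diseases:
        for area_id in d.get("therapeuticAreas", []):
            by_area[area_id].add(d["id"])
    for area_id, ids in sorted(by_area.items()):
        rows.append({
            "categoryType": "therapeutic area",
            "categoryLabel": names.get(area_id),  # None if the area is not in the file
            "diseaseId": sorted(ids),
            "categoryId": [area_id],
        })
    return rows


# Made-up records; TA1 lists itself as its own therapeutic area
diseases = [
    {"id": "TA1", "name": "area one", "therapeuticAreas": ["TA1"]},
    {"id": "D1", "name": "disease one", "therapeuticAreas": ["TA1"]},
    {"id": "D2", "name": "disease two", "therapeuticAreas": []},
]
rows = build_disease_facets(diseases)
for row in rows:
    print(row)
```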

An example of this dataframe:

|    categoryType|       categoryLabel|           diseaseId|     categoryId|
|therapeutic area|reproductive syst...|[OTAR_0000017, EF...| [OTAR_0000017]|
|         disease|angioimmunoblasti...|       [EFO_0000255]|  [EFO_0000255]|

buniello commented 2 months ago

Discussed on 16/4/24: at this stage we also plan to replace the tractability filter from the old association page. For example, filtering by a modality will return all targets that have been targeted by a drug of that modality (e.g. antibody, small molecule), and searching by "phase 1 clinical" will return all targets with drugs in phase 1 clinical trials. As discussed with the team, I am adding some search examples here so they can explore the dataset and assess feasibility: small molecule, antibody, phase 1 clinical, approved drug, protac, high quality pocket, druggable family, advanced clinical
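
One way the tractability facets might be derived is sketched below. It rests on assumptions that should be checked against the target dataset: that each target carries a `tractability` array of `{modality, id, value}` entries, that only entries with a true `value` count, and that modality codes map to category names in the way `MODALITY_CATEGORY` guesses (only "Tractability Antibody" appears in the dummy data earlier in this thread). The function name and sample records are hypothetical.

```python
from collections import defaultdict

# Assumed mapping from modality code to facet category
MODALITY_CATEGORY = {
    "SM": "Tractability Small Molecule",
    "AB": "Tractability Antibody",
    "PR": "Tractability PROTAC",
    "OC": "Tractability Other Modalities",
}


def build_tractability_facets(targets):
    """Group target ids by (modality category, tractability bucket)."""
    grouped = defaultdict(set)
    for t in targets:
        for item in t.get("tractability", []):
            if not item["value"]:
                continue  # keep only buckets the target actually falls into
            category = MODALITY_CATEGORY.get(item["modality"], item["modality"])
            grouped[(category, item["id"])].add(t["id"])
    return [
        {"category": category, "label": label, "entityIds": sorted(ids)}
        for (category, label), ids in sorted(grouped.items())
    ]


# Made-up records for illustration
targets = [
    {"id": "T1", "tractability": [
        {"modality": "AB", "id": "Advanced Clinical", "value": True},
        {"modality": "SM", "id": "Approved Drug", "value": False},
    ]},
    {"id": "T2", "tractability": [
        {"modality": "AB", "id": "Advanced Clinical", "value": True},
        {"modality": "SM", "id": "Phase 1 Clinical", "value": True},
    ]},
]
tractability_facets = build_tractability_facets(targets)
for facet in tractability_facets:
    print(facet)
```

With rows in this shape, a search for "advanced clinical" or "phase 1 clinical" resolves to targets the same way as any other facet label.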

jdhayhurst commented 2 months ago

Thanks @buniello and @Juanmaria-rr, I've managed to merge in the tractability data for all the modalities in this branch of the ETL. Once the ETL is complete, I'll rerun everything and release to dev for testing.