Closed jdhayhurst closed 5 months ago
I've created some dummy data that may be useful here. The keys are not set in stone, but this is what I'm using for prototyping the API.
target facet search data:
{"label":"Advanced Clinical","category":"Tractability Antibody","entityIds":["ENSG00000066468", "ENSG00000113721"]}
{"label":"Human Protein Atlas loc","category":"Tractability Antibody","entityIds":["ENSG00000081923", "ENSG00000168036"]}
{"label":"GO CC high conf","category":"Tractability Antibody","entityIds":["ENSG00000073734", "ENSG00000081923", "ENSG00000005471"]}
{"label":"GO CC med conf","category":"Tractability Antibody","entityIds":["ENSG00000044446"]}
{"label":"Adhesion","category":"Target Classes","entityIds":["ENSG00000168036"]}
{"label":"Transcription factor","category":"Target Classes","entityIds":["ENSG00000141510", "ENSG00000012504"]}
{"label":"Transporter","category":"Target Classes","entityIds":["ENSG00000073734", "ENSG00000005471"]}
disease facet search:
{"label":"Biological_process","category":"Therapeutic Areas","entityIds":["EFO_0000544", "EFO_0002950"]}
{"label":"Cardiovascular disease","category":"Therapeutic Areas","entityIds":["MONDO_0005277", "EFO_0001645"]}
{"label":"Hematologic disease","category":"Therapeutic Areas","entityIds":["EFO_0004246", "MONDO_0011382", "EFO_0007444"]}
{"label":"Infectious disease","category":"Therapeutic Areas","entityIds":["EFO_0007328"]}
{"label":"Measurement","category":"Therapeutic Areas","entityIds":["EFO_0005208"]}
{"label":"Phenotype","category":"Therapeutic Areas","entityIds":["HP_0002315", "HP_0100607"]}
{"label":"Psychiatric disorder","category":"Therapeutic Areas","entityIds":["EFO_0005611", "MONDO_0004975"]}
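To make the intended lookup concrete, here is a minimal Python sketch that resolves entity IDs from a facet label, using a few of the dummy target records above. The helper names `load_facets` and `resolve_entities`, and the case-insensitive exact match on the label, are assumptions for prototyping and not the behaviour of the actual API.

```python
import json

# A few of the dummy target facet records from above, one JSON object per line.
TARGET_FACETS_JSONL = """\
{"label":"Advanced Clinical","category":"Tractability Antibody","entityIds":["ENSG00000066468", "ENSG00000113721"]}
{"label":"Adhesion","category":"Target Classes","entityIds":["ENSG00000168036"]}
{"label":"Transporter","category":"Target Classes","entityIds":["ENSG00000073734", "ENSG00000005471"]}
"""


def load_facets(jsonl):
    """Parse JSON-lines text into a list of facet dicts."""
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]


def resolve_entities(facets, term, category=None):
    """Return the sorted, de-duplicated entity IDs for a facet label.

    Matching is case-insensitive and exact (an assumption for this sketch);
    `category` optionally restricts the lookup to one category.
    """
    wanted = term.casefold()
    ids = set()
    for facet in facets:
        if facet["label"].casefold() != wanted:
            continue
        if category is not None and facet["category"] != category:
            continue
        ids.update(facet["entityIds"])
    return sorted(ids)


facets = load_facets(TARGET_FACETS_JSONL)
print(resolve_entities(facets, "transporter"))
```

A real index would most likely use analysed text matching and not exact string comparison, so this only illustrates the facet-to-entity resolution.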
After some iterations, I have created two dataframes containing the data @jdhayhurst discussed above and sent them to him for review.
TargetId and categoryId are collected for each combination of categoryType and categoryLabel. This dataframe contains the information for approvedSymbol, go_terms, subcellular location, target class and pathway, extracted from the target file of the public platform (version 24.03). The schema is the following:
|-- categoryType: string (nullable = true)
|-- categoryLabel: string (nullable = true)
|-- targetId: array (nullable = false)
| |-- element: string (containsNull = false)
|-- categoryId: array (nullable = false)
| |-- element: string (containsNull = false)
The second dataframe includes diseaseId and therapeuticAreasId, extracted from the disease file of the public platform (version 24.03), with the following schema:
|-- categoryType: string (nullable = false)
|-- categoryLabel: string (nullable = true)
|-- diseaseId: array (nullable = false)
| |-- element: string (containsNull = true)
|-- categoryId: array (nullable = false)
| |-- element: string (containsNull = true)
In the disease subset, each disease displays its name in categoryLabel and its identifier in both the diseaseId and categoryId columns.
In the therapeuticAreas subset, all diseases are aggregated by their corresponding therapeutic areas, so the diseaseId column is a list of all diseases belonging to the given therapeutic area, while the therapeutic area name is shown in categoryLabel and its identifier in the categoryId column.
An example of this dataframe:
| categoryType| categoryLabel| diseaseId| categoryId|
|therapeutic area|reproductive syst...|[OTAR_0000017, EF...| [OTAR_0000017]|
| disease|angioimmunoblasti...| [EFO_0000255]| [EFO_0000255]|
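As a rough illustration of how the two subsets relate, the sketch below builds both kinds of rows in plain Python. The input records and their field names (`id`, `name`, `therapeuticAreas`) are assumptions for this example, the IDs are placeholders and not real ontology terms, and the actual dataframes were produced with Spark, not with this code.

```python
from collections import defaultdict

# Hypothetical disease records; IDs and names are placeholders.
diseases = [
    {"id": "AREA_1", "name": "example area", "therapeuticAreas": ["AREA_1"]},
    {"id": "DISEASE_1", "name": "example disease one", "therapeuticAreas": ["AREA_1"]},
    {"id": "DISEASE_2", "name": "example disease two", "therapeuticAreas": ["AREA_1"]},
]


def build_disease_facets(diseases):
    """Build 'disease' rows (one per disease) and 'therapeutic area' rows
    (one per area, aggregating the IDs of all diseases in that area)."""
    names = {d["id"]: d["name"] for d in diseases}
    rows = []

    # Disease subset: name in categoryLabel, identifier in both ID columns.
    for d in diseases:
        rows.append({
            "categoryType": "disease",
            "categoryLabel": d["name"],
            "diseaseId": [d["id"]],
            "categoryId": [d["id"]],
        })

    # Therapeutic area subset: group disease IDs by the areas they belong to.
    by_area = defaultdict(set)
    for d in diseases:
        for area_id in d["therapeuticAreas"]:
            by_area[area_id].add(d["id"])

    for area_id in sorted(by_area):
        rows.append({
            "categoryType": "therapeutic area",
            "categoryLabel": names.get(area_id),  # None if the area has no record
            "diseaseId": sorted(by_area[area_id]),
            "categoryId": [area_id],
        })
    return rows


for row in build_disease_facets(diseases):
    print(row)
```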
Discussed on 16/4/24: we plan to replace the tractability filter from the old association page at this stage as well. For example, when filtering by a modality, the facet search will return all targets that have been targeted by a drug of that modality (e.g. antibody, small molecule), or, when searching by phase 1 clinical, the facet will return all targets with drugs in phase 1 clinical trials.
As discussed with the team, I am adding here some search examples for them to explore the dataset and feasibility: small molecule, antibody, phase 1 clinical, approved drug, protac, high quality pocket, druggable family, advanced clinical
Thanks @buniello and @Juanmaria-rr, I've managed to merge in the tractability data for all the modalities in this branch of the etl. Once the etl is complete, I'll rerun everything and release to dev for testing
We need to generate a facet search OpenSearch index. The purpose is to resolve targets/diseases from their facets, e.g. search for all the targets where facet is "search term".
There are two indices to create:
The schema is the same for both.
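Since both indices share one schema, a single query shape should serve both. The sketch below only builds an OpenSearch query body as a Python dict and does not call any client. The field names `label` and `category` are taken from the prototype dummy data and may change, and the `term` filter assumes `category` is mapped as a keyword field. The helper name `facet_query` is hypothetical.

```python
import json


def facet_query(term, category=None, size=10):
    """Build a query body that full-text matches `term` against `label`,
    optionally restricted to a single category."""
    query = {"bool": {"must": [{"match": {"label": term}}]}}
    if category is not None:
        # Assumes `category` is a keyword field so exact matching works.
        query["bool"]["filter"] = [{"term": {"category": category}}]
    return {"size": size, "query": query}


print(json.dumps(facet_query("phase 1 clinical"), indent=2))
```

The same body could be sent to either the target or the disease index, with only the index name changing.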
1. target facet search
The elements from the "target" schema that we want to include:
Unless stated, the field name from the schema above becomes the category and the values become the label; targetIds will be the array of all the target id(s) aggregated on the label value. For instance, aggregate on approvedSymbol and set targetIds to the array of target.id. Then repeat for approvedName and so on. GO will need to draw from the target - gene ontology data. There should also be a category for target id, which will simply contain a label of the target id, and an array (of one) with that target id.
2. disease facet search
As above in terms of implementation:
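The aggregation described for the target index, which the disease index is meant to mirror, can be sketched in plain Python as follows. The sample records are placeholders, the function name `build_target_facets` is hypothetical, and using `id` as the category name for the target-id facet is an assumption, since the issue does not name that category.

```python
from collections import defaultdict

# Hypothetical target records; IDs, symbols and names are placeholders.
targets = [
    {"id": "TARGET_1", "approvedSymbol": "SYM1", "approvedName": "example protein"},
    {"id": "TARGET_2", "approvedSymbol": "SYM2", "approvedName": "example protein"},
]


def build_target_facets(targets, fields=("approvedSymbol", "approvedName")):
    """For each field, use the field name as category and each value as label,
    collecting the IDs of all targets that share that value."""
    grouped = defaultdict(set)
    for target in targets:
        # Facet for the target id itself: label is the id, array of one.
        grouped[("id", target["id"])].add(target["id"])
        for field in fields:
            value = target.get(field)
            if value is not None:
                grouped[(field, value)].add(target["id"])
    return [
        {"category": category, "label": label, "targetIds": sorted(grouped[(category, label)])}
        for category, label in sorted(grouped)
    ]


for facet in build_target_facets(targets):
    print(facet)
```

Fields that hold arrays or nested values, such as GO terms, would need to be flattened before grouping, which this sketch does not cover.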
Background
See here for more details: https://github.com/opentargets/issues/issues/3239
Tasks
Acceptance tests
How do we know the task is complete?