opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Predicted pockets: Loading the data on backend #3323

Open prashantuniyal02 opened 1 month ago

prashantuniyal02 commented 1 month ago

The predicted pockets data needs to be loaded into the backend (BE). This will require changes at several steps of the BE processes.

Background

This issue covers the BE work for enabling predicted pockets data updates in PPP. The data-related issue is #3296.

The most current pocket file is here: gs://otar000-evidence_input/predicted_binding_sites/2024.05.17/23.10.16_af2_human_pocket_summary.tsv (There is currently no process in place to version or update this file; ad hoc updates are expected for now.)

Tasks

The plan at the moment:

[1] Schema:
```
root
 |-- uniprotId: string (nullable = true)
 |-- structureId: string (nullable = true)
 |-- pocketId: integer (nullable = true)
 |-- pocketResidues: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- pocketScore: float (nullable = true)
```
Sample data:
```
+----------+-------------+--------+--------------------+-----------+
| uniprotId|  structureId|pocketId|      pocketResidues|pocketScore|
+----------+-------------+--------+--------------------+-----------+
|A0A024RBG1|A0A024RBG1-F1|       1|[2, 3, 4, 5, 6, 7...|   979.4576|
|A0A024RBG1|A0A024RBG1-F1|       2|[54, 57, 58, 60, ...|  938.22205|
|A0A075B6H5|A0A075B6H5-F1|       1|[7, 8, 9, 10, 11,...|   891.6838|
|A0A075B6H5|A0A075B6H5-F1|       2|[82, 83, 84, 85, ...|  926.75757|
|A0A075B6H7|A0A075B6H7-F1|       1|[26, 27, 28, 29, ...|  981.65485|
|A0A075B6H8|A0A075B6H8-F1|       1|[28, 29, 30, 31, ...|  979.80493|
|A0A075B6H8|A0A075B6H8-F1|       2|[58, 59, 60, 61, ...|   931.4943|
|A0A075B6H8|A0A075B6H8-F1|       3|[19, 20, 21, 22, ...|  836.25006|
|A0A075B6H8|A0A075B6H8-F1|       4|[23, 24, 25, 26, ...|  829.20355|
|A0A075B6H9|A0A075B6H9-F1|       1|[24, 25, 26, 27, ...|   968.0503|
|A0A075B6I0|A0A075B6I0-F1|       1|[28, 29, 30, 31, ...|    976.313|
|A0A075B6I1|A0A075B6I1-F1|       1|[25, 26, 27, 28, ...|   975.2218|
|A0A075B6I3|A0A075B6I3-F1|       1|[23, 24, 25, 26, ...|   978.2658|
|A0A075B6I4|A0A075B6I4-F1|       1|[24, 25, 26, 27, ...|   980.3411|
|A0A075B6I6|A0A075B6I6-F1|       1|[20, 21, 22, 23, ...|    972.134|
|A0A075B6I7|A0A075B6I7-F1|       1|[21, 22, 23, 25, ...|  901.32117|
|A0A075B6I9|A0A075B6I9-F1|       1|[25, 26, 27, 28, ...|   980.1896|
|A0A075B6I9|A0A075B6I9-F1|       2|[58, 59, 60, 63, ...|  933.25616|
|A0A075B6J1|A0A075B6J1-F1|       1|[21, 22, 23, 24, ...|  970.94403|
|A0A075B6J2|A0A075B6J2-F1|       1|[23, 24, 25, 26, ...|    965.884|
+----------+-------------+--------+--------------------+-----------+
only showing top 20 rows
```
[2] Code prototype:
```
import json
from typing import List

import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


@f.udf(t.ArrayType(t.IntegerType()))
def parse_residues(res: str) -> List[int]:
    # The residue column is parsed as a JSON-encoded list of integers:
    return json.loads(res)


pocket_score_threshold = 800.0
file_name = 'gs://otar000-evidence_input/predicted_binding_sites/2024.05.17/23.10.16_af2_human_pocket_summary.tsv'

(
    spark.read.csv(file_name, header=True, sep='\t')
    .select(
        f.split(f.col('struct_id'), '-')[0].alias('uniprotId'),
        f.col('struct_id').alias('structureId'),
        f.col('pocket_id').cast(t.IntegerType()).alias('pocketId'),
        parse_residues(f.col('pocket_resid')).alias('pocketResidues'),
        # Pocket score is the scaled combined score:
        f.col('pocket_score_combined_scaled').cast(t.FloatType()).alias('pocketScore')
    )
    # We might want to introduce some logic to filter un-reliable pockets:
    .filter(f.col('pocketScore') > pocket_score_threshold)
    .show()
)
```
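For unit-testing the transformation without a Spark session, the per-row logic of the prototype could be mirrored in plain Python. This is only a sketch under assumptions: `Pocket` and `parse_pocket_row` are hypothetical names that do not exist in the BE code, the input column names are taken from the prototype above, and `pocket_resid` is assumed to hold a JSON-encoded list (as the prototype's `json.loads` call implies). The 800.0 threshold is the prototype's placeholder, not an agreed value.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional

# Placeholder cut-off copied from the prototype; the final value is undecided.
POCKET_SCORE_THRESHOLD = 800.0


@dataclass(frozen=True)
class Pocket:
    """One output record, mirroring the schema in [1] (hypothetical helper)."""

    uniprotId: str
    structureId: str
    pocketId: int
    pocketResidues: List[int]
    pocketScore: float


def parse_pocket_row(
    row: Dict[str, str], threshold: float = POCKET_SCORE_THRESHOLD
) -> Optional[Pocket]:
    """Convert one TSV row (column name -> raw string) into a Pocket.

    Returns None when the score is not strictly above the threshold,
    matching the `pocketScore > pocket_score_threshold` filter.
    """
    score = float(row["pocket_score_combined_scaled"])
    if score <= threshold:
        return None
    structure_id = row["struct_id"]
    return Pocket(
        # The UniProt accession is the part of the structure id before '-'.
        uniprotId=structure_id.split("-")[0],
        structureId=structure_id,
        pocketId=int(row["pocket_id"]),
        pocketResidues=json.loads(row["pocket_resid"]),
        pocketScore=score,
    )
```

A call such as `parse_pocket_row({"struct_id": "A0A024RBG1-F1", "pocket_id": "1", "pocket_resid": "[2, 3, 4]", "pocket_score_combined_scaled": "979.4576"})` (the residue list here is made up for illustration) yields a `Pocket` with `uniprotId == "A0A024RBG1"`. In the Spark job itself, the Python UDF could probably be replaced by the built-in `f.from_json` with an `array<int>` schema, which avoids UDF serialization overhead; that has not been tested against the actual file.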

Acceptance tests

How do we know the task is complete?