The predicted pockets data needs to be uploaded to the backend, which implies changes at multiple steps in the BE processes.
Background
This issue covers the BE work for enabling Predicted pockets data updates in PPP. The related data issue is #3296.
The most current pocket file is here: gs://otar000-evidence_input/predicted_binding_sites/2024.05.17/23.10.16_af2_human_pocket_summary.tsv
(Currently there is no process in place to version or update this file; ad hoc updates are expected for the moment.)
Tasks
The plan at the moment:
[ ] PIS will pick up this file as is.
[ ] ETL will apply a minor filter to the data. The file contains a number of columns, of which Daniel considers those listed in [1] relevant. A prototype of the code is in [2].
[ ] Pockets can be filtered by the scaled combined score; James S proposed 800 as a threshold.
[ ] The uniprotId column is expected to enable mapping the pockets to the target identifier. Daniel does not know the details of this process in the ETL, but something is already in place in the evidence step.
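The parsing and filtering steps above can be sketched in plain Python (without Spark) to make the intended column mapping explicit. The sample rows below are illustrative values, not taken from the real file:

```python
import json

# Threshold for the scaled combined score, as proposed by James S.
POCKET_SCORE_THRESHOLD = 800.0

def parse_row(struct_id: str, pocket_id: str, pocket_resid: str, score: str) -> dict:
    """Map one raw TSV row onto the filtered output schema."""
    return {
        # The UniProt accession is the prefix of the structure id:
        "uniprotId": struct_id.split("-")[0],
        "structureId": struct_id,
        "pocketId": int(pocket_id),
        # Residues are stored as a JSON-encoded list of integers:
        "pocketResidues": json.loads(pocket_resid),
        "pocketScore": float(score),
    }

# Illustrative rows only; real data comes from the TSV in the bucket.
rows = [
    parse_row("A0A024RBG1-F1", "1", "[2, 3, 4, 5]", "979.4576"),
    parse_row("A0A024RBG1-F1", "2", "[54, 57, 58]", "712.5"),
]
kept = [r for r in rows if r["pocketScore"] > POCKET_SCORE_THRESHOLD]
```

Only the first pocket survives the 800 cutoff here; in the Spark prototype in [2] the same logic is expressed as a `.filter` on `pocketScore`.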
[1] Schema:
```
root
 |-- uniprotId: string (nullable = true)
 |-- structureId: string (nullable = true)
 |-- pocketId: integer (nullable = true)
 |-- pocketResidues: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- pocketScore: float (nullable = true)
```

Sample data:
```
+----------+-------------+--------+--------------------+-----------+
| uniprotId|  structureId|pocketId|      pocketResidues|pocketScore|
+----------+-------------+--------+--------------------+-----------+
|A0A024RBG1|A0A024RBG1-F1|       1|[2, 3, 4, 5, 6, 7...|   979.4576|
|A0A024RBG1|A0A024RBG1-F1|       2|[54, 57, 58, 60, ...|  938.22205|
|A0A075B6H5|A0A075B6H5-F1|       1|[7, 8, 9, 10, 11,...|   891.6838|
|A0A075B6H5|A0A075B6H5-F1|       2|[82, 83, 84, 85, ...|  926.75757|
|A0A075B6H7|A0A075B6H7-F1|       1|[26, 27, 28, 29, ...|  981.65485|
|A0A075B6H8|A0A075B6H8-F1|       1|[28, 29, 30, 31, ...|  979.80493|
|A0A075B6H8|A0A075B6H8-F1|       2|[58, 59, 60, 61, ...|   931.4943|
|A0A075B6H8|A0A075B6H8-F1|       3|[19, 20, 21, 22, ...|  836.25006|
|A0A075B6H8|A0A075B6H8-F1|       4|[23, 24, 25, 26, ...|  829.20355|
|A0A075B6H9|A0A075B6H9-F1|       1|[24, 25, 26, 27, ...|   968.0503|
|A0A075B6I0|A0A075B6I0-F1|       1|[28, 29, 30, 31, ...|    976.313|
|A0A075B6I1|A0A075B6I1-F1|       1|[25, 26, 27, 28, ...|   975.2218|
|A0A075B6I3|A0A075B6I3-F1|       1|[23, 24, 25, 26, ...|   978.2658|
|A0A075B6I4|A0A075B6I4-F1|       1|[24, 25, 26, 27, ...|   980.3411|
|A0A075B6I6|A0A075B6I6-F1|       1|[20, 21, 22, 23, ...|    972.134|
|A0A075B6I7|A0A075B6I7-F1|       1|[21, 22, 23, 25, ...|  901.32117|
|A0A075B6I9|A0A075B6I9-F1|       1|[25, 26, 27, 28, ...|   980.1896|
|A0A075B6I9|A0A075B6I9-F1|       2|[58, 59, 60, 63, ...|  933.25616|
|A0A075B6J1|A0A075B6J1-F1|       1|[21, 22, 23, 24, ...|  970.94403|
|A0A075B6J2|A0A075B6J2-F1|       1|[23, 24, 25, 26, ...|    965.884|
+----------+-------------+--------+--------------------+-----------+
only showing top 20 rows
```

[2] Code prototype:
```python
import json
from typing import List

from pyspark.sql import functions as f, types as t

@f.udf(t.ArrayType(t.IntegerType()))
def parse_residues(res: str) -> List[int]:
    """Parse the JSON-encoded residue list into an integer array."""
    return json.loads(res)

pocket_score_threshold = 800.0
file_name = 'gs://otar000-evidence_input/predicted_binding_sites/2024.05.17/23.10.16_af2_human_pocket_summary.tsv'

(
    spark.read.csv(file_name, header=True, sep='\t')
    .select(
        f.split(f.col('struct_id'), '-')[0].alias('uniprotId'),
        f.col('struct_id').alias('structureId'),
        f.col('pocket_id').cast(t.IntegerType()).alias('pocketId'),
        parse_residues(f.col('pocket_resid')).alias('pocketResidues'),
        # Pocket score is the scaled combined score:
        f.col('pocket_score_combined_scaled').cast(t.FloatType()).alias('pocketScore'),
    )
    # We might want to introduce some logic to filter unreliable pockets:
    .filter(f.col('pocketScore') > pocket_score_threshold)
    .show()
)
```

Acceptance tests
How do we know the task is complete?
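As a starting point, an acceptance check could assert that every emitted pocket clears the score threshold and that the derived uniprotId matches the prefix of its structureId. This is only a hedged sketch: `records` stands in for the ETL output, which a real test would load from the pipeline.

```python
# Illustrative record mimicking the filtered output schema; a real test
# would load the actual ETL output instead.
records = [
    {
        "uniprotId": "A0A024RBG1",
        "structureId": "A0A024RBG1-F1",
        "pocketId": 1,
        "pocketResidues": [2, 3, 4, 5],
        "pocketScore": 979.4576,
    },
]

def check_records(records, threshold=800.0):
    """Basic invariants the filtered pocket data should satisfy."""
    for r in records:
        # Every kept pocket must clear the scaled-combined-score cutoff:
        assert r["pocketScore"] > threshold
        # uniprotId is derived as the prefix of structureId:
        assert r["structureId"].startswith(r["uniprotId"] + "-")
        # Residues must be a list of integer positions:
        assert all(isinstance(x, int) for x in r["pocketResidues"])
    return True
```

Running `check_records(records)` over the full output would catch pockets that slipped past the filter or malformed identifier mappings.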