opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Load `24.10_freeze1` version of the gentropy outputs #3567

Closed d0choa closed 1 month ago

d0choa commented 1 month ago

The genetics ETL has produced consistent outputs stored in gs://ot_orchestration/releases/24.10_freeze1.

We want to load this data in OpenSearch and expose the data through the API to provide a more realistic version of the final data and unblock additional BE/FE work.

The pipeline is currently producing the next set of outputs, and I'm flagging the ones that need to be loaded.

❯ gsutil ls gs://ot_orchestration/releases/24.10_freeze1
gs://ot_orchestration/releases/24.10_freeze1/locus_to_gene_gold_standard.json
gs://ot_orchestration/releases/24.10_freeze1/biosample_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/colocalisation/ ## TO LOAD (Contains 2 subdirectories with the same schema) ##
gs://ot_orchestration/releases/24.10_freeze1/credible_set/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/gene_index/
gs://ot_orchestration/releases/24.10_freeze1/invalid_credible_set/
gs://ot_orchestration/releases/24.10_freeze1/invalid_study_index/
gs://ot_orchestration/releases/24.10_freeze1/locus_to_gene_feature_matrix/
gs://ot_orchestration/releases/24.10_freeze1/manifests/
gs://ot_orchestration/releases/24.10_freeze1/study_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/variant_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/variants/

The previously loaded datasets were handcrafted by @DSuveges. This version of the data comes straight from the ETL in Parquet format.

A frozen description of all schemas can be found here.

The changes in the schemas are not significant compared to previous iterations, and we don't have many more planned changes. All the changes in this version that I can remember (sorry if I miss something):

A probably incomplete list of things that we know are pending in this dataset from a semantic perspective:

@jdhayhurst let us know if we need to clarify something

jdhayhurst commented 1 month ago

I'm having trouble with the variant data, so I compared the variantIndex schema with the previous data I had and noticed that chromosome is missing from the new data. I see it's all partitioned by chromosome, so that may be the reason?

new:

Schema({'variantId': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})

old:

Schema({'variantId': String, 'chromosome': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})

d0choa commented 1 month ago

`chromosome` is used as a `partitionBy` column. If you `read_parquet` the whole `variant_index` directory, the column should be there. If you read the individual partitions one by one, you will not see the column, because its value is stored only in the partition directory name and not inside the files. If this is problematic, it could be done differently.
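To illustrate why the column appears only when reading from the dataset root, here is a minimal stdlib-only sketch. It assumes a Hive-style `chromosome=<value>` directory layout (the usual result of `partitionBy`); the file name `part-0000.parquet` and the helper `partition_values` are hypothetical and not taken from the actual bucket or from any library.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Collect key=value directory segments from a Hive-style partitioned path."""
    values = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


# Hypothetical paths: only the chromosome=<value> directory layout matters here.
from_dataset_root = "variant_index/chromosome=X/part-0000.parquet"
from_partition_dir = "part-0000.parquet"

print(partition_values(from_dataset_root))   # {'chromosome': 'X'}
print(partition_values(from_partition_dir))  # {}
```

A reader that starts at the dataset root can see the `chromosome=X` segment and rebuild the column from it. A reader pointed at a single partition directory only sees the file, so the column is lost.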

jdhayhurst commented 1 month ago

OK, I see. I can enable that, but then the data type is inferred for that field, so for chromosomes this results in mixed types. One option I can explore is passing the schema to the Parquet-to-JSON converter.
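The mixed-type problem and the explicit-schema workaround can be sketched roughly as follows. This is a stdlib-only illustration, not the converter actually used here: `infer` and `cast_record` are hypothetical helpers, and the naive digit-based inference is an assumption standing in for whatever the real reader does with partition values.

```python
import json


def infer(value: str):
    """Naive inference: all-digit values become int, everything else stays str."""
    return int(value) if value.isdigit() else value


def cast_record(record: dict, schema: dict) -> dict:
    """Cast each field using the type given in an explicit schema."""
    return {name: schema[name](value) for name, value in record.items()}


# Partition values as they would appear in the directory names.
partition_dirs = ["1", "2", "X", "Y"]

# Inference yields int for autosomes and str for X/Y: mixed types in one field.
inferred = [{"chromosome": infer(c)} for c in partition_dirs]
print(sorted({type(r["chromosome"]).__name__ for r in inferred}))  # ['int', 'str']

# An explicit schema keeps every chromosome as a string.
schema = {"chromosome": str}
explicit = [cast_record({"chromosome": c}, schema) for c in partition_dirs]
print(json.dumps(explicit[0]))  # {"chromosome": "1"}
```

Pinning `chromosome` to a string type up front avoids an index mapping in which `1` and `X` disagree on type. In practice, the real Parquet reader would need an equivalent option for declaring the type of the partition column.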