Load `24.10_freeze1` version of the gentropy outputs

d0choa commented 1 month ago

The genetics ETL has produced consistent outputs stored in gs://ot_orchestration/releases/24.10_freeze1.

We want to load this data into OpenSearch and expose it through the API, to provide a more realistic version of the final data and unblock additional BE/FE work.

The pipeline currently produces the following outputs; I have flagged the ones that need to be loaded.

❯ gsutil ls gs://ot_orchestration/releases/24.10_freeze1
gs://ot_orchestration/releases/24.10_freeze1/locus_to_gene_gold_standard.json
gs://ot_orchestration/releases/24.10_freeze1/biosample_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/colocalisation/ ## TO LOAD (Contains 2 subdirectories with the same schema) ##
gs://ot_orchestration/releases/24.10_freeze1/credible_set/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/gene_index/
gs://ot_orchestration/releases/24.10_freeze1/invalid_credible_set/
gs://ot_orchestration/releases/24.10_freeze1/invalid_study_index/
gs://ot_orchestration/releases/24.10_freeze1/locus_to_gene_feature_matrix/
gs://ot_orchestration/releases/24.10_freeze1/manifests/
gs://ot_orchestration/releases/24.10_freeze1/study_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/variant_index/ ## TO LOAD ##
gs://ot_orchestration/releases/24.10_freeze1/variants/

The previously loaded datasets were handcrafted by @DSuveges. This version of the data comes straight from the ETL in Parquet format.

A frozen description of all schemas can be found here.

The schema changes are not significant compared to previous iterations, and we don't have many more changes planned. The changes in this version that I can remember (sorry if I have missed something):

- studyLocusId is now a string straight from the ETL. There is a mixture of integer and hexadecimal hashes now, but they will all be hexadecimal in the final version.
- studyType is a new column in credible sets
- confidence is a new column in credible sets
- study_index has two new columns, sumStatQCPerformed and sumStatQCValues

A probably incomplete list of things that we know are pending in this dataset from a semantic perspective:

- PICS results based on major ancestry
- Newly harmonised sumstats + sumstats QC
- GWAS catalog study filtering based on curation + qc_sumstats
- hasSummaryStats field in the study index to be properly populated
- L2G predictions + evidence dataset
- Additional credible set validation (e.g. confirming PP sum 0.99-1)
- New GWAS Catalog SuSiE fine-mapping (pan-UKB LD reference, etc.)
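
For the pending credible set validation, the "PP sum 0.99-1" check could look roughly like the sketch below. This is only an illustration: the function name, the credible set IDs, the example probabilities and the floating-point tolerance are all assumptions, not part of the ETL.

```python
import math


def pp_sum_in_range(posterior_probabilities, lower=0.99, upper=1.0, tol=1e-6):
    """Return True if the posterior probabilities sum to a value in [lower, upper].

    `tol` is an assumed allowance for floating-point rounding.
    """
    total = math.fsum(posterior_probabilities)
    return (lower - tol) <= total <= (upper + tol)


# Made-up credible sets: ID -> posterior probabilities of the variants in the set.
credible_sets = {
    "cs_ok": [0.6, 0.3, 0.095],
    "cs_low": [0.5, 0.2],
}

# Collect the credible sets that would be flagged by the check.
failing = [cs_id for cs_id, pps in credible_sets.items() if not pp_sum_in_range(pps)]
```

Here `failing` contains only the credible set whose probabilities sum well below 0.99.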

@jdhayhurst let us know if we need to clarify something

jdhayhurst commented 1 month ago

I'm having difficulties with the variant data, so I compared the variantIndex schema with the previous data I had and noticed that chromosome is missing from the new data. I see it's all partitioned by chromosome, so that may be the reason?

new:

Schema({'variantId': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})

old:

Schema({'variantId': String, 'chromosome': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})

d0choa commented 1 month ago

chromosome is used as a partitionBy column. If you read_parquet the whole variant_index directory, the column should be there. If you read each of the individual partitions, you will not see the column. If this is problematic, it could be done differently.
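
To illustrate why the column disappears from individual files: with a partitionBy column, the value typically lives in the directory name (`chromosome=<value>`) rather than inside the Parquet file. A minimal sketch, assuming that Hive-style layout and using hypothetical file names, of recovering the value from the path as a plain string:

```python
from pathlib import PurePosixPath


def hive_partition_values(path: str) -> dict[str, str]:
    """Return the key=value pairs encoded in a Hive-style partitioned path.

    Values are always returned as strings; no type is inferred.
    """
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        if sep and key and value:
            values[key] = value
    return values


# Hypothetical file names; only the chromosome=<value> directory matters.
paths = [
    "variant_index/chromosome=1/part-00000.parquet",
    "variant_index/chromosome=X/part-00000.parquet",
]
chromosomes = [hive_partition_values(p)["chromosome"] for p in paths]
```

Readers that understand this layout rebuild the column from the path when given the top-level directory, which is why it only shows up when the whole variant_index directory is read.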

jdhayhurst commented 1 month ago

OK, I see. I can enable that, but then the data type is inferred for that field, so for chromosomes this results in mixed types. One option I can explore is passing the schema to the Parquet-to-JSON converter.
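
A rough sketch of the mixed-type problem and of the schema idea, in plain Python rather than the actual converter. The `infer` function only mimics naive inference, and `SCHEMA`, `coerce` and the example rows are hypothetical:

```python
import json

# Assumed target types for a few fields; the real schema has many more.
SCHEMA = {"variantId": str, "chromosome": str, "position": int}


def infer(value: str):
    """Mimic naive inference: digit-only strings become integers."""
    return int(value) if value.isdigit() else value


def coerce(record: dict, schema: dict) -> dict:
    """Cast every non-null field named in the schema to its declared type."""
    return {
        key: (schema[key](value) if key in schema and value is not None else value)
        for key, value in record.items()
    }


# Made-up rows; chromosome comes from the partition path and is inferred.
rows = [
    {"variantId": "1_1000_A_T", "chromosome": infer("1"), "position": 1000},
    {"variantId": "X_2000_G_C", "chromosome": infer("X"), "position": 2000},
]

# Without a schema, chromosome is an int for "1" and a str for "X".
inferred_types = {type(r["chromosome"]).__name__ for r in rows}

# With the schema applied, every chromosome is serialised as a JSON string.
json_lines = [json.dumps(coerce(r, SCHEMA)) for r in rows]
```

Applying the declared types before serialising keeps chromosome a string in every JSON document, so the index does not see a mix of numbers and strings for the same field.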

Load `24.10_freeze1` version of the gentropy outputs #3567