Closed d0choa closed 1 month ago
I'm facing difficulties with the variant data, so I compared the schema for variantIndex to the previous data I had and noticed that chromosome is missing from the new data. I see it's all partitioned by chromosome, so that may be the reason?
new:
Schema({'variantId': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})
old:
Schema({'variantId': String, 'chromosome': String, 'position': Int32, 'referenceAllele': String, 'alternateAllele': String, 'inSilicoPredictors': List(Struct({'method': String, 'assessment': String, 'score': Float32, 'assessmentFlag': String, 'targetId': String})), 'mostSevereConsequenceId': String, 'transcriptConsequences': List(Struct({'variantFunctionalConsequenceIds': List(String), 'aminoAcidChange': String, 'uniprotAccessions': List(String), 'isEnsemblCanonical': Boolean, 'codons': String, 'distanceFromFootprint': Int64, 'distanceFromTss': Int64, 'appris': String, 'maneSelect': String, 'targetId': String, 'impact': String, 'lofteePrediction': String, 'siftPrediction': Float32, 'polyphenPrediction': Float32, 'consequenceScore': Float32, 'transcriptIndex': Int32, 'transcriptId': String})), 'rsIds': List(String), 'hgvsId': String, 'alleleFrequencies': List(Struct({'populationName': String, 'alleleFrequency': Float64})), 'dbXrefs': List(Struct({'id': String, 'source': String}))})
chromosome is used as a partitionBy column. If you read_parquet the whole variant_index directory, the column should be there. If you read each of the individual partitions, you will not see the column. If this is problematic, it could be done differently.
OK I see, I can enable that, but then the data type is inferred for that field, so for chromosomes this results in mixed types (numeric for the autosomes, string for the others). One option I can explore is passing the schema to the Parquet-to-JSON converter.
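A rough sketch of what passing a schema to the conversion step could look like, assuming rows arrive as plain dicts. The SCHEMA mapping, the to_json_line helper and the sample rows are made up for illustration and are not the actual converter. The point is only that an explicit type for chromosome removes the mixed int/string output.

```python
import json

# Hypothetical subset of the schema: field name -> Python type to coerce to.
SCHEMA = {"variantId": str, "chromosome": str, "position": int}


def to_json_line(row: dict, schema: dict = SCHEMA) -> str:
    """Serialise one row, coercing fields listed in the schema to their declared type."""
    coerced = {
        key: (schema[key](value) if key in schema and value is not None else value)
        for key, value in row.items()
    }
    return json.dumps(coerced)


# Made-up rows: chromosome was inferred as int for one and str for the other.
rows = [
    {"variantId": "1_1000_A_T", "chromosome": 1, "position": 1000},
    {"variantId": "X_2000_G_C", "chromosome": "X", "position": 2000},
]
lines = [to_json_line(row) for row in rows]
```

After coercion both rows serialise chromosome as a JSON string, so an index mapping built from the first document should not conflict with later ones.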
The genetics ETL has produced consistent outputs stored in gs://ot_orchestration/releases/24.10_freeze1. We want to load this data into OpenSearch and expose it through the API to provide a more realistic version of the final data and unblock additional BE/FE work.
The pipeline is currently producing the next set of outputs, and I'm flagging the ones that need to be loaded.
@DSuveges handcrafted the previously loaded datasets. This version of the data comes straight from the ETL in Parquet format.
A frozen description of all schemas can be found here.
The changes in the schemas are not significant compared to previous iterations, and we don't have many more planned changes. All the changes in this version that I can remember (sorry if I missed something):
- studyLocusId is now a string straight from the ETL. There is a mixture of integer and hexadecimal hashes now, but they will all be hexadecimal in the final version.
- studyType is a new column in credible sets
- confidence is a new column in credible sets
- sumStatQCPerformed and sumStatQCValues
A probably incomplete list of things that we know are pending in this dataset from a semantic perspective:
@jdhayhurst let us know if we need to clarify something