Describe the bug
VariantEffectPredictorParser outputs the variant_index table with an incorrect schema.
This bug was found while running the genetics_etl dag. During execution of the variant_index step, spark reported a schema mismatch between the gnomad annotations and the variant_index generated by the variant_annotation step.
The full parameter list used to run the variant_annotation and variant_index commands is provided in the dag configuration.
[!CAUTION]
The difference between the schemas is that the vep-based variant_index has array(array(struct(...))) rather than array(struct(...)) in the inSilicoPredictors field.
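The extra array level can be detected programmatically from the JSON form of a Spark schema, i.e. the dictionary returned by `df.schema.jsonValue()`. The following is a minimal plain-Python sketch, not gentropy code: the helper names `array_nesting_depth`, `array_of` and `schema_with` are hypothetical, and the two schemas are trimmed-down stand-ins for the real ones.

```python
from typing import Any


def array_nesting_depth(schema: dict[str, Any], field_name: str) -> int:
    """Count the array levels wrapping a top-level field in a Spark schema JSON dict."""
    for field in schema["fields"]:
        if field["name"] != field_name:
            continue
        depth, dtype = 0, field["type"]
        # Complex types are dicts with a "type" key; primitives are plain strings.
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            depth += 1
            dtype = dtype["elementType"]
        return depth
    raise KeyError(f"no top-level field named {field_name!r}")


def array_of(element: Any) -> dict[str, Any]:
    """Build the JSON form of an array type (illustrative helper)."""
    return {"type": "array", "elementType": element, "containsNull": True}


def schema_with(predictors_type: Any) -> dict[str, Any]:
    """Build a two-field stand-in schema (illustrative helper)."""
    return {
        "type": "struct",
        "fields": [
            {"name": "variantId", "type": "string", "nullable": True, "metadata": {}},
            {"name": "inSilicoPredictors", "type": predictors_type, "nullable": True, "metadata": {}},
        ],
    }


# Trimmed-down predictor struct; the real one has more fields.
predictor = {
    "type": "struct",
    "fields": [
        {"name": "method", "type": "string", "nullable": True, "metadata": {}},
        {"name": "score", "type": "float", "nullable": True, "metadata": {}},
    ],
}

vep_like = schema_with(array_of(array_of(predictor)))  # array(array(struct))
gnomad_like = schema_with(array_of(predictor))         # array(struct)

print(array_nesting_depth(vep_like, "inSilicoPredictors"))     # 2
print(array_nesting_depth(gnomad_like, "inSilicoPredictors"))  # 1
```

With real data frames the same function could be called on `variant_index.df.schema.jsonValue()`. One possible way to remove a single level of nesting in Spark is `pyspark.sql.functions.flatten`, although whether that is the workaround used in the notebook is not stated here, and its behaviour for null inner arrays should be checked first.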
The full analysis of the bug, with some examples of how to work around it, is in the dataproc notebook.
Expected behaviour
The schema of the vep-based variant index should match the variant_index schema.
To Reproduce
In a dataproc cluster:
from gentropy.common.session import Session
from gentropy.config import VariantIndexConfig
from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.ensembl.vep_parser import VariantEffectPredictorParser
from gentropy.datasource.open_targets.variants import OpenTargetsVariant
from pyspark.sql import functions as F
from pyspark.sql import types as T
session = Session()
vep_output_json_path = "gs://ot_orchestration/releases/24.09.19/variants/annotated_variants"
variant_index_path = "gs://ot_orchestration/releases/24.09.19/variant_index"
gnomad_variant_annotations_path = "gs://genetics_etl_python_playground/static_assets/gnomad_variants"
hash_threshold = 300
variant_index = VariantEffectPredictorParser.extract_variant_index_from_vep(
    session.spark, vep_output_json_path, hash_threshold
)
annotations = VariantIndex.from_parquet(
    session=session,
    path=gnomad_variant_annotations_path,
    recursiveFileLookup=True,
)
variant_index.df.printSchema()
annotations.df.printSchema()
Additional context
The schema mismatch was not caught by the tests because the data frame schema validation contains a bug that prevents such a check; see #3545.
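As an illustration of the kind of check that would have caught this mismatch, the sketch below compares two schemas structurally, ignoring nullability and field order, since the two schemas in this report also differ in those respects. It is an assumption-laden example and not the gentropy validation or the fix for #3545: the function names are hypothetical, it works on the dictionaries returned by `df.schema.jsonValue()`, and it does not descend into map types.

```python
from typing import Any, Iterator


def kind(dtype: Any) -> str:
    """Return the type name of a schema JSON node (dict for complex, str for primitive)."""
    return dtype["type"] if isinstance(dtype, dict) else dtype


def diff_types(left: Any, right: Any, path: str = "root") -> Iterator[str]:
    """Yield one message per structural difference, ignoring nullability and field order."""
    left_kind, right_kind = kind(left), kind(right)
    if left_kind != right_kind:
        yield f"{path}: {left_kind} != {right_kind}"
        return
    if left_kind == "array":
        # Compare element types; containsNull is deliberately ignored.
        yield from diff_types(left["elementType"], right["elementType"], f"{path}[]")
    elif left_kind == "struct":
        left_fields = {f["name"]: f["type"] for f in left["fields"]}
        right_fields = {f["name"]: f["type"] for f in right["fields"]}
        for name in sorted(left_fields.keys() | right_fields.keys()):
            if name not in left_fields or name not in right_fields:
                yield f"{path}.{name}: present in only one schema"
            else:
                yield from diff_types(left_fields[name], right_fields[name], f"{path}.{name}")


def field(name: str, dtype: Any, nullable: bool = True) -> dict[str, Any]:
    """Build the JSON form of a struct field (illustrative helper)."""
    return {"name": name, "type": dtype, "nullable": nullable, "metadata": {}}


def array_of(element: Any) -> dict[str, Any]:
    """Build the JSON form of an array type (illustrative helper)."""
    return {"type": "array", "elementType": element, "containsNull": True}


# Trimmed-down stand-ins for the two schemas in this report.
predictor = {"type": "struct", "fields": [field("method", "string"), field("score", "float")]}
vep_like = {
    "type": "struct",
    "fields": [
        field("variantId", "string", nullable=False),
        field("inSilicoPredictors", array_of(array_of(predictor)), nullable=False),
    ],
}
gnomad_like = {
    "type": "struct",
    "fields": [
        field("inSilicoPredictors", array_of(predictor)),
        field("variantId", "string"),
    ],
}

for message in diff_types(vep_like, gnomad_like):
    print(message)  # root.inSilicoPredictors[]: array != struct
```

Ignoring nullability and field order is a design choice for this sketch; a stricter check could compare those as well.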
The failed dataproc job - see here
Observed behaviour
The error thrown by spark
variant_index from vep_annotation step
```
root
 |-- variantId: string (nullable = false)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = false)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- method: string (nullable = true)
 |    |    |    |-- assessment: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- hgvsId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = false)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = false)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- alleleFrequencies: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
```
variant_index from gnomad annotations step
```
root
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = true)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hgvsId: string (nullable = true)
 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
```