opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal

https://platform.opentargets.org https://genetics.opentargets.org

VariantEffectPredictor outputs variant_index with incorrect schema #3546

Closed project-defiant closed 1 month ago

project-defiant commented 1 month ago

Describe the bug

VariantEffectPredictorParser outputs the variant_index table with an incorrect schema.

This bug was found while running the genetics_etl DAG. During execution of the variant_index step, Spark reported a schema mismatch between the gnomAD annotations and the variant_index generated by the variant_annotation step.

The full parameter list used to run the variant_annotation and variant_index commands is provided in the DAG configuration.

The failed Dataproc job: see here

Observed behaviour

The error thrown by Spark:

pyspark.sql.utils.AnalysisException: cannot resolve 'array_union(inSilicoPredictors, annotation_inSilicoPredictors)

variant_index from vep_annotation step

```
root
 |-- variantId: string (nullable = false)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = false)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- method: string (nullable = true)
 |    |    |    |-- assessment: string (nullable = true)
 |    |    |    |-- score: float (nullable = true)
 |    |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- hgvsId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = false)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = false)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- alleleFrequencies: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
```

variant_index from gnomad annotations step

```
root
 |-- variantId: string (nullable = true)
 |-- chromosome: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAllele: string (nullable = true)
 |-- inSilicoPredictors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |-- mostSevereConsequenceId: string (nullable = true)
 |-- transcriptConsequences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- variantFunctionalConsequenceIds: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- aminoAcidChange: string (nullable = true)
 |    |    |-- uniprotAccessions: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- isEnsemblCanonical: boolean (nullable = true)
 |    |    |-- codons: string (nullable = true)
 |    |    |-- distanceFromFootprint: long (nullable = true)
 |    |    |-- distanceFromTss: long (nullable = true)
 |    |    |-- appris: string (nullable = true)
 |    |    |-- maneSelect: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- impact: string (nullable = true)
 |    |    |-- lofteePrediction: string (nullable = true)
 |    |    |-- siftPrediction: float (nullable = true)
 |    |    |-- polyphenPrediction: float (nullable = true)
 |    |    |-- consequenceScore: float (nullable = true)
 |    |    |-- transcriptIndex: integer (nullable = true)
 |    |    |-- transcriptId: string (nullable = true)
 |-- rsIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hgvsId: string (nullable = true)
 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- populationName: string (nullable = true)
 |    |    |-- alleleFrequency: double (nullable = true)
 |-- dbXrefs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
```

[!CAUTION] The difference between the schemas is that the VEP-based variant_index has array(array(struct(...))) rather than array(struct(...)) in the inSilicoPredictors field.
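Spotting this by eye in two long `printSchema()` dumps is error-prone, because the schemas also differ in nullability flags and field order. A minimal sketch of a structural diff follows. It works on the JSON form of a schema (the dict returned by `StructType.jsonValue()`), so it runs without a Spark session. The helper names `describe`, `diff_types`, `array_of` and `schema_with` are hypothetical and not part of gentropy, and the two sample schemas are trimmed-down stand-ins for the real ones.

```python
from typing import Any, Iterator


def describe(dtype: Any) -> str:
    """Short kind of a Spark JSON type: primitives are strings, complex types are dicts."""
    return dtype if isinstance(dtype, str) else dtype["type"]


def diff_types(left: Any, right: Any, path: str = "root") -> Iterator[str]:
    """Yield structural differences, ignoring nullability flags and field order."""
    kind_l, kind_r = describe(left), describe(right)
    if kind_l != kind_r:
        yield f"{path}: {kind_l} != {kind_r}"
    elif kind_l == "array":
        yield from diff_types(left["elementType"], right["elementType"], f"{path}[]")
    elif kind_l == "struct":
        fields_l = {f["name"]: f["type"] for f in left["fields"]}
        fields_r = {f["name"]: f["type"] for f in right["fields"]}
        for name in sorted(fields_l.keys() | fields_r.keys()):
            if name not in fields_l:
                yield f"{path}.{name}: missing on the left"
            elif name not in fields_r:
                yield f"{path}.{name}: missing on the right"
            else:
                yield from diff_types(fields_l[name], fields_r[name], f"{path}.{name}")
    # Map types are not descended into in this sketch.


def array_of(element: Any) -> dict:
    return {"type": "array", "elementType": element, "containsNull": True}


def schema_with(predictors_type: Any, nullable: bool) -> dict:
    return {"type": "struct", "fields": [
        {"name": "variantId", "type": "string", "nullable": nullable, "metadata": {}},
        {"name": "inSilicoPredictors", "type": predictors_type,
         "nullable": nullable, "metadata": {}},
    ]}


# Trimmed-down stand-in for the predictor struct (only two of its fields).
predictor = {"type": "struct", "fields": [
    {"name": "method", "type": "string", "nullable": True, "metadata": {}},
    {"name": "score", "type": "float", "nullable": True, "metadata": {}},
]}

vep_schema = schema_with(array_of(array_of(predictor)), nullable=False)
gnomad_schema = schema_with(array_of(predictor), nullable=True)

differences = list(diff_types(vep_schema, gnomad_schema))
print(differences)  # ['root.inSilicoPredictors[]: array != struct']
```

With real DataFrames, the same function could be fed `variant_index.df.schema.jsonValue()` and `annotations.df.schema.jsonValue()`.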

The full analysis of the bug, with an example of how to work around it, is in the Dataproc notebook.

Expected behaviour

The schema of the VEP-based variant index should match the variant_index schema.

To Reproduce

In a Dataproc cluster:

from gentropy.common.session import Session
from gentropy.config import VariantIndexConfig
from gentropy.dataset.variant_index import VariantIndex
from gentropy.datasource.ensembl.vep_parser import VariantEffectPredictorParser
from gentropy.datasource.open_targets.variants import OpenTargetsVariant
from pyspark.sql import functions as F
from pyspark.sql import types as T

session = Session()
vep_output_json_path = "gs://ot_orchestration/releases/24.09.19/variants/annotated_variants"
variant_index_path = "gs://ot_orchestration/releases/24.09.19/variant_index"
gnomad_variant_annotations_path = "gs://genetics_etl_python_playground/static_assets/gnomad_variants"
hash_threshold = 300

# Build the variant index from the VEP output.
variant_index = VariantEffectPredictorParser.extract_variant_index_from_vep(
    session.spark, vep_output_json_path, hash_threshold
)
# Load the gnomAD annotations as a VariantIndex dataset.
annotations = VariantIndex.from_parquet(
    session=session,
    path=gnomad_variant_annotations_path,
    recursiveFileLookup=True,
)
# Compare the inSilicoPredictors field in the two printed schemas.
variant_index.df.printSchema()
annotations.df.printSchema()

Additional context

The schema mismatch was not reported during tests because the validation of the DataFrame schema contains a bug that prevents such a check; see #3545.

project-defiant commented 1 month ago

@DSuveges FYI