konradjk / loftee

MIT License
174 stars 55 forks source link

Loss of END_TRUNC in GRCh38 vs GRCh37 due to change in GERP cutoff from +180 to -58 #105

Open daniel-wells opened 5 months ago

daniel-wells commented 5 months ago

In the GRCh37 branch the GERP cutoff is +180 but in the GRCh38 version it was changed to -58: https://github.com/konradjk/loftee/commit/681c5072992fb8c0b6a12c3f78ed55b95fb7e125

This means that variants that are annotated as LC END_TRUNC in LOFTEE GRCh37 change to being HC LoF in GRCh38, even if they're very close to the end of the protein.

An example variant in GRCh37 that's at the very end of the protein gets annotated as LC END_TRUNC (seems sensible):

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
13      32972904        .       CTA     C       .       PASS    cpra=13-32972904-CTA-C
#input_variant_cpra     Consequence     IMPACT  SYMBOL  Gene    Feature BIOTYPE EXON    HGVSc   HGVSp   CANONICAL       SIFT    PolyPhen        DOMAINS LoF     LoF_filter      LoF_flags
13-32972904-CTA-C       frameshift_variant,stop_lost    HIGH    BRCA2   ENSG00000139618 ENST00000380152 protein_coding  -       ENST00000380152.3:c.10255_10256del      ENSP00000369497.3:p.Ter3419SerfsTer18   -       -       -       -       LC      END_TRUNC       -
13-32972904-CTA-C       frameshift_variant,stop_lost    HIGH    BRCA2   ENSG00000139618 ENST00000544455 protein_coding  -       ENST00000544455.1:c.10255_10256del      ENSP00000439902.1:p.Ter3419SerfsTer18   YES     -       -       -       LC      END_TRUNC       -

But the same variant in GRCh38 is annotated as HC LoF:

##fileformat=VCFv4.0
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
13      32398767        .       CTA     C       .       PASS    cpra=13-32398767-CTA
#input_variant_cpra     Consequence     IMPACT  SYMBOL  Gene    Feature BIOTYPE EXON    HGVSc   HGVSp   CANONICAL       SIFT    PolyPhen        DOMAINS LoF     LoF_filter      LoF_flags
13-32398767-CTA-C       frameshift_variant,stop_lost    HIGH    BRCA2   ENSG00000139618 ENST00000380152 protein_coding  -       ENST00000380152.8:c.10255_10256del      ENSP00000369497.3:p.Ter3419SerfsTer18   YES     -       -       -       HC      -       PHYLOCSF_WEAK
13-32398767-CTA-C       frameshift_variant,stop_lost    HIGH    BRCA2   ENSG00000139618 ENST00000544455 protein_coding  -       ENST00000544455.6:c.10255_10256del      ENSP00000439902.1:p.Ter3419SerfsTer18   -       -       -       -       HC      -       PHYLOCSF_WEAK

I can't find exactly how either of the two GERP cutoff values were derived. The GERP scores from the two versions of the files don't look that drastically different.

image image

Is -58 really the correct value for gerp_end_trunc_cutoff in GRCh38? How was it derived?

christopherdkoch commented 3 months ago

If my understanding is correct, I don't think a negative threshold could ever make sense, since what's being considered is the cumulative GERP score. A variant at the very last position will never have a sufficiently negative GERP score to be considered "END_TRUNC". BRCA2 which you show is a good example. In gnomAD v4 (GRCh38), the last two LoF variants are not considered END_TRUNC, but as you move further from the end of the transcript we see some END_TRUNC variants. https://gnomad.broadinstitute.org/gene/ENSG00000139618?dataset=gnomad_r4