nf-cmgg / structural

A bioinformatics best-practice analysis pipeline for calling structural variants (SVs), copy number variants (CNVs) and repeat region expansions (RREs) from short DNA reads
https://nf-cmgg.github.io/structural/
MIT License
18 stars 3 forks source link

include gnomADv4 SV annotation #66

Open mvheetve opened 9 months ago

mvheetve commented 9 months ago

gnomAD v4 SV fields for annotation

I went over the gnomAD v4 SV header and selected a few fields that seem relevant for the annotation step. Here is an overview, we can discuss during the meeting next week.

  1. All info fields can be selected for the whole dataset, or for subsets (from gnomADv3), using these prefixes:

    • controls_and_biobanks_ only samples collected specifically as controls for disease studies, or samples belonging to biobanks (e.g. BioMe, Genizon) or general population studies (e.g., 1000 Genomes, HGDP, PAGE)
    • non_neuro_ only samples that were not collected as part of a neurologic or psychiatric case/control study, or samples collected as part of a neurologic or psychiatric case/control study but designated as controls
  2. All info fields can be selected for the whole dataset, or for populations (from gnomADv3), using these prefixes:

    • afr_ African/African American
    • ami_ Amish
    • amr_ Latino/Admixed American
    • asj_ Ashkenazi Jewish
    • eas_ East Asian
    • fin_ Finnish
    • nfe_ Non-Finnish European
    • mid_ Middle Eastern
    • sas_ South Asian
    • oth_ Other (population not assigned)

for 1 and 2: I don't expect us to differentiate at this point, but I'm putting it out there for any future implementations.

  1. For annotation :
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate">    
##INFO=<ID=POS2,Number=1,Type=Integer,Description="Start position of the structural variant on CHR2">
##INFO=<ID=END2,Number=1,Type=Integer,Description="End position of the structural variant on CHR2">

The following seem interesting to me, but should maybe be evaluated first for relevance and performance:

##INFO=<ID=PREDICTED_BREAKEND_EXONIC,Number=.,Type=String,Description="Gene(s) for which the SV breakend is predicted to fall in an exon.">
##INFO=<ID=PREDICTED_COPY_GAIN,Number=.,Type=String,Description="Gene(s) on which the SV is predicted to have a copy-gain effect.">
##INFO=<ID=PREDICTED_DUP_PARTIAL,Number=.,Type=String,Description="Gene(s) which are partially overlapped by an SV's duplication, but the transcription start site is not duplicated.">
##INFO=<ID=PREDICTED_INTERGENIC,Number=0,Type=Flag,Description="SV does not overlap any protein-coding genes.">
##INFO=<ID=PREDICTED_INTRAGENIC_EXON_DUP,Number=.,Type=String,Description="Gene(s) on which the SV is predicted to result in intragenic exonic duplication without breaking any coding sequences.">
##INFO=<ID=PREDICTED_INTRONIC,Number=.,Type=String,Description="Gene(s) where the SV was found to lie entirely within an intron.">
##INFO=<ID=PREDICTED_INV_SPAN,Number=.,Type=String,Description="Gene(s) which are entirely spanned by an SV's inversion.">
##INFO=<ID=PREDICTED_LOF,Number=.,Type=String,Description="Gene(s) on which the SV is predicted to have a loss-of-function effect.">
##INFO=<ID=PREDICTED_MSV_EXON_OVERLAP,Number=.,Type=String,Description="Gene(s) on which the multiallelic SV would be predicted to have a LOF, INTRAGENIC_EXON_DUP, COPY_GAIN, DUP_PARTIAL, TSS_DUP, or PARTIAL_EXON_DUP annotation if the SV were biallelic.">
##INFO=<ID=PREDICTED_NEAREST_TSS,Number=.,Type=String,Description="Nearest transcription start site to an intergenic variant.">
##INFO=<ID=PREDICTED_PARTIAL_EXON_DUP,Number=.,Type=String,Description="Gene(s) where the duplication SV has one breakpoint in the coding sequence.">
##INFO=<ID=PREDICTED_PROMOTER,Number=.,Type=String,Description="Gene(s) for which the SV is predicted to overlap the promoter region.">
##INFO=<ID=PREDICTED_TSS_DUP,Number=.,Type=String,Description="Gene(s) for which the SV is predicted to duplicate the transcription start site.">
##INFO=<ID=PREDICTED_UTR,Number=.,Type=String,Description="Gene(s) for which the SV is predicted to disrupt a UTR.">
##INFO=<ID=SOURCE,Number=1,Type=String,Description="Source of inserted sequence.">
##INFO=<ID=STRANDS,Number=1,Type=String,Description="Breakpoint strandedness [++,+-,-+,--]">

To be continued...