Addition of variants not found in original VCF

Hello,

Thank you for this tool, it's been a great help for analysing my data.

For a few months I have been running the BAM files I generate from WGS through the lumpySV caller and then annotating the variants with AnnotSV 3.1 and 2.2, without problems. I use both versions because 2.2 provides DGV scores and 3.1 provides the distance from the splice site, each among other filters that are important for my analysis.

Recently, however, I noticed that after the annotation process, variants that were not called in the VCF file were present in the annotated AnnotSV 2.2 file. One of these variants is real and not a sequencing artefact, based on visual confirmation in IGV. In the original run I was able to see that variant called with a high quality score and high VAF, even though it wasn't in the VCF file. When I reran the SV caller and the annotation, only a low quality/low variant allele frequency version of the variant appears (this one is also present in the original file), and it gets filtered out by my additional analysis pipeline. The high quality/high VAF version is no longer called, and I have not been able to replicate the original file. Other variants that appeared in the original run have also not been replicated, although a few of those were sequencing artefacts.

I did not change the script between the runs. This has only happened for a few patients; most run normally, and I have been able to replicate the results for most patients.

I would appreciate any assistance with this matter. I am happy to provide any additional clarification if required.

Below is the AnnotSV log output for the original run and then for the rerun.

Original:

  AnnotSV 2.2
  Copyright (C) 2017-2019 GEOFFROY Veronique

  Please feel free to contact me for any suggestions or bug reports
  email: veronique.geoffroy@inserm.fr

  Tcl/Tk version: 8.6

  Application name used (defined with the "ANNOTSV" environment variable):
  /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/

  ...downloading the configuration data (March 14 2023 - 15:37)
      ...configuration data by default
      ...configuration data from /scratch/<folder>/<folder>AnnotSV_2.2/AnnotSV_2.2//etc/AnnotSV/configfile
      ...configuration data given in arguments
      ...checking configuration data and files

  WARNING: No GeneHancer annotations available.
  (Please, see in the README file how to add these annotations. Users need to contact the GeneCards team.)

      ******************************************
      AnnotSV has been run with these arguments:
      ******************************************
      -SVinputFile /scratch/<folder>/<folder>/pID/Variants1/pID.lumpySVtyper.vcf
      -SVinputInfo 1
      -SVminSize 50
      -bedtools bedtools
      -candidateGenesFiltering no
      -genomeBuild GRCh38
      -metrics us
      -minTotalNumber 500
      -organism Human
      -outputDir /scratch/<folder>/<folder>/<pID>
      -outputFile pID_2.2.annotated.tsv
      -overlap 70
      -overwrite yes
      -promoterSize 500
      -rankFiltering 1 2 3 4 5
      -reciprocal no
      -svtBEDcol -1
      -vcfPASS 0
      ******************************************

  ...listing of the annotations to realized (March 14 2023 - 15:38)
      ...refGene annotation
      (with /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/share/doc/AnnotSV/Annotations_Human/RefGene/GRCh38/refGene.sorted.bed)
      ...Genes-based annotations
          ...20181211_ACMG.tsv
          (59 gene identifiers and 1 annotations columns: ACMG)
          ...20181217_ClinGenAnnotations.tsv.gz
          (1362 gene identifiers and 2 annotations columns: HI_CGscore, TriS_CGscore)
          ...20181210_DDG2P.tsv.gz
          (1788 gene identifiers and 5 annotations columns: DDD_status, DDD_mode, DDD_consequence, DDD_disease, DDD_pmids)
          ...20181218_HI.tsv.gz
          (19124 gene identifiers and 1 annotations columns: HI_DDDpercent)
          ...20181217_GeneIntolerance.pLI-Zscore.annotations.tsv.gz
          (18241 gene identifiers and 3 annotations columns: synZ_ExAC, misZ_ExAC, pLI_ExAC)
          ...20181218_ExAC.CNV-Zscore.annotations.tsv.gz
          (15673 gene identifiers and 3 annotations columns: delZ_ExAC, dupZ_ExAC, cnvZ_ExAC)
          ...20181126_morbidGenes.tsv.gz
          (10586 gene identifiers and 1 annotations columns: morbidGenes)
          ...20181126_morbidGenesCandidates.tsv.gz
          (3040 gene identifiers and 1 annotations columns: morbidGenesCandidates)
          ...20181210_OMIMannotations.tsv.gz
          (13736 gene identifiers and 3 annotations columns: Mim Number, Phenotypes, Inheritance)
      ...Annotations with features overlapping the SV
          ...DGV Gold Standard frequency annotation
          ...1000g frequency annotation
          ...Ira M. Hall's lab frequency annotation
      ...Annotations with features overlapped with the SV
          ...Promoters annotation
          ...dbVar_pathogenic_NR_SV annotation
          ...TAD annotation
      ...Breakpoints annotations
          ...GC content annotation
          ...Repeat annotation

  ...annotation in progress (March 14 2023 - 15:38)
  -- GCcontentAnnotation, nuc --
  bedtools nuc -fi /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/share/doc/AnnotSV/Annotations_Human/BreakpointsAnnotations/GCcontent/GRCh38/GRCh38_chromFa.fasta -bed /scratch/<folder>/<folder>/pID/20230314_AnnotSV_inputSVfile.formatted.sorted.breakpoints.bed > /scratch/<folder>/<folder>/pID/20230314_AnnotSV_inputSVfile.formatted.sorted.GCcontent.txt
  Feature (M:16468-16668) beyond the length of M size (16569 bp).  Skipping.
  Feature (M:16469-16669) beyond the length of M size (16569 bp).  Skipping.
  Feature (M:16470-16670) beyond the length of M size (16569 bp).  Skipping.

  ...Output columns annotation:
      AnnotSV ID; SV chrom; SV start; SV end; SV length; SV type; ID; REF; ALT; QUAL; FILTER; INFO; FORMAT; 450_175-1; AnnotSV type; Gene name; NM; CDS length; tx length; location; location2; intersectStart; intersectEnd; DGV_GAIN_IDs; DGV_GAIN_n_samples_with_SV; DGV_GAIN_n_samples_tested; DGV_GAIN_Frequency; DGV_LOSS_IDs; DGV_LOSS_n_samples_with_SV; DGV_LOSS_n_samples_tested; DGV_LOSS_Frequency; 1000g_event; 1000g_AF; 1000g_max_AF; IMH_ID; IMH_AF; IMH_ID_others; promoters; dbVar_event; dbVar_variant; dbVar_status; TADcoordinates; ENCODEexperiments; GCcontent_left; GCcontent_right; Repeats_coord_left; Repeats_type_left; Repeats_coord_right; Repeats_type_right; ACMG; HI_CGscore; TriS_CGscore; DDD_status; DDD_mode; DDD_consequence; DDD_disease; DDD_pmids; HI_DDDpercent; synZ_ExAC; misZ_ExAC; pLI_ExAC; delZ_ExAC; dupZ_ExAC; cnvZ_ExAC; morbidGenes; morbidGenesCandidates; Mim Number; Phenotypes; Inheritance; AnnotSV ranking

  ...AnnotSV is done with the analysis (March 14 2023 - 15:46)

Rerun:

AnnotSV 2.2

Copyright (C) 2017-2019 GEOFFROY Veronique

Please feel free to contact me for any suggestions or bug reports
email: veronique.geoffroy@inserm.fr

Tcl/Tk version: 8.6

Application name used (defined with the "ANNOTSV" environment variable):
/scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/

...downloading the configuration data (March 26 2023 - 21:38)
    ...configuration data by default
    ...configuration data from /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2//etc/AnnotSV/configfile
    ...configuration data given in arguments
    ...checking configuration data and files

WARNING: No GeneHancer annotations available.
(Please, see in the README file how to add these annotations. Users need to contact the GeneCards team.)

    ******************************************
    AnnotSV has been run with these arguments:
    ******************************************
    -SVinputFile /scratch/<folder>/<folder>/pID-redone/Variants1/pID.lumpySVtyper.vcf
    -SVinputInfo 1
    -SVminSize 50
    -bedtools bedtools
    -candidateGenesFiltering no
    -genomeBuild GRCh38
    -metrics us
    -minTotalNumber 500
    -organism Human
    -outputDir /scratch/<folder>/<folder>/pID-redone
    -outputFile pID_2.2.annotated.tsv
    -overlap 70
    -overwrite yes
    -promoterSize 500
    -rankFiltering 1 2 3 4 5
    -reciprocal no
    -svtBEDcol -1
    -vcfPASS 0
    ******************************************

...listing of the annotations to realized (March 26 2023 - 21:39)
    ...refGene annotation
    (with /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/share/doc/AnnotSV/Annotations_Human/RefGene/GRCh38/refGene.sorted.bed)
    ...Genes-based annotations
        ...20181211_ACMG.tsv
        (59 gene identifiers and 1 annotations columns: ACMG)
        ...20181217_ClinGenAnnotations.tsv.gz
        (1362 gene identifiers and 2 annotations columns: HI_CGscore, TriS_CGscore)
        ...20181210_DDG2P.tsv.gz
        (1788 gene identifiers and 5 annotations columns: DDD_status, DDD_mode, DDD_consequence, DDD_disease, DDD_pmids)
        ...20181218_HI.tsv.gz
        (19124 gene identifiers and 1 annotations columns: HI_DDDpercent)
        ...20181217_GeneIntolerance.pLI-Zscore.annotations.tsv.gz
        (18241 gene identifiers and 3 annotations columns: synZ_ExAC, misZ_ExAC, pLI_ExAC)
        ...20181218_ExAC.CNV-Zscore.annotations.tsv.gz
        (15673 gene identifiers and 3 annotations columns: delZ_ExAC, dupZ_ExAC, cnvZ_ExAC)
        ...20181126_morbidGenes.tsv.gz
        (10586 gene identifiers and 1 annotations columns: morbidGenes)
        ...20181126_morbidGenesCandidates.tsv.gz
        (3040 gene identifiers and 1 annotations columns: morbidGenesCandidates)
        ...20181210_OMIMannotations.tsv.gz
        (13736 gene identifiers and 3 annotations columns: Mim Number, Phenotypes, Inheritance)
    ...Annotations with features overlapping the SV
        ...DGV Gold Standard frequency annotation
        ...1000g frequency annotation
        ...Ira M. Hall's lab frequency annotation
    ...Annotations with features overlapped with the SV
        ...Promoters annotation
        ...dbVar_pathogenic_NR_SV annotation
        ...TAD annotation
    ...Breakpoints annotations
        ...GC content annotation
        ...Repeat annotation

...annotation in progress (March 26 2023 - 21:39)
-- GCcontentAnnotation, nuc --
bedtools nuc -fi /scratch/<folder>/<folder>/AnnotSV_2.2/AnnotSV_2.2/share/doc/AnnotSV/Annotations_Human/BreakpointsAnnotations/GCcontent/GRCh38/GRCh38_chromFa.fasta -bed /scratch/<folder>/<folder>/pID-redone/20230326_AnnotSV_inputSVfile.formatted.sorted.breakpoints.bed > /scratch/<folder>/<folder>/pID-redone/20230326_AnnotSV_inputSVfile.formatted.sorted.GCcontent.txt
Feature (M:16468-16668) beyond the length of M size (16569 bp).  Skipping.
Feature (M:16469-16669) beyond the length of M size (16569 bp).  Skipping.
Feature (M:16470-16670) beyond the length of M size (16569 bp).  Skipping.

...Output columns annotation:
    AnnotSV ID; SV chrom; SV start; SV end; SV length; SV type; ID; REF; ALT; QUAL; FILTER; INFO; FORMAT; 450_175-1; AnnotSV type; Gene name; NM; CDS length; tx length; location; location2; intersectStart; intersectEnd; DGV_GAIN_IDs; DGV_GAIN_n_samples_with_SV; DGV_GAIN_n_samples_tested; DGV_GAIN_Frequency; DGV_LOSS_IDs; DGV_LOSS_n_samples_with_SV; DGV_LOSS_n_samples_tested; DGV_LOSS_Frequency; 1000g_event; 1000g_AF; 1000g_max_AF; IMH_ID; IMH_AF; IMH_ID_others; promoters; dbVar_event; dbVar_variant; dbVar_status; TADcoordinates; ENCODEexperiments; GCcontent_left; GCcontent_right; Repeats_coord_left; Repeats_type_left; Repeats_coord_right; Repeats_type_right; ACMG; HI_CGscore; TriS_CGscore; DDD_status; DDD_mode; DDD_consequence; DDD_disease; DDD_pmids; HI_DDDpercent; synZ_ExAC; misZ_ExAC; pLI_ExAC; delZ_ExAC; dupZ_ExAC; cnvZ_ExAC; morbidGenes; morbidGenesCandidates; Mim Number; Phenotypes; Inheritance; AnnotSV ranking

...AnnotSV is done with the analysis (March 26 2023 - 21:44)

Thank you, Safaa
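A quick way to rule out a configuration difference between the two runs is to compare the argument blocks of the two logs. The following is a minimal sketch, not part of AnnotSV: the function names `parse_annotsv_args` and `diff_args` are hypothetical, and the parser assumes the log layout shown above (a header line followed by `-option value` lines between two rows of asterisks).

```python
import re


def parse_annotsv_args(log_text):
    """Collect the '-option value' pairs from the arguments block of an AnnotSV log."""
    args = {}
    in_block = False
    star_rows = 0
    for line in log_text.splitlines():
        stripped = line.strip()
        if "AnnotSV has been run with these arguments" in stripped:
            in_block = True
            continue
        if not in_block:
            continue
        if stripped.startswith("*"):
            star_rows += 1
            if star_rows == 2:  # second row of asterisks closes the block
                break
            continue
        match = re.match(r"-(\S+)\s+(.*)", stripped)
        if match:
            args[match.group(1)] = match.group(2)
    return args


def diff_args(first, second):
    """Return {option: (value_in_first, value_in_second)} for options that differ."""
    return {
        key: (first.get(key), second.get(key))
        for key in sorted(set(first) | set(second))
        if first.get(key) != second.get(key)
    }
```

Reading the two logs above by eye gives the same answer this sketch would: the runs differ only in `-SVinputFile` and `-outputDir` (and in the timestamps), so the AnnotSV settings themselves were identical.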

Hi Safaa,

Thank you for your interest in AnnotSV.

> recently I noticed that after the annotation process, variants that were not called in the VCF file, were present in the annotated AnnotSV2.2 file.

I'm quite surprised, because AnnotSV is an annotator and cannot call any SV. I believe the SV is in your VCF but you don't recognize it. My first thought was that AnnotSV expands the "start" and "end" SV positions with the VCF confidence intervals (CIPOS, CIEND) around the breakpoints, but it only does so from v3.0 onwards, so that cannot explain what you see with v2.2. Can you provide me with an example VCF input file, so that I can take a quick look at it?

Best regards, Véronique

Hi Véronique,

Thank you for your quick response!

I was surprised as well because, as you mentioned, it is not a caller. I have searched the VCF by quality score, read counts and variant location, but for the specific example (EXT1) I have only found the low quality score variant.

I have attached the VCF input files, both the original (where I can see the variant) and the redone one (whose annotated file does not have the variant). Both VCFs have the same number of lines/variants. I have also attached the two AnnotSV 2.2 annotated files in case that helps; you will notice that there are a lot more variants in the original annotated file than in the redone one.

An example from the annotated file is the EXT1 SV (77 bp del), which is real but is not being called again, other than as a low quality score variant (which, as I mentioned before, can also be seen in the original file).

I have converted them all to txt files to be able to upload them; please let me know if they open in a strange format.

Thank you.

Kind regards, Safaa

original.lumpySVtyper.txt redone-2.2annotated.txt redone.check.lumpySVtyper.txt original_2.2annotated.txt

> In terms of searching the VCF, I have searched for quality score, reads and the variant location, however have only found the low quality score variant for the specific example (EXT1).

Sorry but I don't understand. Nothing to do with AnnotSV, correct? You explain that you filter your VCF before providing it to AnnotSV, right?

> I have attached the VCF input files, both the original (where in that file I can see the variant)
>
> An example from the annotated is EXT1 SV (77 bp del)

You have more than 20000 variants in your original.lumpySVtyper.txt file... Please, send me an easy VCF input example to debug/understand. I need a VCF input file with a single SV, annotated with v2.2 and not annotated with v3.3.

Not quite: I don't filter the VCF at all. I provide it to AnnotSV directly after the file is produced. After the annotation, I went back to the original VCF to see if I could find the missing variants that appeared to be annotated by AnnotSV 2.2, and I could not.

The "original.lumpySVtyper.txt" file is actually the VCF file (I changed the extension to txt so that I could upload it, as the VCF file type is not supported). I have tried to upload the VCF file here again, but it is not being accepted. Is there a different way to send it to you? The VCF file is the one that I provide to AnnotSV directly, as it is.

Just to clarify, I annotate with versions 2.2 and 3.1 separately (the same VCF file is run through both versions at different times).

Thank you.

Ok, I got that. No problem, perfect with the txt file extension.

But you explain that you are missing annotated variants with v3.1. Why did you say that you are missing variants between v2.2 and v3.1?

Did you find a variant example? If yes, can you give me such a variant (the corresponding line in "original.lumpySVtyper.txt")? There is no need for me to run a VCF with 20000 variants; I just want to test with a VCF file containing a problematic variant.
Otherwise, did you have a look at the unannotated output file from v3.1?

Best, Véronique

So there are some variants that are being annotated on v2.2 and not on v3.1 and vice versa. For example this variant:

8_118111672_118111749_DEL | 8 | 118111672 | 118111749 | -77 | DEL | 2750 | N | <DEL> | 274.52 | . | SVTYPE=DEL;SVLEN=-77;END=118111749;STRANDS=+-:14;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=14;PE=0;SR=14 | GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB | 0/1:14:0:14:146.76:274.52:-34,-7,-58:117:85:32:85:31:12:27:73:4:0.27 | split | EXT1 | NM_000127 | 0 | 78 | exon1-exon1 | 5'UTR | 118111672 | 118111749 |   | 0 | 0 | -1 |   | 0 | 0 | -1 |   | -1 | -1 |   | -1 | 8:19666270-133891475_INV;8:71454834-121935254_INV;8:74949084-131777713_INV;8:108282950-121900292_INV |   |   |   |   |   | 3 | 0 | confirmed/confirmed | monoallelic/monoallelic | loss of function/loss of function | HEREDITARY MULTIPLE EXOSTOSES TYPE 1/TRICHO-RHINO-PHALANGEAL SYNDROME TYPE 2 | 7550340;8981950;15253765;9326317/ | 1.49 | 0.43109281 | 2.5085447 | 0.99842455 | 0.88863009 | 0.72999618 | 0.99224326 | yes |   | 608177 | Chondrosarcoma, 215300 (3)/ Exostoses, multiple, type 1, 133700 (3) | AR/AD | 4
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --

is found annotated in the v2.2 file but not in the v3.1 file (from the first time I ran it), and I can't find it in the VCF file either (I also searched by length and quality score). In the original VCF file, line 8416 has the lower quality score version of this variant, which is annotated in both v2.2 and v3.1. It is the exact same variant as above, although it is called at a lower quality for some reason. I guess part of the question is: how can there be a different, higher quality version of this exact variant in the v2.2 output (and not in v3.1) when it is not called in the VCF file (but is a real variant based on IGV)?

Line 8416: `chr8	118111672	3197	N	<DEL>	25.6	.	SVTYPE=DEL;SVLEN=-77;END=118111749;STRANDS=+-:11;CIPOS=-3,2;CIEND=-3,2;CIPOS95=0,0;CIEND95=0,0;SU=11;PE=0;SR=11	GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB	0/1:11:0:11:25.60:25.60:-23,-21,-116:176:147:28:146:27:32:21:114:6:0.16`

I did have a look at the unannotated file from v3.1 but couldn't find the variant there. As I originally mentioned (and this was the source of the problem, I guess), I haven't been able to replicate this result and obtain the higher quality variant.

Kind regards, Safaa
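Searching a 20000-line VCF by quality score or length is easy to get wrong (for example `25.6` versus `25.60`), so it can help to list every record at the locus regardless of its score. The sketch below is an illustration only: `parse_info`, `strip_chr` and `records_at` are hypothetical helper names, and it assumes a tab-separated VCF whose INFO field carries `END` and `SU`, as in the lumpy record quoted above.

```python
def parse_info(info_field):
    """Turn 'SVTYPE=DEL;END=123' into {'SVTYPE': 'DEL', 'END': '123'}."""
    parsed = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def strip_chr(chrom):
    """Make 'chr8' (VCF style) comparable with '8' (AnnotSV output style)."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def records_at(vcf_lines, chrom, pos, end):
    """Return ID, QUAL and SU of every VCF record with this CHROM, POS and INFO END."""
    hits = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        info = parse_info(fields[7])
        same_locus = (
            strip_chr(fields[0]) == strip_chr(chrom)
            and fields[1] == str(pos)
            and info.get("END") == str(end)
        )
        if same_locus:
            hits.append({"id": fields[2], "qual": fields[5], "su": info.get("SU")})
    return hits
```

Passing an open file handle of the input VCF together with `"8", 118111672, 118111749` would show whether a record with ID 2750 exists next to record 3197, independently of how the quality score is formatted.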

OK, let's have a look at your variant (8_118111672_118111749_DEL) in your original.lumpySVtyper.txt input file:

chr8    118111672       3197    N       <DEL>   25.60   .       SVTYPE=DEL;SVLEN=-77;END=118111749;STRANDS=+-:11;CIPOS=-3,2;CIEND=-3,2;CIPOS95=0,0;CIEND95=0,0;SU=11;PE=0;SR=11 GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB      0/1:11:0:11:25.60:25.60:-23,-21,-116:176:147:28:146:27:32:21:114:6:0.16

I created a new VCF input file, with only this variant: test.1variant.txt You can use it to test.

With AnnotSV v3.1, the variant is annotated. But I think I understand why you don't seem to see it in the output...

Here is the v3.1 annotated variant: 8_118111669_118111751_DEL

As you can see, the coordinates are not exactly the same. This is because of the CIPOS and CIEND features (the confidence intervals around the breakpoints): CIPOS=-3,2;CIEND=-3,2. If you want to keep the input coordinates, please set -includeCI to 0.

Does this resolve your understanding of the "missing" variants?
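The coordinate shift can be reproduced by hand. The sketch below is inferred only from the single example in this thread (start moved by the lower CIPOS bound, end moved by the upper CIEND bound); it is not taken from the AnnotSV source, and the function names and the `include_ci` parameter are hypothetical stand-ins for the `-includeCI` behaviour.

```python
def expand_with_ci(pos, end, cipos, ciend):
    """Widen an SV interval: start by the lower CIPOS bound, end by the upper CIEND bound."""
    return pos + cipos[0], end + ciend[1]


def annotsv_style_id(chrom, pos, end, svtype, cipos=(0, 0), ciend=(0, 0), include_ci=True):
    """Build a 'chrom_start_end_type' identifier, optionally using the widened interval."""
    if include_ci:
        pos, end = expand_with_ci(pos, end, cipos, ciend)
    return f"{chrom}_{pos}_{end}_{svtype}"


# Record 3197 from the VCF: POS=118111672, END=118111749, CIPOS=-3,2, CIEND=-3,2
print(annotsv_style_id("8", 118111672, 118111749, "DEL", (-3, 2), (-3, 2)))
# prints 8_118111669_118111751_DEL
print(annotsv_style_id("8", 118111672, 118111749, "DEL", (-3, 2), (-3, 2), include_ci=False))
# prints 8_118111672_118111749_DEL
```

When comparing v2.2 and v3.1 outputs line by line, matching on the VCF ID column rather than on the AnnotSV ID avoids this coordinate difference altogether.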

> In the original VCF file, line '8416' has the lower quality score version of this variant

The quality score is assigned to a variant by the SV caller, based on the BAM file. No link with AnnotSV.

Thank you for testing that. My apologies, to clarify: I do see the variant with the quality score of 25.6 in both v2.2 and v3.1 (and I understand that the score is assigned by the SV caller from the BAM, not by AnnotSV). I had noticed that the coordinates differ slightly, so thank you for the suggestion about the input coordinates setting! My question is about the same variant also appearing in the v2.2 annotated file as a separate entry with a different number of reads and a different quality score. The entry below, for example, appears in v2.2 but not in 3.1. It is the exact same variant as the one with the 25.6 quality score, yet my VCF file contains no record with the score of 274.52, even though that entry is in the v2.2 annotated file.

8_118111672_118111749_DEL | 8 | 118111672 | 118111749 | -77 | DEL | 2750 | N | <DEL> | 274.52 | . | SVTYPE=DEL;SVLEN=-77;END=118111749;STRANDS=+-:14;CIPOS=-10,9;CIEND=-10,9;CIPOS95=0,0;CIEND95=0,0;SU=14;PE=0;SR=14 | GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:RP:AP:AB | 0/1:14:0:14:146.76:274.52:-34,-7,-58:117:85:32:85:31:12:27:73:4:0.27 | split | EXT1 | NM_000127 | 0 | 78 | exon1-exon1 | 5'UTR | 118111672 | 118111749 |   | 0 | 0 | -1 |   | 0 | 0 | -1 |   | -1 | -1 |   | -1 | 8:19666270-133891475_INV;8:71454834-121935254_INV;8:74949084-131777713_INV;8:108282950-121900292_INV |   |   |   |   |   | 3 | 0 | confirmed/confirmed | monoallelic/monoallelic | loss of function/loss of function | HEREDITARY MULTIPLE EXOSTOSES TYPE 1/TRICHO-RHINO-PHALANGEAL SYNDROME TYPE 2 | 7550340;8981950;15253765;9326317/ | 1.49 | 0.43109281 | 2.5085447 | 0.99842455 | 0.88863009 | 0.72999618 | 0.99224326 | yes |   | 608177 | Chondrosarcoma, 215300 (3)/ Exostoses, multiple, type 1, 133700 (3) | AR/AD | 4
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --

> My question is regarding the same variant which also appears in v2.2 annotated file as a different variant with a different number of reads and a different quality score.

I can't reproduce that with v2.2 and original.lumpySVtyper.txt.

AnnotSV v2.2 only reports the SV with the quality score of 25.6:

Are you really sure you used the same input file?

Moreover, if you test v2.2 with my test.1variant.vcf file, you will see that no new variant is created (creating one wouldn't make sense, since AnnotSV is not a caller).

Sorry but I won't be able to find time to research your other variant, especially with such an old version (v2.2).
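One way to test whether an annotated file really came from a given VCF is to look up every annotated row in that VCF. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the annotated file is assumed to be tab-separated with the column names listed in the log above (`AnnotSV ID`, `SV chrom`, `SV start`, `ID`), and `SV start` is assumed to equal the VCF POS, which holds for v2.2 because it does not apply the confidence intervals.

```python
import csv
import io


def vcf_keys(vcf_lines):
    """Return the set of (chrom without 'chr', POS, ID) for all records in a VCF."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a VCF record
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        keys.add((chrom, fields[1], fields[2]))
    return keys


def rows_without_vcf_record(tsv_text, vcf_lines):
    """List (AnnotSV ID, VCF ID) pairs of annotated rows that have no record in the VCF."""
    known = vcf_keys(vcf_lines)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = []
    for row in reader:
        key = (row["SV chrom"], row["SV start"], row["ID"])
        pair = (row["AnnotSV ID"], row["ID"])
        # "full" and "split" rows repeat the same SV, so report each pair once
        if key not in known and pair not in missing:
            missing.append(pair)
    return missing
```

If the v2.2 annotated file lists rows such as ID 2750 that this check reports as missing from original.lumpySVtyper.txt, the annotated file was most likely produced from a different VCF than the one attached.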

That is my problem as well. I have not been able to reproduce it, even though I have rerun the pipeline from the beginning (i.e. redoing both the alignment and lumpy) and annotated again with v2.2: this variant has not come up, nor have the other variants that no longer appear. It is a real variant, though, so it should be called (that is more of a VCF problem than an AnnotSV one). It was the same file, as I did not change the script at all, and I have checked it. The strange thing is that this has happened with a few patients while the rest have been fine, and I haven't been able to reproduce it in any of the patients that had different variants added. I was hoping you might have an idea as to why these variants are all of a sudden in the file when they shouldn't be, but it is a strange issue.

No problem, thank you for trying to troubleshoot and checking that out and for your help!

Kind regards, Safaa

lgmgeo / AnnotSV

Addition of variants not found in original VCF #167
