lgmgeo / AnnotSV

Annotation and Ranking of Structural Variation
GNU General Public License v3.0

Multiple column and tab errors with CNVkit vcf output #241

Closed GACGAMA closed 2 months ago

GACGAMA commented 3 months ago

Hello! I'm trying to annotate a VCF file produced by CNVkit, using the latest conda version of AnnotSV:

AnnotSV -SvinputFile /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/results/file.vcf -annotationsDir /scratch4/nsobrei2/references/annotsv/AnnotSV_annotations -genomeBuild GRCh38 -outputFile /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/file_annotated.vcf -vcf 1

This works with Manta VCF files, but with CNVkit I'm getting several different errors:

-- genesAnnotation --
bedtools intersect -sorted -a /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/20240703-141212_AnnotSV_inputSVfile.formatted.sorted.bed -b /scratch4/nsobrei2/references/annotsv/AnnotSV_annotations/Annotations_Human/Genes/GRCh38/genes.RefSeq.sorted.bed -wa -wb > /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/BH11885_1_TUMOR_call_no_theta.cnv_annotated.vcf.tsv.tmp.tmp
Error: unable to open file or unable to determine types for file /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/20240703-141212_AnnotSV_inputSVfile.formatted.sorted.bed

- Please ensure that your file is TAB delimited (e.g., cat -t FILE).
- Also ensure that your file has integer chromosome coordinates in the
  expected columns (e.g., cols 2 and 3 for BED).

And

-- genesAnnotation --
bedtools intersect -sorted -a /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/20240703-141212_AnnotSV_inputSVfile.formatted.sorted.bed -b /scratch4/nsobrei2/references/annotsv/AnnotSV_annotations/Annotations_Human/Genes/GRCh38/genes.RefSeq.sorted.bed -wa -wb > /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/BH10097_1_TUMOR_call_no_theta.cnv_annotated.vcf.tsv.tmp.tmp
Error: line number 6525 of file /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/20240703-141212_AnnotSV_inputSVfile.formatted.sorted.bed has 23 fields, but 13 were expected.

And finally:

-- genesAnnotation --
bedtools intersect -sorted -a /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/20240703-140403_AnnotSV_inputSVfile.formatted.sorted.bed -b /scratch4/nsobrei2/references/annotsv/AnnotSV_annotations/Annotations_Human/Genes/GRCh38/genes.RefSeq.sorted.bed -wa -wb > /scratch4/nsobrei2/ggama1/somatic_SVs/cnvkit/vcfs/annotated/BH9149_TUMOR_call_no_theta_.cnv_annotated.vcf.tsv.tmp.tmp
Error: Type checker found wrong number of fields while tokenizing data line.
Perhaps you have extra TAB at the end of your line? Check with "cat -t"
Exit with error
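
The bedtools messages above all point to lines with an unexpected number of TAB-separated fields. As a rough diagnostic, assuming the intermediate .bed file is still on disk when you look (AnnotSV may remove it), the sketch below tallies how many lines have each field count. The helper name field_counts is hypothetical and not part of AnnotSV or bedtools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for a TAB-delimited file, print "<field count><TAB><number of lines>",
# sorted by field count. A well-formed BED file should yield a single row.
field_counts() {
    awk -F'\t' '{ counts[NF]++ } END { for (n in counts) print n "\t" counts[n] }' "$1" | sort -n
}
```

More than one row in the output (for example, some lines with 13 fields and others with 23) would suggest that the file contains lines of mixed layout.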

This is an example of the header of the VCF and one example line:


##fileformat=VCFv4.2
##fileDate=20240702
##source=CNVkit v0.9.11
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=FOLD_CHANGE,Number=1,Type=Float,Description="Fold change">
##INFO=<ID=FOLD_CHANGE_LOG,Number=1,Type=Float,Description="Log fold change">
##INFO=<ID=PROBES,Number=1,Type=Integer,Description="Number of probes in CNV">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=CNV,Description="Copy number variable region">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype quality">
##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events">
##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
chr1    61647885    .   N   <DEL>   .   .   IMPRECISE;SVTYPE=DEL;END=61653737;SVLEN=-5867;FOLD_CHANGE=0.3;FOLD_CHANGE_LOG=-2.5;PROBES=21    GT:GQ   1/1:25

I also validated the VCF produced by CNVkit with vcftools validate-vcf, which found nothing wrong with the file. There are no extra tabs either.

lgmgeo commented 3 months ago

Can you attach the VCF input file described above, so that I can try to reproduce the bug?

lgmgeo commented 2 months ago

Any news?

GACGAMA commented 2 months ago

Hi @lgmgeo. It seems I can only reproduce the issue when running AnnotSV in parallel:

parallel "AnnotSV ..." ::: /scratch/*vcf

I'm not sure why. When I run the same files and the same command in a for loop, everything works fine. I will close this issue, as it does not seem to be related to AnnotSV!
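
One detail in the logs above may be relevant here: the runs for BH11885 and BH10097 both report the same intermediate file, 20240703-141212_AnnotSV_inputSVfile.formatted.sorted.bed, in the same output directory. A possible explanation, not confirmed in this thread, is that jobs started in the same second share that timestamp-named file and overwrite each other's data. The sketch below works on that assumption and gives each sample its own output directory. The helper annotate_one and the variables ANNOTSV_BIN, ANNOTATIONS_DIR, and OUT_ROOT are hypothetical names chosen for illustration; the AnnotSV flags are the ones from the command at the top of the issue.

```shell
#!/usr/bin/env bash
# Sketch only: assumes AnnotSV writes its intermediate files next to -outputFile,
# as the paths in the logs suggest. Not a confirmed fix.
set -eu

# Hypothetical helper: annotate one VCF into a directory named after the sample.
annotate_one() {
    local vcf="$1"
    local sample
    sample="$(basename "$vcf" .vcf)"
    local outdir="${OUT_ROOT:-annotated}/${sample}"
    mkdir -p "$outdir"
    # ANNOTSV_BIN defaults to the AnnotSV executable found on PATH.
    "${ANNOTSV_BIN:-AnnotSV}" \
        -SvinputFile "$vcf" \
        -annotationsDir "${ANNOTATIONS_DIR:-AnnotSV_annotations}" \
        -genomeBuild GRCh38 \
        -outputFile "${outdir}/${sample}_annotated.tsv" \
        -vcf 1
}

# Annotate every VCF given on the command line, one after another.
for vcf in "$@"; do
    annotate_one "$vcf"
done
```

Saved as a script (for example annotate_all.sh, a hypothetical name), this could be called as bash annotate_all.sh /scratch/*vcf. To keep using GNU parallel, one option is to run export -f annotate_one in a bash session and then parallel annotate_one ::: /scratch/*vcf, since each job would then write into a separate directory.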

lgmgeo commented 2 months ago

Thanks for your reply