jpuntomarcos / CNVfilteR

R package to remove false positives of CNV calling tools by using SNV calls
5 stars 1 forks source link

Loading vcf file using loadVCFs #8

Closed dpacohen closed 2 years ago

dpacohen commented 2 years ago

Hi, I am trying to load my own VCF using loadVCFs:

CNV.GR <- loadCNVcalls(cnvs.file = "E:/Documents/CNV/Pre/D366R28.csv", chr.column = "Chr", start.column = "Start", end.column = "Stop", cnv.column = "Status", sample.column = "Barcode", deletion = c("Hemizygous deletion", "Homozygous deletion"), duplication = c("Gain","Amplification"), ignore.unexpected.rows = T, sep = ",", genome = "hg19")

vcf <- loadVCFs(vcf.files = "E:/Documents/VCF/D366R28.vcf", cnvs.gr = CNV.GR)

but I get an error saying: Scanning file C:\Users\David\AppData\Local\Temp\RtmpsZ1RZI/D366R28.vcf.gz... VarScan2 was found as source in the VCF metadata, RD will be used as ref allele depth field, AD will be used as alt allele depth field. Error in loadSNPsFromVCF(vcf.file = vcfFile, verbose = verbose, vcf.source = vcf.source, : dims [product 6738] do not match the length of object [13476]

Could you help me with this error?

jpuntomarcos commented 2 years ago

Hi,

Would you please provide a testing version of the CNV and VCF files? This way I can reproduce the error and identify the problem :)

Regards, J.

dpacohen commented 2 years ago

Hi, Here are my VCF and CNV files D366R28.zip

Best,

D

jpuntomarcos commented 2 years ago

Hi David,

I checked the VCF you provided and it does not seem to be properly formated. Although the source is Varscan2, your AD field is different from the common Varscan output.

This is the common AD field for Varscan ##FORMAT=

And this is the one in your VCF: ##FORMAT= But then, if you check the VCF, there is only one value for each AD field.

Since CNVfilteR is expecting single values in AD field but it is defined as Ref/alt list, it crashes. Did you use any tool that may be modifying the VCF header (for example varscan_vcf_remap)?

You can rebuild your VCF to meet VCF format. Alternatively, as a workaround, change R to 1 in the AD field and it should work.

Hope it helps, J.

dpacohen commented 2 years ago

Hi J, We get the VCF files from another university. The way they call the variant is as follows: Variant calling of both single nucleotide variations (SNVs) and small insertion/deletions (indels) was then performed on the processed alignment files using a combination of the mpileup module of SAMtools - taking into account anomalous read pairs, without base quality recalibration, considering a minimum mapping quality of 0 and base quality of 15, and a maximum read depth of 1M – and VarScan2 mpileup2cns (v2.4.3, Koboldt et al., 2009). Variants were reported if the number of reads supporting the alternative allele was superior or equal to 2, at a locus covered by at least 10 reads, and if the allelic ratio of this variant was superior or equal to 1%. When available, Varscan2 somatic mode was used to call variants using both tumor and matched-germline samples with similar parameters. A bug of VarScan2 mpileup2cns prevents SNVs presenting multi-allelic positions to be called: mpileup2snp was therefore used to catch these alterations that were then added to the variants call using an in-house R script.

Bcftools (v1.3.1, Li et al., 2011) “norm” module was then used to split multi-allelic variants into several rows and to left-align and normalize indels. Using a suite of in-house python (v2.7.13) scripts, variants were finally annotated with different metrics regarding variants context inside the reads (depth of coverage, strand bias, median base qualities, variant position...).

Is there an easy way to rebuild the VCF according to the VCF format or should I just apply the workaround method as you have mentioned?

Anyway many thanks for helping me out!

D

jpuntomarcos commented 2 years ago

You are welcome :)

Ideally I would let the university know that the VCF is not properly formated for the AD field, so they can check when and why it happens.

But if it is not practical, replacing R to 1 in that line solves the problem.

dpacohen commented 2 years ago

OK, thanks again!