jpuntomarcos / CNVfilteR

R package to remove false positives of CNV calling tools by using SNV calls
5 stars 1 forks source link

Missing value error #7

Closed goldpm1 closed 2 years ago

goldpm1 commented 2 years ago

Hi. I'm trying to run your tool for my whole genome germline CNV call. I successfully loaded CNVfiles into cnvs.gr object, and then I tried to load mulitple VCFS. The codes below shows that I tried to load two haplotypecaller vcf files.

cnvs.gr <- **loadCNVcalls**(cnvs.file = InputCNV, chr.column = "chr", start.column = "start", end.column = "end", cnv.column = "SVtype", sample.column = "sample", genome = "hg38")

pp = c()
for (idx in seq(3,4)){
  pp = c(pp, paste0(InputVCFdir,"/CR.", idx, ".snpindel.remdup.VQSR.vcf.gz"))
}
vcfs = **loadVCFs**(vcf.files = pp, cnvs.gr = cnvs.gr, vcf.source="HaplotypeCaller", genome = "hg38")

But, this error message was showed. It seems like missing value issue was arose.

Scanning file /data/project/craniosynostosis/1.VQSR/CR.3.snpindel.remdup.VQSR.vcf.gz...
HaplotypeCaller was found as source in the VCF metadata, AD will be used as allele support field in a list format: ref allele, alt allele.
Error in `[<-.data.frame`(`*tmp*`, mcolumns$alt.freq >= heterozygous.range[1] &  : 
  missing values are not allowed in subscripted assignments of data frames

Can you give me an information or to fix it? Thank you.

jpuntomarcos commented 2 years ago

Hi @goldpm1

Would it be possible for you to provide a VCF example? Of course, a VCF with just some variants and the whole header would be enough.

This way I can test it and fix the code if necessary :)

goldpm1 commented 2 years ago

CR.3.snpindel.remdup.pass.VQSR.txt

remerge.txt

Thank you for your reply. This one is a vcf file that includes header and some calls, total 10000 lines. (Please make sure this files should be transformmed to vcf file) If you need full-length file for this sample (CR.3), I can provide gz and tbi files also.

Each line is little bit verbose because it is 10X-barcoded linked-reads. GATK HaplotypeCaller was successfully run according to LongRanger pipeline. Except for those barcodes (BX tag), the others would be identical to those HaplotypeCaller runs.

The second files is my cnv call (bed file). In some cnv calls, loadVCFs successfully run successfully, but on some files (including that cnv call) the error mentioned above has occurred.

Thanks!!

[Additional comments]

I found the cause of this error while debugging your loadVCF.R Some VCFs have a null alt frequency, even though the exact reason in unknown. So, I would recommend that this single line should be added.

**mcolumns = mcolumns[is.na(mcolumns$alt.freq)==FALSE,]**
mcolumns[mcolumns$alt.freq >= heterozygous.range[1] & mcolumns$alt.freq <= heterozygous.range[2], "type"] <- "ht"
mcolumns[mcolumns$alt.freq >= homozygous.range[1] & mcolumns$alt.freq <= homozygous.range[2], "type"] <- "hm"

Thank You!

jpuntomarcos commented 2 years ago

Hi,

I have tested your example and it worked in my computer. I then modified remerge.txt file to have at least one CNV covering the variants in the CR.3 vcf file, and it also worked. This is the code:

library(CNVfilteR)
cnvs.gr <- loadCNVcalls(cnvs.file = InputCNV, chr.column = "chr", start.column = "start", end.column = "end", cnv.column = "SVtype", sample.column = "sample", genome = "hg38")
vcfs = loadVCFs(vcf.files = vcf.file, cnvs.gr = cnvs.gr, vcf.source="HaplotypeCaller", genome = "hg38")

I am wondering if the problematic VCF is not the one you provided me. Also, are you using last bioconductor and CNVfilteR versions? This is my sessionInfo():

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.62.0                   rtracklayer_1.54.0                Biostrings_2.62.0                
 [5] XVector_0.34.0                    GenomicRanges_1.46.1              GenomeInfoDb_1.30.1               IRanges_2.28.0                   
 [9] S4Vectors_0.32.3                  BiocGenerics_0.40.0               CNVfilteR_1.8.0
goldpm1 commented 2 years ago

Thank you for your sincere comment and great tool.

My VCF caller (haplotype caller) included some lines which don't include allelic count (alt, depth) which looks weird.

After removing those abnormal calls, the tool worked without problems.

Thanks!

jpuntomarcos commented 2 years ago

You are welcome :)

Anyway, do you think that these lines without AC are an unexpected error of Haplotype caller? If not, I could modify CNVfilteR to deal with these weird lines. In that case, a small failing example would be very appreciated.