knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

vcfR not recognizing all chromosomes #196

Closed lukaskon closed 2 years ago

lukaskon commented 2 years ago

When I try to read in my filtered vcf file, it only recognizes 16 of the 18 chromosomes. I unzipped and checked with head(filtered.vcf) and it says it has 18. Any idea what the reason for this would be? Size related? Happy to provide any additional information.

> vcf <- read.vcfR("filtered.vcf.gz", verbose = FALSE)
> vcf
***** Object of Class vcfR *****
96 samples
16 CHROMs
365,842 variants
Object size: 464.2 Mb
0 percent missing data
*****        *****         *****
$ head -50 filtered.vcf

##contig=<ID=1,length=4109373>
##contig=<ID=2,length=3341473>
##contig=<ID=3,length=3226611>
##contig=<ID=4,length=2468882>
##contig=<ID=5,length=2959378>
##contig=<ID=6,length=2725906>
##contig=<ID=7,length=2652353>
##contig=<ID=8,length=2617329>
##contig=<ID=9,length=2547566>
##contig=<ID=10,length=2419276>
##contig=<ID=11,length=2359939>
##contig=<ID=12,length=2352958>
##contig=<ID=13,length=2257609>
##contig=<ID=14,length=2138025>
##contig=<ID=15,length=2027721>
##contig=<ID=16,length=1969743>
##contig=<ID=17,length=247158>
##contig=<ID=18,length=208766>

If I ignore and create a chrom object:

chrom <- create.chromR(name="Supercontig", vcf=vcf, seq=dna, ann=gff, verbose=TRUE)
vcfR object includes more than one chromosome (CHROM).
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
Subsetting to the first chromosome
DNAbin object includes more than one chromosome.
1 dna:chromosome chromosome:ASM83294v1:1:1:4109373:1 REF, 2 dna:chromosome chromosome:ASM83294v1:2:1:3341473:1 REF, 3 dna:chromosome chromosome:ASM83294v1:3:1:3226611:1 REF, 4 dna:chromosome chromosome:ASM83294v1:4:1:2468882:1 REF, 5 dna:chromosome chromosome:ASM83294v1:5:1:2959378:1 REF, 6 dna:chromosome chromosome:ASM83294v1:6:1:2725906:1 REF, 7 dna:chromosome chromosome:ASM83294v1:7:1:2652353:1 REF, 8 dna:chromosome chromosome:ASM83294v1:8:1:2617329:1 REF, 9 dna:chromosome chromosome:ASM83294v1:9:1:2547566:1 REF, 10 dna:chromosome chromosome:ASM83294v1:10:1:2419276:1 REF, 11 dna:chromosome chromosome:ASM83294v1:11:1:2359939:1 REF, 12 dna:chromosome chromosome:ASM83294v1:12:1:2352958:1 REF, 13 dna:chromosome chromosome:ASM83294v1:13:1:2257609:1 REF, 14 dna:chromosome chromosome:ASM83294v1:14:1:2138025:1 REF, 15 dna:chromosome chromosome:ASM83294v1:15:1:2027721:1 REF, 16 dna:chromosome chromosome:ASM83294v1:16:1:1969743:1 REF, 17 dna:chromosome chromosome:ASM83294v1:17:1:247158:1 REF, 18 dna:chromosome chromosome:ASM83294v1:18:1:208766:1 REF
Subsetting to the first chromosome
Annotations include more than one chromosome.
1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 2, 3, 4, 5, 6, 7, 8, 9
Subsetting to the first chromosome
Names in vcf:
  1
Names of sequences:
  1 dna:chromosome chromosome:ASM83294v1:1:1:4109373:1 REF
Names in annotation:
  1
Initializing var.info slot.
var.info slot initialized.
Warning message:
In create.chromR(name = "Supercontig", vcf = vcf, seq = dna, ann = gff,  : 
        Names in variant data and sequence data do not match perfectly.
        If you choose to proceed, we'll do our best to match the data.
        But prepare yourself for unexpected results.
knausb commented 2 years ago

Is it possible that you do not have variants for some of your chromosomes? If your reference had 18 chromosomes you should see that in the meta region. But that does not mean you have variants for all of those chromosomes. If no variants were called for two of your chromosomes, or your filtering removed them, you'll only have variants for 16 chromosomes. Something like the following may help.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")

vcfR_test@meta <- c(
  vcfR_test@meta[1:5],
  c("##contig=<ID=1,length=4109373>",
  "##contig=<ID=2,length=3341473>",
  "##contig=<ID=3,length=3226611>",
  "##contig=<ID=4,length=2468882>",
  "##contig=<ID=5,length=2959378>",
  "##contig=<ID=6,length=2725906>",
  "##contig=<ID=7,length=2652353>",
  "##contig=<ID=8,length=2617329>",
  "##contig=<ID=9,length=2547566>",
  "##contig=<ID=10,length=2419276>",
  "##contig=<ID=11,length=2359939>",
  "##contig=<ID=12,length=2352958>",
  "##contig=<ID=13,length=2257609>",
  "##contig=<ID=14,length=2138025>",
  "##contig=<ID=15,length=2027721>",
  "##contig=<ID=16,length=1969743>",
  "##contig=<ID=17,length=247158>",
  "##contig=<ID=18,length=208766>"),
  vcfR_test@meta[6:18]
  )

# Your 18 plus one.
length(grep("contig", vcfR_test@meta))
#> [1] 19
length(unique(getCHROM(vcfR_test)))
#> [1] 1

Created on 2022-03-08 by the reprex package (v2.0.1)

Does this address your issue?

lukaskon commented 2 years ago

I should have checked the meta region, thank you! That was exactly it.