knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

0 percent missing data #190

Closed redgcko7 closed 2 years ago

redgcko7 commented 2 years ago

Hello, I am wondering why the missing data percent is always reported as zero after importing a .vcf file as an object and then printing the object name to show its contents. I have tried multiple vcf files where I know there is not zero missing data. Is this missing data here not referring to the SNP matrix, but some other type of data? Thanks

Example:

vcf <- read.vcfR("example.vcf")
vcf
***** Object of Class vcfR *****
21 samples
249 CHROMs
685 variants
Object size: 0.4 Mb
0 percent missing data
*****        *****         *****
knausb commented 2 years ago

Hola @redgcko7 , yeah, missing data provided by the show() method has been a bit of a challenge. I see the show() method as a quick way to get information about an object. Sample and variant counts are important here to validate you've read in an entire VCF file, or perhaps that your file is incomplete. Having a summary of how much missing data you have would be nice too, right? I'm afraid that the short answer is that it's complicated.

A VCF record that contains a missing genotype may appear as follows.

FORMAT       Sample1
"GT:GQ:DP" "./.:0:1

Here the genotype is missing (./.) but there is associated information as well, such as the genotype quality and depth of sequencing. The associated information may provide information on why the genotypic state was inferred as reported. Here we have zero sequencing depth so calling a missing genotype may seem reasonable. However, from a computing perspective we actually have data here (a string). In order to determine that this is a missing genotype we need to isolate the genotype with extract.gt() which should convert "./." to NA, because NA is a concept that exists in R unlike many other languages. Once we have a matrix of genotypes with NA for missing data we can query it with is.na(). This all comes with a performance cost. In the context of the show() method it does not make sense to me to incur this performance cost because it doesn't return anything. I think that if you're going to extract a matrix of genotypes you might as well keep it for other things. So the show() method is an attempt to get a quick check on what your data look like. If you really want to know how many missing genotypes you have you'll want to extract the genotypes. If you have a better solution let me know!

redgcko7 commented 2 years ago

That makes sense, thanks for your helpful reply!