vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0

Genotype annotations #26

Open gaow opened 7 years ago

gaow commented 7 years ago

A typical genotype entry looks like:

0/0:43,0:43:92:0,92,1267

The first part, 0/0, is the actual genotype; the others are genotype annotations. In our current implementation (vat 2.0 hereafter) we import GT by default and the other fields optionally. We support importing everything because we want to be able to create filters when performing quality control or calculating summary statistics.
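To make the split between GT and its annotations concrete, here is a minimal Python sketch that separates the example entry into its fields. The issue shows only the sample entry, not the FORMAT column, so the field names `GT:AD:DP:GQ:PL` are an assumption (it is a common layout for entries of this shape), and the helper names are hypothetical, not part of vtools.

```python
# Assumed FORMAT string: the issue shows only the sample entry,
# so these field names are a guess based on a common VCF layout.
ASSUMED_FORMAT = "GT:AD:DP:GQ:PL"


def parse_genotype_entry(entry, fmt=ASSUMED_FORMAT):
    """Split one VCF sample entry into a dict keyed by FORMAT field name."""
    keys = fmt.split(":")
    values = entry.split(":")
    if len(values) > len(keys):
        raise ValueError("entry has more fields than the FORMAT string")
    return dict(zip(keys, values))


def split_gt_and_annotations(entry, fmt=ASSUMED_FORMAT):
    """Return the genotype call and the remaining genotype annotations."""
    fields = parse_genotype_entry(entry, fmt)
    gt = fields.pop("GT")
    return gt, fields


gt, info = split_gt_and_annotations("0/0:43,0:43:92:0,92,1267")
print(gt)    # 0/0
print(info)  # {'AD': '43,0', 'DP': '43', 'GQ': '92', 'PL': '0,92,1267'}
```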

However, in many scenarios the genotype data have already been QC-ed. We may also start from un-QC-ed genotype data, yet after QC we no longer need the other genotype information. That is when we may want to create new projects that keep only the GT info.

Can we make each field in genotype data a separate data matrix? For example, a project could look like:

project.variants
project.GT
project.DP

And our filtering would be

vtools samples <various geno_info based filtering> -t project.gmask
vtools select project.gmask project.GT ..

where gmask is a sparse matrix of zeros and ones. Zero means the entry is excluded, and one means it is included, when computing other statistics.
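As a rough illustration of the mask idea, the Python sketch below derives a gmask from a DP matrix and then uses it when computing a statistic on GT. Everything here is hypothetical: the toy matrices, the depth threshold, the function names, and the choice to store the mask as the set of coordinates whose value is one. It is not how vtools stores or filters data.

```python
# Toy per-field matrices: rows are variants, columns are samples.
# GT is coded as the number of alternate alleles (0, 1, 2); None is a missing call.
GT = [
    [0, 1, 2],
    [1, 0, None],
    [2, 2, 0],
]
DP = [
    [43, 5, 30],
    [12, 40, 0],
    [8, 25, 33],
]


def build_gmask(dp, min_dp):
    """Return the (variant, sample) coordinates whose mask value is one."""
    return {
        (i, j)
        for i, row in enumerate(dp)
        for j, depth in enumerate(row)
        if depth >= min_dp
    }


def alt_allele_freq(gt, gmask):
    """Per-variant alternate allele frequency over included, non-missing calls."""
    freqs = []
    for i, row in enumerate(gt):
        calls = [g for j, g in enumerate(row) if (i, j) in gmask and g is not None]
        freqs.append(sum(calls) / (2 * len(calls)) if calls else None)
    return freqs


gmask = build_gmask(DP, min_dp=10)
print(alt_allele_freq(GT, gmask))  # [0.5, 0.25, 0.5]
```

If most entries pass the filter, storing the coordinates of the zeros instead would be the more compact choice; the lookup logic stays the same with the test inverted.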

gaow commented 7 years ago

BTW, genotype information is important for QC, but people usually just do QC first and stick to the VCF file after QC. I have a standard gzipped VCF file that is 130GB including all genotype info, for WGS of 650 samples. But after QC and keeping only GT, the gzipped file size becomes 2.7GB. I can imagine it may be even smaller if we use a sparse matrix for it.
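The sparse-matrix intuition can be sketched in Python as follows: if most calls are homozygous reference, only the remaining entries need to be stored. This is a toy illustration with hypothetical names and data, and it assumes GT is coded as an alternate allele count with 0 as the implicit default. It says nothing about how the result would compare with gzip on real data.

```python
def to_sparse(gt_rows):
    """Keep only entries that differ from homozygous reference (0)."""
    return {
        (i, j): g
        for i, row in enumerate(gt_rows)
        for j, g in enumerate(row)
        if g != 0
    }


def to_dense(sparse, n_variants, n_samples):
    """Rebuild the full matrix, filling unstored entries with 0."""
    return [
        [sparse.get((i, j), 0) for j in range(n_samples)]
        for i in range(n_variants)
    ]


dense = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
]
sparse = to_sparse(dense)
print(len(sparse))  # 2 stored entries instead of 12
print(to_dense(sparse, 3, 4) == dense)  # True
```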