gaow opened 7 years ago
BTW, genotype information is important for QC, but people usually just do QC first and stick to the VCF file after QC. I have a standard gzipped VCF file that is 130GB including all genotype info, for WGS of 650 samples. But after QC and keeping only GT, the gzipped file size becomes 2.7GB. I can imagine it may be even smaller if we use a sparse matrix for it.
A typical genotype entry looks like:
0/0:43,0:43:92:0,92,1267
The first part, 0/0, is the actual genotype; the others are genotype annotations. In our current implementation (vat 2.0 hereafter) we import GT by default and the others optionally. We import the other fields because we want to be able to create filters when performing quality control or calculating summary stats. However, in many scenarios the genotype data has already been QC-ed. We may also start from un-QC-ed genotype data, yet after QC we no longer need the other genotype information. That is when we may want to create new projects that keep only the GT info.
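To make the entry layout concrete, here is a minimal sketch of splitting a genotype entry into its fields and of keeping only GT. The FORMAT string GT:AD:DP:GQ:PL is an assumption (it matches the shape of the example entry but is not stated above), and the function names are hypothetical, not part of vat.

```python
# Assumed FORMAT string; the real one comes from the VCF FORMAT column.
ASSUMED_FORMAT = "GT:AD:DP:GQ:PL"


def parse_genotype(entry, fmt=ASSUMED_FORMAT):
    """Split one colon-separated genotype entry into a {field: value} dict."""
    keys = fmt.split(":")
    values = entry.split(":")
    return dict(zip(keys, values))


def keep_gt_only(entry):
    """Return just the genotype call, dropping all annotations."""
    # GT is the first colon-separated field when it is present.
    return entry.split(":", 1)[0]


entry = "0/0:43,0:43:92:0,92,1267"
fields = parse_genotype(entry)
gt = keep_gt_only(entry)
```

Here `fields` holds every annotation as a string keyed by field name, which is what QC filters would read, while `gt` is the only part a GT-only project would store.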
Can we make each field in genotype data a separate data matrix? For example we have a project that looks like:
And our filtering would be
where gmask is a sparse matrix of zeros and ones. Zero means the entry is to be excluded, one means it is to be included, in computing other statistics.
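One way to picture the idea is the sketch below, which stores each genotype field as its own variants-by-samples matrix and derives a 0/1 gmask from the annotation matrices. Everything in it is an assumption for illustration: the field names DP and GQ, the toy values, the thresholds, and the function name. It uses dense NumPy arrays for brevity; a sparse matrix type could hold gmask in a real project.

```python
import numpy as np

# Hypothetical project: one matrix per genotype field,
# rows are variants and columns are samples.
GT = np.array([[0, 1, 2],
               [0, 0, 1]])      # alternate allele counts
DP = np.array([[43, 5, 30],
               [12, 40, 8]])    # read depth (assumed field)
GQ = np.array([[92, 10, 99],
               [35, 99, 15]])   # genotype quality (assumed field)

# gmask: 1 means include the entry, 0 means exclude it.
# The thresholds are arbitrary illustrative values.
gmask = ((DP >= 10) & (GQ >= 20)).astype(np.int8)


def masked_alt_allele_freq(gt, mask):
    """Alternate allele frequency per variant over included entries only."""
    called = mask.sum(axis=1)        # number of included samples per variant
    alt = (gt * mask).sum(axis=1)    # alternate alleles among included samples
    freq = np.full(gt.shape[0], np.nan)
    # Variants with no included samples stay NaN instead of dividing by zero.
    np.divide(alt, 2 * called, out=freq, where=called > 0)
    return freq


freq = masked_alt_allele_freq(GT, gmask)
```

Once gmask has been computed, the DP and GQ matrices are no longer needed for downstream statistics, so a GT-only project would only have to carry GT and gmask.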