vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0

Size of genotype database #99

Open gaow opened 5 years ago

gaow commented 5 years ago

Here I compare the size of the VCF input data with the size of the genotype databases generated from it:

[GW] ll *.vcf.gz
-rw-rw-r-- 1 gaow gaow 977K Mar 22 11:27 YRI.exon.2010_03.genotypes.vcf.gz
-rw-rw-r-- 1 gaow gaow 593K Mar 22 11:27 CEU.exon.2010_03.genotypes.vcf.gz
[GW] ll *.h5
-rw-rw-r-- 1 gaow gaow 3.1M Mar 22 11:38 tmp_1_90_genotypes.h5
-rw-rw-r-- 1 gaow gaow 5.4M Mar 22 11:38 tmp_91_202_genotypes.h5
-rw-rw-r-- 1 gaow gaow  11M Mar 22 11:39 tmp_1_90_genotypes_multi_genes.h5

I think the genotype data is unreasonably large, isn't it? The first two HDF5 files alone total 8.5M, against about 1.5M of gzipped VCF input.

BTW, this is the result of running this notebook:

https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb

gaow commented 5 years ago

BTW, if I revert to the previous version, the database size is:

-rw-r--r-- 1 student student 9.8M Mar 22 18:05 demo_genotype.DB

which is roughly 3.1M + 5.4M = 8.5M? I am not sure what accounts for the additional multi_genes.h5 file.
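For reference, the comparison can be spelled out as a quick calculation. This is only a sketch using the sizes printed above, assuming `ll` reports K and M as KiB and MiB; the variable names are made up for illustration.

```python
# Sizes copied from the `ll` listings in this thread.
# Assumption: K and M are KiB and MiB, as in `ls -lh`.
vcf_kib = {
    "YRI.exon.2010_03.genotypes.vcf.gz": 977,
    "CEU.exon.2010_03.genotypes.vcf.gz": 593,
}
h5_mib = {
    "tmp_1_90_genotypes.h5": 3.1,
    "tmp_91_202_genotypes.h5": 5.4,
}
multi_genes_mib = 11.0  # tmp_1_90_genotypes_multi_genes.h5
old_db_mib = 9.8        # demo_genotype.DB from the previous version

vcf_total_mib = sum(vcf_kib.values()) / 1024
h5_total_mib = sum(h5_mib.values())

# How much larger the HDF5 genotype files are than the gzipped VCF input.
ratio_h5_to_vcf = h5_total_mib / vcf_total_mib
# How the two HDF5 files compare with the old single database file.
ratio_h5_to_old_db = h5_total_mib / old_db_mib

print(f"VCF input total:    {vcf_total_mib:.2f} MiB")
print(f"HDF5 total:         {h5_total_mib:.2f} MiB")
print(f"HDF5 + multi_genes: {h5_total_mib + multi_genes_mib:.2f} MiB")
print(f"HDF5 / VCF:         {ratio_h5_to_vcf:.1f}x")
print(f"HDF5 / old DB:      {ratio_h5_to_old_db:.2f}")
```

The two HDF5 files together come out somewhat smaller than the old database, so the new format on its own is not worse; it is the multi_genes.h5 file that pushes the total well past the old size.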

gaow commented 5 years ago

Ideas to reduce the size of the genotype file include 1) a better compression method in HDF5 and 2) using numpy.float16 for genotype storage (better not to use int, in case there is imputed data).
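A rough way to get a feel for both ideas without touching the real files is sketched below. It is not how varianttools stores genotypes: it uses only the standard library, with `struct`'s `e` format standing in for numpy.float16 (both are IEEE 754 half precision) and `zlib` standing in for HDF5's gzip filter (both use DEFLATE). The matrix dimensions and genotype frequencies are invented for illustration.

```python
import random
import struct
import zlib

random.seed(0)

# Hypothetical toy data: 2000 variants x 90 samples, mostly reference calls.
n_variants, n_samples = 2000, 90
genotypes = random.choices(
    [0.0, 1.0, 2.0], weights=[90, 8, 2], k=n_variants * n_samples
)


def pack(values, fmt):
    """Pack a flat list of floats as little-endian values of the given struct format."""
    return struct.pack(f"<{len(values)}{fmt}", *values)


raw64 = pack(genotypes, "d")  # 8 bytes per genotype (float64)
raw16 = pack(genotypes, "e")  # 2 bytes per genotype (float16)

# DEFLATE at the highest level, as a stand-in for HDF5 gzip compression.
zip64 = zlib.compress(raw64, 9)
zip16 = zlib.compress(raw16, 9)

for label, blob in [("float64 raw", raw64), ("float16 raw", raw16),
                    ("float64 deflate", zip64), ("float16 deflate", zip16)]:
    print(f"{label:16s} {len(blob):>9d} bytes")

# Hard calls 0, 1, 2 are exact in float16; an imputed dosage is only approximate.
dosage = 1.37
dosage_roundtrip = struct.unpack("<e", struct.pack("<e", dosage))[0]
print(f"dosage {dosage} stored as float16 reads back as {dosage_roundtrip}")
```

The float16 encoding cuts the uncompressed size to a quarter of float64 while still holding fractional dosages, at the cost of about three decimal digits of precision. How much it helps after compression depends on the data, so this would need checking against the actual genotype files.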