Open gaow opened 5 years ago
BTW, if I revert to the previous version, the database size is:
-rw-r--r-- 1 student student 9.8M Mar 22 18:05 demo_genotype.DB
which is roughly 3.1M + 5.4M? Not sure what is going on with multi_genes.h5.
Ideas to reduce the size of the genotype file include 1) a better compression method in HDF5 and 2) using numpy.float16 for genotype storage (better not to use int, in case there is imputed data).
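To get a rough feel for how much these two ideas could save, here is a minimal sketch. It does not touch the actual database code: the dosage matrix is simulated (its shape and the 5% imputed fraction are made-up assumptions), and `zlib` is used only as a stand-in for HDF5's gzip filter, which is also deflate-based. Real HDF5 numbers will differ depending on chunking and filter settings.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genotype matrix: 200 samples x 500 variants, dosages 0/1/2.
geno = rng.integers(0, 3, size=(200, 500)).astype(np.float64)

# Assumption: about 5% of entries are imputed (fractional dosages in [0, 2)).
imputed = rng.random(geno.shape) < 0.05
geno[imputed] = rng.uniform(0.0, 2.0, size=int(imputed.sum()))

report = {}
for dtype in (np.float64, np.float32, np.float16):
    arr = geno.astype(dtype)
    raw = arr.tobytes()
    report[np.dtype(dtype).name] = {
        "raw_bytes": len(raw),
        # Deflate at level 9, as a proxy for an HDF5 gzip filter.
        "deflate_bytes": len(zlib.compress(raw, 9)),
        # Worst-case rounding error introduced by the narrower dtype.
        "max_abs_error": float(np.max(np.abs(arr.astype(np.float64) - geno))),
    }

for name, stats in report.items():
    print(name, stats)
```

In this sketch float16 keeps the hard calls 0/1/2 exact and rounds imputed dosages in [0, 2) by at most about 5e-4, which may or may not be acceptable depending on how the dosages are used downstream.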
Here I compare the size of the VCF input data and the genotype database generated:
I think the genotype data is unreasonably large ... isn't it?
BTW, this is the result of running this notebook:
https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb