selected search results :
https://academic.oup.com/database/article/doi/10.1093/database/baz096/5566651
"... HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix."
"To determine the ideal technology to serve as the backend of the GOBii-GDM, testing was performed using a large genotype-by-sequencing (GBS) dataset (15, 16, 19). Open-source RDBMS, PostgreSQL and MariaDB, a community-developed fork under the GNU GPL of MySQL, were used as a baseline for performance testing and compared with HDF5, MonetDB, Elasticsearch (17), Spark (18), and MongoDB. "
https://github.com/gobiiproject/GOBii-System
https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/genomics-storing-genome-data-paper.pdf
"... The low-level storage format
enables faster and more efficient retrievals from disk compared to the use of files.
Additionally, using libraries optimized for Intel® architecture to compress data
on disk, GenomicsDB cumulatively achieves orders of magnitude improvement
in performance compared to existing tools. In addition, the generalized
multi-dimensional array model provides flexibility for GenomicsDB to be extended to
other types of genome data. ... "
check whether : each DivBrowse server provides access to only a single .vcf.gz file
The lightweight genotype service will need to track large numbers of samples / assays / VCFs, so it will require database capability.
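A minimal sketch of what that tracking could look like, assuming SQLite through Python's standard sqlite3 module. The table names (sample, vcf_file, sample_vcf) and the helper functions are hypothetical and not part of any existing design; assays are left out for brevity and could be added as a further table in the same way.

```python
import sqlite3

# Hypothetical catalog schema: all table and column names are illustrative.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sample (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS vcf_file (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS sample_vcf (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    vcf_id    INTEGER NOT NULL REFERENCES vcf_file(id),
    UNIQUE (sample_id, vcf_id)
);
"""


def open_catalog(db_path=":memory:"):
    """Open (or create) the catalog database and ensure the tables exist."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    return conn


def register(conn, sample_name, vcf_path):
    """Record that a sample appears in a VCF; repeated calls are no-ops."""
    conn.execute("INSERT OR IGNORE INTO sample (name) VALUES (?)", (sample_name,))
    conn.execute("INSERT OR IGNORE INTO vcf_file (path) VALUES (?)", (vcf_path,))
    conn.execute(
        """INSERT OR IGNORE INTO sample_vcf (sample_id, vcf_id)
           SELECT s.id, v.id FROM sample s, vcf_file v
           WHERE s.name = ? AND v.path = ?""",
        (sample_name, vcf_path),
    )
    conn.commit()


def vcfs_for_sample(conn, sample_name):
    """Return the paths of all VCFs that contain the given sample, sorted."""
    rows = conn.execute(
        """SELECT v.path FROM vcf_file v
           JOIN sample_vcf sv ON sv.vcf_id = v.id
           JOIN sample s ON s.id = sv.sample_id
           WHERE s.name = ? ORDER BY v.path""",
        (sample_name,),
    )
    return [row[0] for row in rows]
```

With a catalog like this, the service could resolve a sample name to the files that contain it before opening any VCF. Whether SQLite is adequate at the expected scale is an open question; the same schema would carry over to PostgreSQL or another RDBMS.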
draft architecture :
components :
lightweight genotype service
Pretzel
tasks