plantinformatics / pretzel

Javascript full-stack framework for Big Data visualisation and analysis
GNU General Public License v3.0

lightweight genotype service #355

Open Don-Isdale opened 1 year ago

Don-Isdale commented 1 year ago

draft architecture :

components :

- lightweight genotype service
- Pretzel

tasks

selected search results :

https://academic.oup.com/database/article/doi/10.1093/database/baz096/5566651

"... HDF5 consistently performed best, in part because it can more efficiently work with both orientations of the allele matrix."

"To determine the ideal technology to serve as the backend of the GOBii-GDM, testing was performed using a large genotype-by-sequencing (GBS) dataset (15, 16, 19). Open-source RDBMS, PostgreSQL and MariaDB, a community-developed fork under the GNU GPL of MySQL, were used as a baseline for performance testing and compared with HDF5, MonetDB, Elasticsearch (17), Spark (18), and MongoDB."

https://github.com/gobiiproject/GOBii-System
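To make the "both orientations of the allele matrix" point concrete, here is a minimal sketch in plain Python. It is an illustration only and does not describe how HDF5 or GOBii-GDM store data; the matrix values, the 0/1/2 call coding, and all function names are hypothetical. It keeps two flattened copies of a small markers x samples matrix, so that a per-marker query and a per-sample query each read one contiguous slice.

```python
# Illustrative only: a tiny allele matrix (markers x samples) held as flat bytes.
# Values, dimensions and names are made up for this sketch.
N_MARKERS, N_SAMPLES = 4, 3


def build_row_major(rows):
    """Flatten a list of equal-length rows into one bytes object."""
    return bytes(v for row in rows for v in row)


def transpose(flat, n_rows, n_cols):
    """Return the transposed matrix, also flattened row by row."""
    return bytes(flat[r * n_cols + c] for c in range(n_cols) for r in range(n_rows))


# Genotype calls coded 0, 1, 2 (an assumed coding); one row per marker.
calls = [
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 0],
    [0, 2, 1],
]

marker_major = build_row_major(calls)  # one marker's calls are contiguous
sample_major = transpose(marker_major, N_MARKERS, N_SAMPLES)  # one sample's calls are contiguous


def calls_for_marker(m):
    """All samples for one marker: a contiguous slice of the marker-major copy."""
    start = m * N_SAMPLES
    return list(marker_major[start:start + N_SAMPLES])


def calls_for_sample(s):
    """All markers for one sample: a contiguous slice of the sample-major copy."""
    start = s * N_MARKERS
    return list(sample_major[start:start + N_MARKERS])


def calls_for_sample_strided(s):
    """Same result from the marker-major copy, but it touches every row."""
    return list(marker_major[s::N_SAMPLES])
```

With only one orientation on disk, one of the two query shapes becomes a strided read across the whole matrix, which is the cost that a second orientation (or a chunked layout) is meant to avoid.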

https://www.intel.in/content/dam/www/public/us/en/documents/white-papers/genomics-storing-genome-data-paper.pdf

"... The low-level storage format enables faster and more efficient retrievals from disk compared to the use of files. Additionally, using libraries optimized for Intel® architecture to compress data on disk, GenomicsDB cumulatively achieves orders of magnitude improvement in performance compared to existing tools. In addition, the generalized multi-dimensional array model provides flexibility for GenomicsDB to be extended to other types of genome data. ... "

The lightweight genotype service will need to track large numbers of samples / assays / VCFs, so database capability will be required for that.
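As a rough sketch of what that tracking database could look like, the following uses Python's standard `sqlite3` module. The table names, columns, and the assumption that each sample row belongs to one assay and one VCF are all hypothetical; nothing here is taken from Pretzel or from a decided design, and a production service might well use a different database.

```python
import sqlite3

# Hypothetical catalogue schema: which samples appear in which VCF, under which assay.
SCHEMA = """
CREATE TABLE IF NOT EXISTS vcf_file (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS assay (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS sample (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    assay_id INTEGER NOT NULL REFERENCES assay(id),
    vcf_id   INTEGER NOT NULL REFERENCES vcf_file(id),
    UNIQUE (name, vcf_id)
);
CREATE INDEX IF NOT EXISTS sample_by_name ON sample(name);
"""


def open_catalogue(path=":memory:"):
    """Open (or create) the catalogue database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def register_vcf(conn, vcf_path, assay_name, sample_names):
    """Record one VCF, its assay, and the samples it contains in one transaction."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO assay(name) VALUES (?)", (assay_name,))
        assay_id = conn.execute(
            "SELECT id FROM assay WHERE name = ?", (assay_name,)
        ).fetchone()[0]
        vcf_id = conn.execute(
            "INSERT INTO vcf_file(path) VALUES (?)", (vcf_path,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO sample(name, assay_id, vcf_id) VALUES (?, ?, ?)",
            [(name, assay_id, vcf_id) for name in sample_names],
        )
    return vcf_id


def vcfs_for_sample(conn, sample_name):
    """Return the paths of all VCFs that contain the named sample, sorted by path."""
    rows = conn.execute(
        "SELECT v.path FROM sample s JOIN vcf_file v ON v.id = s.vcf_id "
        "WHERE s.name = ? ORDER BY v.path",
        (sample_name,),
    )
    return [row[0] for row in rows]
```

For example, after registering `chr1A.vcf.gz` with samples `S1` and `S2` and `chr1B.vcf.gz` with sample `S2`, `vcfs_for_sample(conn, "S2")` returns both paths. The genotype calls themselves would stay in the VCFs or an array store; this catalogue only answers where to look.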