ornl-oxford / genben

Benchmarking of software frameworks and systems for storage and compute over large-scale genomic data.
MIT License

Standardize Genotype Array Between Different Benchmarks #44

eauel closed 5 years ago

eauel commented 5 years ago

This PR standardizes the way the genotype array is created when running the simple aggregations and PCA benchmarks. Previously, the genotype array type could only be changed via the config for the PCA benchmark, whereas the simple aggregations benchmark always used a GenotypeDaskArray object.

The pca_genotype_array_type parameter in the user config has been renamed to genotype_array_type since it now controls the genotype array used for all benchmarks.
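As a rough illustration of the rename, the sketch below reads the new key from an INI-style config using Python's standard `configparser`. The `[benchmark]` section name, the string values (`dask`, `chunked`), the default, and the fallback to the old key are all assumptions made for this example; they are not taken from the project's actual config handling.

```python
import configparser

# Hypothetical section name and value encoding; the real config may differ.
SECTION = "benchmark"
NEW_KEY = "genotype_array_type"
OLD_KEY = "pca_genotype_array_type"  # name used before this PR


def read_genotype_array_type(config_text, default="dask"):
    """Return the configured genotype array type, accepting the legacy key."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section(SECTION):
        return default
    section = parser[SECTION]
    if NEW_KEY in section:
        return section[NEW_KEY].strip().lower()
    if OLD_KEY in section:
        # Assumed behaviour: honour the old name, now applied to all benchmarks.
        return section[OLD_KEY].strip().lower()
    return default


new_style = "[benchmark]\ngenotype_array_type = chunked\n"
old_style = "[benchmark]\npca_genotype_array_type = Dask\n"
print(read_genotype_array_type(new_style))
print(read_genotype_array_type(old_style))
```

Whether old configs should keep working through a fallback like this, or fail loudly so users notice the rename, is a design choice this sketch does not settle.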

This PR is a first step towards the plan outlined here: https://github.com/ornl-oxford/genomics-benchmarks/issues/43#issuecomment-454563777. Specifically, creating the genotype array only once per benchmark iteration removes redundant code and will allow a future PR to limit the number of variants/samples used across all benchmarks. It also avoids confusion, since users may not have realized that the simple aggregations benchmark previously always used a Dask array.
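The "create once per iteration" idea can be sketched as follows. This is a minimal, library-free illustration: `run_iteration`, `make_array`, and the two toy benchmark functions are hypothetical names, and a nested list stands in for a real genotype array (variants x samples x ploidy).

```python
def run_iteration(make_genotype_array, benchmarks):
    """Build the genotype array once, then hand it to every benchmark."""
    genotypes = make_genotype_array()
    return {name: bench(genotypes) for name, bench in benchmarks.items()}


calls = []  # records how many times the array was built


def make_array():
    calls.append("built")
    # Toy data: 2 variants x 2 samples x ploidy 2.
    return [[[0, 1], [1, 1]], [[0, 0], [0, 1]]]


def count_alt_alleles(genotypes):
    # Stand-in for a simple aggregation.
    return sum(a for variant in genotypes for call in variant for a in call)


def count_variants(genotypes):
    # Stand-in for a heavier benchmark such as PCA.
    return len(genotypes)


results = run_iteration(
    make_array,
    {"simple_aggregations": count_alt_alleles, "pca": count_variants},
)
print(results, len(calls))
```

Because the array is built in one place, a later change such as slicing to a maximum number of variants or samples would only need to be made in the factory.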

Additionally, I have disabled LD pruning when a GenotypeDaskArray is used; a warning message is printed in that case. LD pruning seems to work with Dask arrays, but it relies on a serial implementation, as discussed previously (https://github.com/ornl-oxford/genomics-benchmarks/issues/39#issuecomment-438835041). Because of this, I decided to disable it to avoid confusion when benchmarking on larger/distributed systems. In a future PR, the user config will gain an option to enable or disable LD pruning.
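A guard of this kind might look roughly like the sketch below. The two classes are empty stand-ins defined only for this example (they are not the real array types), and `maybe_ld_prune` and the `prune` callable are hypothetical names; the actual benchmark code may log the warning differently.

```python
import warnings


class GenotypeDaskArray:
    """Empty stand-in for the Dask-backed array type (illustration only)."""


class GenotypeArray:
    """Empty stand-in for an in-memory array type (illustration only)."""


def maybe_ld_prune(genotypes, prune):
    """Apply `prune` unless the array is Dask-backed; warn and skip otherwise."""
    if isinstance(genotypes, GenotypeDaskArray):
        warnings.warn(
            "LD pruning is disabled for GenotypeDaskArray; skipping.",
            RuntimeWarning,
        )
        return genotypes
    return prune(genotypes)


dask_backed = GenotypeDaskArray()
in_memory = GenotypeArray()

skipped = maybe_ld_prune(dask_backed, prune=lambda g: "pruned")
pruned = maybe_ld_prune(in_memory, prune=lambda g: "pruned")
print(skipped is dask_backed, pruned)
```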