GWAS Analysis
GWAS analysis research with feature comparisons, benchmarks, and ports of published analyses for several genetic data analysis platforms.
This work is primarily research for gwas-analysis-manuscript (which is only stub files and some notes at this point). The contents are summarized below, with relevant descriptions/links:
- Genomic Toolkit Comparison: This spreadsheet is not in the repo, but it provides important context for common genetic data toolkits
- notebooks/tutorial: Implementations of the GWAS tutorial in Marees et al. 2018 in Glow, Hail, and PLINK
- notebooks/organism/canine: This analysis reformulates the first half of the UKBB QC process in Bycroft et al. 2018 for a canine dataset published by Hayward et al. 2016. This analysis also uses a separate dataset from the NHGRI Dog Genome Project as an analog to the 1KG data often used in population stratification for QC (as is common with UKBB pipelines).
- notebooks/platform/dask: This library prototype benchmarks simple GWAS QC operations using Dask, as well as I/O with custom bit-packing and compression via Zarr. Dask gives large improvements in column-wise/row-wise operation times vs Hail/Glow and nearly matches the performance of PLINK on a single host (the others take at least ~5x longer).
- notebooks/benchmark/method: This directory contains genetic data processing methods implemented over Dask arrays (currently only LD pruning). These experiments also apply locality sensitive hashing, as well as an optimization using the triangle inequality, to LD pruning algorithms in an attempt to demonstrate representative workloads with Dask. The latter is very similar to scikit-allel's locate_unlinked but uses numba.jit compilation and dask.array.map_overlap.
- notebooks/platform/xarray: This prototype library gives a specification for a top level API over genetic data structures (it is complementary to skallel). The data_structures.ipynb notebook shows example use cases for this API.
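
For readers unfamiliar with the LD pruning workload mentioned under notebooks/benchmark/method, the following is a minimal NumPy sketch of greedy, windowed LD pruning. It is not the code in this repo: the function names (r_squared, ld_prune) and the parameters (size, step, threshold) are hypothetical and chosen for illustration only, and it uses neither numba nor Dask. It assumes genotypes are encoded as alternate allele counts in a variants x samples array.

```python
import numpy as np


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    xc = x - x.mean()
    yc = y - y.mean()
    denom = (xc @ xc) * (yc @ yc)
    if denom == 0:
        # A monomorphic variant has no variance; treat it as uncorrelated.
        return 0.0
    return float((xc @ yc) ** 2 / denom)


def ld_prune(gn, size=100, step=20, threshold=0.2):
    """Return a boolean mask of variants to keep.

    gn is a (variants x samples) array of alternate allele counts.
    Windows of `size` variants are visited in increments of `step`;
    within a window, the later variant of any retained pair whose
    r^2 exceeds `threshold` is dropped.
    """
    n_variants = gn.shape[0]
    keep = np.ones(n_variants, dtype=bool)
    for start in range(0, n_variants, step):
        stop = min(start + size, n_variants)
        for i in range(start, stop):
            if not keep[i]:
                continue
            for j in range(i + 1, stop):
                if keep[j] and r_squared(gn[i], gn[j]) > threshold:
                    keep[j] = False
    return keep


# Toy data: variants 1 and 3 are perfectly correlated with variant 0,
# while variant 2 is uncorrelated with it.
gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
    [2, 1, 0, 2, 1, 0],
])
keep = ld_prune(gn, size=4, step=2, threshold=0.5)
```

Because each window only reads variants within `size` rows of its start, a kernel of this shape can in principle be applied per chunk with a halo of neighboring rows, which is the kind of access pattern dask.array.map_overlap provides.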