For future reference, the variant info threshold (0.8) alone reduces the variant counts for 3 chromosomes as follows:
chrom | variants before | variants after |
---|---|---|
XY | 45906 | 15858 |
22 | 1255683 | 365260 |
21 | 1261158 | 364796 |
That's roughly 30% of variants kept, on average.
The code for this now works in two stages: first, the variant info filter is applied to create a more generically applicable dataset (with ~30% as many variants as the original bgen), and then the remaining NealeLab QC filters are applied to create analysis-ready datasets for each chromosome.
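As a rough illustration, the first stage amounts to something like the following (a minimal sketch assuming an sgkit-style dataset with a per-variant INFO score stored as `variant_info`; the paths follow this project's layout, but the exact variable and dimension names are assumptions):

```python
import numpy as np
import xarray as xr

# Open the rechunked bgen-derived zarr lazily (dask-backed arrays)
ds = xr.open_zarr("gs://rs-ukb/prep/gt-imputation/ukb_chr21.zarr")

# Stage 1: keep only variants passing the imputation INFO threshold.
# `variant_info` as the name of the per-variant INFO score is an assumption.
info = ds["variant_info"].values
keep = np.flatnonzero(info >= 0.8)

# Integer fancy-indexing leaves irregular dask chunks, so rechunk before writing
ds_info = ds.isel(variants=keep).chunk({"variants": 10_000})
ds_info.to_zarr("gs://rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr", mode="w")
```

For chr21, the archive sizes at each stage were: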
```
# rechunked bgen in optimized representation
 40.12 GiB  gs://rs-ukb/prep/gt-imputation/ukb_chr21.zarr
# after variant info filter (30% as many variants)
141.78 GiB  gs://rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr
# after all other QC filters
 72.86 GiB  gs://rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/input/gt-imputation/ukb_chr21.zarr
```
I'm not yet sure why the sizes of the archives with so many fewer variants are ballooning like this.
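One way to track down where the bytes are going would be to look at the compressed, on-disk size of each array in the zarr group directly (a sketch assuming `zarr` and `gcsfs`; `nbytes_stored` reports stored/compressed bytes per array):

```python
import gcsfs
import zarr

# Open the zarr group backing the post-info-filter dataset (read-only)
store = gcsfs.GCSFileSystem().get_mapper("gs://rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr")
group = zarr.open_group(store, mode="r")

# Print stored (compressed) size per array to see which variables dominate
for name, arr in sorted(group.arrays(), key=lambda kv: kv[1].nbytes_stored, reverse=True):
    print(f"{name}: {arr.nbytes_stored / 1e9:.2f} GB stored, dtype={arr.dtype}")
```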
This process is now described well by https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/f997a5f627a1f0be122528e99ebd089823ce14ac/rules/gwas_pipeline.smk.
This can be closed once I'm certain there isn't some particularly wasteful variable or poor configuration setting that is leading to such large resulting archives.
Removing the genotype probability array from the final post-QC archive reduces its size to about half of the original, rather than being ~70% larger. Compression is already enabled, so dropping the array is the only way to get the size down, and it should be fine since only dosages are required in subsequent steps.
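A minimal sketch of that step, assuming sgkit-style variable names (`call_genotype_probability` for the probabilities; treat the exact name as an assumption about this dataset's schema, and note the other QC filters are omitted here):

```python
import xarray as xr

ds = xr.open_zarr("gs://rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr")

# Drop the genotype probabilities before the final post-QC write; only the
# dosages are needed downstream, and compression alone won't recover this space.
ds_small = ds.drop_vars(["call_genotype_probability"], errors="ignore")
ds_small.to_zarr(
    "gs://rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/input/gt-imputation/ukb_chr21.zarr",
    mode="w",
)
```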
What's the concern with larger file size?
The difference would be close to 3.5 TB in GCS, which is cheap, but it takes virtually no effort to fix and there is no good reason to expand the scope of how that data gets used. The prior archive in the flow (post variant-info filter, pre all other filters) is the one that would more likely be usable in another context.
After some further updates, the data volumes associated with the chr21 QC steps and GWAS are currently:
```
 40.12 GiB  gs://rs-ukb/prep/gt-imputation/ukb_chr21.zarr
 22.39 GiB  gs://rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr
 25.35 GiB  gs://rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/input/gt-imputation/ukb_chr21.zarr
182.75 MiB  gs://rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/output/gt-imputation/ukb_chr21
```
It will be most cost-efficient to store a zarr equivalent to the bgen data as a project-agnostic dataset, but then apply the sample and variant QC filters to create a much smaller subset for downstream analysis. I am imagining a bucket structure like this:
An important decision point here, though, is whether we keep the zarrs split up by contig or write them out as one big dataset. The autosome archives would merge easily, but the X and XY data include differing numbers of samples. We could align them and export a unified dataset, but I think the simplest downstream GWAS usage will involve a good bit of conditional logic for allosomes anyhow.
I don't actually see many advantages to doing the merge upfront -- we'll have more flexibility and afaik it will make little difference to dask whether array chunks come from one zarr or many.
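For what it's worth, keeping per-contig zarrs costs very little at read time, since everything stays lazy until dask computes it (a sketch; the paths, contig list, and `variants` dimension name are assumptions for illustration):

```python
import xarray as xr

# Open per-contig zarrs lazily; dask sees the same chunked arrays whether
# they come from one store or many, so merging upfront buys little.
autosomes = ["21", "22"]  # illustrative subset
datasets = [
    xr.open_zarr(f"gs://rs-ukb/prep/gt-imputation-qc/ukb_chr{c}.zarr")
    for c in autosomes
]

# Autosomes share the same samples, so they concatenate cleanly along the
# variants dimension on demand; X/XY would need sample alignment first
# (or separate allosome-specific handling).
ds = xr.concat(datasets, dim="variants")
```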