Closed by eric-czech 3 years ago
I'm skipping this. It would only be necessary if you wanted to rechunk everything quickly, and there are limits to that anyway: the file downloads are slow, and you would have to download files of up to ~180 GB just to parse out small ranges. Rechunking everything in one day is still hypothetically possible, but it will never get much faster than that.
The current script supports variant ranges for a worker and could write out zarr archives like `ukb_chr{contig}_rng{start}-{stop}.zarr` instead of `ukb_chr{contig}.zarr`. This still needs to be wired up to the Snakemake rule though; a rough sketch of what that could look like is below.
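This is not the actual pipeline code, just a minimal sketch of how the range wildcards might be wired into a rule; the script path, flag names, and wildcard layout here are all hypothetical:

```python
# Hypothetical Snakemake rule; script name, flags, and paths are illustrative.
rule bgen_to_zarr_range:
    input:
        bgen="raw/ukb_chr{contig}.bgen"
    output:
        zarr=directory("prep/ukb_chr{contig}_rng{start}-{stop}.zarr")
    wildcard_constraints:
        start=r"\d+",
        stop=r"\d+"
    shell:
        "python scripts/bgen_to_zarr.py "
        "--input {input.bgen} --output {output.zarr} "
        "--contig {wildcards.contig} "
        "--start {wildcards.start} --stop {wildcards.stop}"
```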
My sense from running operations against the full zarr archives is that they will be impractical for any kind of analysis. For this project specifically, the sample / variant QC filters reduce the dataset in size by an order of magnitude, so it will make the most sense to write out a more analysis-specific subset (https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/issues/9). For that reason, having zarr archives specific to variant ranges shouldn't be an issue.
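As a rough illustration of why the range-specific archives shouldn't get in the way, the QC'd subset can be written out per range and concatenated later. This sketch assumes xarray-compatible zarr stores with hypothetical boolean QC variables `sample_qc_pass` and `variant_qc_pass` along `samples`/`variants` dimensions (the paths are illustrative too):

```python
import xarray as xr

# Open one range-specific archive
ds = xr.open_zarr("prep/ukb_chr21_rng0-10000.zarr")

# Apply the (hypothetical) sample/variant QC masks; this is what shrinks
# the dataset by roughly an order of magnitude
ds_qc = ds.isel(
    samples=ds.sample_qc_pass.values,
    variants=ds.variant_qc_pass.values,
)

# Write the analysis-specific subset back out as its own archive
ds_qc.to_zarr("prep/ukb_chr21_rng0-10000_qc.zarr", mode="w")
```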