sgkit-dev / vcf-zarr-spec

VCF Zarr Specification

Recommendations about chunk sizes #22

Open jeromekelleher opened 3 months ago

jeromekelleher commented 3 months ago

We currently say nothing at all about chunk sizes, but I think we will need to provide some rules/guidance in order to make processing arrays efficient. For example, it really does help if call-level arrays all have the same chunking (in the variants and samples dimensions), so that code can read in (say) genotypes and DP values chunk-by-chunk in the same loop, as in the sketch below.
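A minimal sketch of that access pattern, assuming a hypothetical store path `example.vcz` containing `call_genotype` and `call_DP` arrays chunked identically along the variants dimension (the array names follow the spec's conventions, but the store and path are made up):

```python
import zarr

# Open a VCF Zarr store (hypothetical path) and pick two call-level arrays
# that are assumed to share the same chunking in the variants dimension.
root = zarr.open("example.vcz", mode="r")
gt = root["call_genotype"]
dp = root["call_DP"]

variants_chunk_size = gt.chunks[0]
num_variants = gt.shape[0]

for start in range(0, num_variants, variants_chunk_size):
    stop = min(start + variants_chunk_size, num_variants)
    # Because both arrays share the variants chunking, each slice below
    # maps onto exactly one chunk per array, so the two fields can be
    # processed together chunk-by-chunk in a single loop.
    gt_chunk = gt[start:stop]
    dp_chunk = dp[start:stop]
    # ... process gt_chunk and dp_chunk together ...
```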

Currently vcf2zarr enforces a uniform chunk size across dimensions, so that we have a single variants_chunk_size. While this is a useful simplification, it does have some drawbacks, particularly when we want to read in all of a low-dimensional array at once (e.g., variant_position), as illustrated below. See #21 for discussion and some benchmarks on this point.
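A rough illustration of that drawback, assuming the uniform chunking is also applied to the 1-D variant_position field (store path again hypothetical):

```python
import zarr

# With a uniform variants_chunk_size, a small 1-D field like variant_position
# is split into the same number of chunks as the big call-level arrays.
root = zarr.open("example.vcz", mode="r")
pos = root["variant_position"]
print(pos.nchunks)      # many small chunks under uniform chunking
all_positions = pos[:]  # reading the whole field touches every one of them
```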

This would need some feedback from a variety of implementations and use-cases, I think.

tomwhite commented 1 month ago

Fields like variant_position and variant_contig are essentially coordinate indexes, so there is a case for storing them in a single chunk since all values need to be accessible at once. (Xarray for example reads all coordinates into memory.)

I don't know of any cases where they need to have the same chunking as other variant fields, but if there are any, it should be straightforward to rechunk a single chunk into lots of chunks (easier than the reverse of reading lots of chunks into a single array, as #21 showed); see the sketch below.
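A rough sketch of that rechunking direction using plain zarr-python (not vcf2zarr's API); the target chunk size and the output array name are assumptions for illustration:

```python
import zarr

# Rechunk a single-chunk coordinate field into many chunks by reading it all
# into memory (the easy direction) and writing it back with a new chunk shape.
root = zarr.open("example.vcz", mode="r+")
src = root["variant_position"]          # assumed to be stored as one chunk
values = src[:]                         # a single read pulls in everything

dst = root.create_dataset(
    "variant_position_rechunked",       # hypothetical name for illustration
    shape=src.shape,
    chunks=(10_000,),                   # e.g. match the variants_chunk_size
    dtype=src.dtype,
)
dst[:] = values
```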

@jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

jeromekelleher commented 1 month ago

> @jeromekelleher, how easy would it be to adapt vcf2zarr to store these fields in single chunk Zarr arrays?

Very easy. But let's consider the limit cases here first: what size array would we need to store an entire human genome where we've got calls at every base? (We're approaching this limit with large datasets.)

For 3.1 Gb (3.1 billion positions, each a 4-byte int32) we get a variant_position array of around 12 GB - so reading that in as a single chunk just isn't feasible, and certainly not as a low-latency way of getting at small chunks of data. We will have to tackle proper indexing of the (contig, position) values at some point (#21, #23), and I think it's probably best if we do so now to support vcztools view.
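A back-of-envelope check of the 12 GB figure, assuming one int32 position per called base:

```python
# ~3.1 billion called sites, one 4-byte int32 position each
num_sites = 3_100_000_000
bytes_per_int32 = 4
print(num_sites * bytes_per_int32 / 1e9)  # ~12.4 GB
```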

tomwhite commented 1 month ago

I agree. Let's use vcztools view to try different implementations - standardization can come later.