sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
232 stars 32 forks source link

Create bitpacking filter for biallelic diploid datasets #80

Open eric-czech opened 4 years ago

eric-czech commented 4 years ago

We should add a bitpacking numcodecs filter for biallelic diploid data since it makes for a substantial improvement over zarr's default compression. Here's an example: https://nbviewer.jupyter.org/github/related-sciences/gwas-analysis/blob/master/notebooks/platform/xarray/io.ipynb (at the end, results are ~%40 smaller).

That same filter won't work since calls are in a 3D array now, so we'll have to rethink the packing scheme.

hammer commented 4 years ago

@alimanfoo @daletovar we thought this task might be a good one for Quansight to pick up. Any interest?

jeromekelleher commented 4 years ago

Sounds like a good idea to me. This does sound like something that should be in numcodecs, though, right?

Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

eric-czech commented 4 years ago

This does sound like something that should be in numcodecs, though, right?

Possibly? An alternative that made less sense when we were working with PLINK data directly (as alternate allele counts) would be to encode our calls as booleans and use PackBits as is. This is pretty straightforward and putting a little thought into the syntax would be something like:

# Convert  unphased diploid biallelic calls to 2-bit bools
hap0, hap1 = ds.call_genotype[..., 0], ds.call_genotype[..., 1]
bit0 = (hap0 < 0) | (hap0 != hap1) # Missing or heterozygous
bit1 = hap0 == 0 # Homozygous major or minor
ds['call_genotype_bits'] = xr.concat([bit0, bit1], dim='bits')
xr.to_zarr(ds, encoding={'call_genotype_bits': {'filters': [PackBits()]}})

I suspect there's a way to do it in a single pass without creating two in-memory arrays and one more generic interface that numcodecs could support to cover this would be to have a bit packing filter optimized for 8-bit integers in a pre-specified range (something like -1 to 2 in this case for the 4 PLINK call states). We could call a function like https://github.com/pystatgen/sgkit/issues/85 first, then feed it to this filter. Decoding back to the 3D format would still be non-trivial but we'll probably need a function for that at some point anyhow.

Also, it doesn't feel like this is on the critical path for getting an initial release done - it will make our on-file storage a bit more compact, but won't affect anything else much, right?

I'm probably biased in trying to optimize for https://github.com/pystatgen/sgkit/issues/67 but I think there may be an argument in improving storage for data specific to PLINK since our current format with the separate genotype mask and extra ploidy dimension is so wasteful by comparison. I'm not sure if the 50%+ savings in storage and network transfers is enough to call it critical for everyone though.

daletovar commented 4 years ago

@hammer, yes I could add something like this if @alimanfoo gives the okay. @eric-czech's idea looks reasonable to me.

jeromekelleher commented 4 years ago

I think this is a useful feature we will want sooner or later - we can decide later where it should live. In the absence of @alimanfoo, I think it'd be a good thing for @daletovar to look at.