sgkit-dev / sgkit-bgen

NOW ARCHIVED! BGEN IO implementations for sgkit. With https://github.com/pystatgen/sgkit/issues/65, now maintained and developed in the sgkit repo.

Create utilities or add examples that demonstrate how to store bgen zarr efficiently #14

Open · eric-czech opened 4 years ago

eric-czech commented 4 years ago

Similar to https://github.com/pystatgen/sgkit/issues/80, we need a better way to store genotype probabilities and/or dosages from bgen. UKB chromosomes stored as 32-bit dosages in zstd-compressed Zarr stores are almost twice as large as the original bgen files, and because the probabilities are dropped they carry even less information.

Using a FixedScaleOffset filter like this looks to be equivalent to what's described in the v1 bgen spec (see "Probability data storage"):

import numcodecs

numcodecs.FixedScaleOffset(
    offset=0,
    scale=32_768,     # 2**15
    dtype="float16",  # decoded data type (doesn't have to be 16-bit)
    astype="uint16",  # encoded data type
)
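
To make that concrete, here's a rough sketch of passing the filter to Zarr directly (the array shape, chunking, and compressor settings are made up, and float32 is used as the decoded dtype since it doesn't have to be 16-bit):

import numcodecs
import numpy as np
import zarr

# Illustrative probabilities with shape (variants, samples, genotypes)
probs = np.random.uniform(0, 1, size=(1_000, 100, 3)).astype("float32")

z = zarr.array(
    probs,
    chunks=(100, 100, 3),
    filters=[
        numcodecs.FixedScaleOffset(
            offset=0, scale=32_768, dtype="float32", astype="uint16"
        )
    ],
    compressor=zarr.Blosc(cname="zstd", clevel=5, shuffle=2),
)

# Quantization error from the filter is roughly 1 / 2**16 per value
assert np.allclose(z[:], probs, atol=2 ** -15)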

The choice of a 2**15 scale is deliberate, even with uint16, to allow probabilities to take values in [0, 2). The v2 spec uses a parameterized scale factor and assumes probabilities are in [0, 1] instead. The rounding is a little more complicated in the v2 case, but the difference is unlikely to matter, and it would also require having all genotype probabilities for a call available in order to do the rounding.

Lastly, the bgen paper shows that using 8-bit probabilities (256 scale factor) leads to virtually indistinguishable downstream results in association testing vs 20-bit probabilities, but similarity degrades quickly below 8 bits. The results section concludes by saying:

Given the findings above we recommend the use of b=8, 12 or 16 bits for imputed data, with b=8 providing a particularly good balance between implementation efficiency, accuracy, and overall file size
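
As a rough sanity check (not a result from the paper; a scale of 2**b - 1 is used here so the encoded values stay within b bits), the worst-case quantization error at different bit depths is easy to measure with numcodecs:

import numcodecs
import numpy as np

probs = np.random.uniform(0, 1, size=100_000).astype("float32")

for bits, astype in [(8, "uint8"), (12, "uint16"), (16, "uint16")]:
    codec = numcodecs.FixedScaleOffset(
        offset=0, scale=2 ** bits - 1, dtype="float32", astype=astype
    )
    roundtripped = codec.decode(codec.encode(probs))
    # Worst-case error is ~1 / (2 * (2**bits - 1))
    print(bits, np.abs(roundtripped - probs).max())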

eric-czech commented 4 years ago

Note: There are also several ways to encode values with Xarray before compressing them. Numcodecs is one option, but its biggest limitation is that the filters don't support nan values; they simply map them to some other valid value (e.g. 0).
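
A quick way to see this (the exact integer nan gets cast to is platform/NumPy-version dependent, so treat it as illustrative):

import numcodecs
import numpy as np

codec = numcodecs.FixedScaleOffset(offset=0, scale=254, dtype="float32", astype="uint8")
x = np.array([0.0, 0.5, np.nan], dtype="float32")
# nan is cast to an arbitrary integer during encoding (typically 0), so it
# can't be distinguished from a real probability after decoding
print(codec.decode(codec.encode(x)))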

Xarray supports numcodecs filters by passing them through to Zarr, but it also supports encoding floating-point values directly via any scheme compliant with the CF Conventions. An example with bgen would look like this:

import zarr

# ds: the dataset returned by the bgen reader
compressor = zarr.Blosc(cname="zstd", clevel=5, shuffle=2)
encoding = {v: {"compressor": compressor} for v in ds}
# Original values are encoded as (value - add_offset) / scale_factor
# See: https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html
encoding["call_genotype_probability"] = dict(
    compressor=compressor,
    dtype="uint8",
    _FillValue=0,
    add_offset=-1.0 / 254.0,
    scale_factor=1.0 / 254.0,
)
ds.to_zarr(..., encoding=encoding)

The above would map floats in [0, 1] to [1, 255], reserve 0 for nan, and write out as uint8.

See: https://xarray.pydata.org/en/stable/io.html#writing-encoded-data
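
For completeness, a made-up round-trip sketch of the same idea (dataset contents, path, and shape are arbitrary); with xarray's default mask_and_scale decoding, the fill value comes back as nan and the probabilities are rescaled into [0, 1]:

import numpy as np
import xarray as xr
import zarr

# Stand-in for the dataset produced by the bgen reader
probs = np.random.uniform(0, 1, size=(10, 5, 3)).astype("float32")
probs[0, 0, :] = np.nan
ds = xr.Dataset(
    {"call_genotype_probability": (("variants", "samples", "genotypes"), probs)}
)

compressor = zarr.Blosc(cname="zstd", clevel=5, shuffle=2)
encoding = {
    "call_genotype_probability": dict(
        compressor=compressor,
        dtype="uint8",
        _FillValue=0,
        add_offset=-1.0 / 254.0,
        scale_factor=1.0 / 254.0,
    )
}
ds.to_zarr("probs.zarr", mode="w", encoding=encoding)

decoded = xr.open_zarr("probs.zarr").call_genotype_probability.values
assert np.isnan(decoded[0, 0]).all()
# Remaining error is bounded by roughly half a quantization step (~1/508)
print(float(np.nanmax(np.abs(decoded - probs))))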