sgkit-dev / sgkit-bgen

NOW ARCHIVED! BGEN IO implementations for sgkit. With https://github.com/pystatgen/sgkit/issues/65, now maintained and developed in the sgkit repo.

Create utilities or add examples that demonstrate how to store bgen zarr efficiently #14

Open · eric-czech opened 4 years ago

eric-czech commented 4 years ago

Similar to https://github.com/pystatgen/sgkit/issues/80, we need a better way to store genotype probabilities and/or dosages from bgen. UKB chromosomes stored as 32-bit dosages in zstd-compressed Zarr stores are almost twice as large as the original bgen files, and because the probabilities are dropped they carry even less information.

Using a FixedScaleOffset filter like this looks to be equivalent to what's described in the v1 bgen spec (see "Probability data storage"):

import numcodecs

numcodecs.FixedScaleOffset(
    offset=0,
    scale=32_768,     # 2**15
    dtype="float16",  # decoded data type (doesn't have to be 16-bit)
    astype="uint16",  # encoded data type
)
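
To make that concrete, here's a rough sketch of passing the filter to Zarr directly (the array shape, chunking, and compressor settings are made up, and float32 is used as the decoded dtype since it doesn't have to be 16-bit):

import numcodecs
import numpy as np
import zarr

# Illustrative probabilities with shape (variants, samples, genotypes)
probs = np.random.uniform(0, 1, size=(1_000, 100, 3)).astype("float32")

z = zarr.array(
    probs,
    chunks=(100, 100, 3),
    filters=[
        numcodecs.FixedScaleOffset(
            offset=0, scale=32_768, dtype="float32", astype="uint16"
        )
    ],
    compressor=zarr.Blosc(cname="zstd", clevel=5, shuffle=2),
)

# Quantization error from the filter is roughly 1 / 2**16 per value
assert np.allclose(z[:], probs, atol=2 ** -15)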

The choice of a 2**15 scale is deliberate, even with uint16, to allow probabilities to take values in [0, 2). The v2 spec uses a parameterized scale factor and assumes probabilities are in [0, 1] instead. The rounding is a little more complicated in the v2 case, but the difference is unlikely to matter, and it would also require having all genotype probabilities for a call available in order to do the rounding.

Lastly, the bgen paper shows that using 8-bit probabilities (256 scale factor) leads to virtually indistinguishable downstream results in association testing vs 20-bit probabilities, but similarity degrades quickly below 8 bits. The results section concludes by saying:

Given the findings above we recommend the use of b=8, 12 or 16 bits for imputed data, with b=8 providing a particularly good balance between implementation efficiency, accuracy, and overall file size
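
As a rough sanity check (not a result from the paper; a scale of 2**b - 1 is used here so the encoded values stay within b bits), the worst-case quantization error at different bit depths is easy to measure with numcodecs:

import numcodecs
import numpy as np

probs = np.random.uniform(0, 1, size=100_000).astype("float32")

for bits, astype in [(8, "uint8"), (12, "uint16"), (16, "uint16")]:
    codec = numcodecs.FixedScaleOffset(
        offset=0, scale=2 ** bits - 1, dtype="float32", astype=astype
    )
    roundtripped = codec.decode(codec.encode(probs))
    # Worst-case error is ~1 / (2 * (2**bits - 1))
    print(bits, np.abs(roundtripped - probs).max())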

eric-czech commented 4 years ago

Note: There are also several ways to encode values with Xarray before compressing them. Numcodecs is one option, but its biggest limitation is that the filters don't support nan values; they simply map them to some other valid value (e.g. 0).
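
A quick way to see this (the exact integer nan gets cast to is platform/NumPy-version dependent, so treat it as illustrative):

import numcodecs
import numpy as np

codec = numcodecs.FixedScaleOffset(offset=0, scale=254, dtype="float32", astype="uint8")
x = np.array([0.0, 0.5, np.nan], dtype="float32")
# nan is cast to an arbitrary integer during encoding (typically 0), so it
# can't be distinguished from a real probability after decoding
print(codec.decode(codec.encode(x)))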

Xarray supports numcodecs filters by passing them through to Zarr, but it also supports encoding floating-point values directly via any scheme compliant with the CF Conventions. An example with bgen would look like this:

import zarr

# ds: the dataset returned by the bgen reader
compressor = zarr.Blosc(cname="zstd", clevel=5, shuffle=2)
encoding = {v: {"compressor": compressor} for v in ds}
# Original values are encoded as (value - add_offset) / scale_factor
# See: https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html
encoding["call_genotype_probability"] = dict(
    compressor=compressor,
    dtype="uint8",
    _FillValue=0,
    add_offset=-1.0 / 254.0,
    scale_factor=1.0 / 254.0,
)
ds.to_zarr(..., encoding=encoding)

The above would map floats in [0, 1] to [1, 255], reserve 0 for nan, and write out as uint8.

See: https://xarray.pydata.org/en/stable/io.html#writing-encoded-data
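
For completeness, a made-up round-trip sketch of the same idea (dataset contents, path, and shape are arbitrary); with xarray's default mask_and_scale decoding, the fill value comes back as nan and the probabilities are rescaled into [0, 1]:

import numpy as np
import xarray as xr
import zarr

# Stand-in for the dataset produced by the bgen reader
probs = np.random.uniform(0, 1, size=(10, 5, 3)).astype("float32")
probs[0, 0, :] = np.nan
ds = xr.Dataset(
    {"call_genotype_probability": (("variants", "samples", "genotypes"), probs)}
)

compressor = zarr.Blosc(cname="zstd", clevel=5, shuffle=2)
encoding = {
    "call_genotype_probability": dict(
        compressor=compressor,
        dtype="uint8",
        _FillValue=0,
        add_offset=-1.0 / 254.0,
        scale_factor=1.0 / 254.0,
    )
}
ds.to_zarr("probs.zarr", mode="w", encoding=encoding)

decoded = xr.open_zarr("probs.zarr").call_genotype_probability.values
assert np.isnan(decoded[0, 0]).all()
# Remaining error is bounded by roughly half a quantization step (~1/508)
print(float(np.nanmax(np.abs(decoded - probs))))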