sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

maximise lossless compression of vcf_to_zarr #925

Open elswob opened 2 years ago

elswob commented 2 years ago

Moving and storing large VCF files is a problem. One option is to transform them into a more compact data format that contains enough data to fully restore the VCF at a later date (see https://github.com/pystatgen/sgkit/issues/924). The default vcf_to_zarr reduces the size of the data significantly; however, when extra fields are included, e.g. ["INFO/*", "FORMAT/*"], the resulting Zarr structure can be larger than the original VCF.
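For reference, a minimal sketch of such a conversion (the paths here are placeholders; the field list follows the example above):

from sgkit.io.vcf import vcf_to_zarr

# Convert a VCF to Zarr, keeping every INFO and FORMAT field so that the
# original VCF could be reconstructed later. Paths are placeholders.
vcf_to_zarr(
    input="input.vcf.gz",
    output="output.zarr",
    fields=["INFO/*", "FORMAT/*"],
)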

tomwhite commented 2 years ago

Hi @elswob, I did some work on this last year and found that on 1000 genomes chr22 (GT sparsity = 3.7%) I could store the extra fields at 46% of the size of BCF (i.e. about half the size).

You can see the details in slide 10 of this presentation, and the code is here. Using the zstd compressor helped a lot I think.
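For illustration, a sketch of the zstd compressor setup (the specific clevel and shuffle values are assumptions, matching the settings used later in this thread):

from numcodecs import Blosc

# Blosc wrapping the zstd codec, passed to vcf_to_zarr via its compressor argument.
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)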

tomwhite commented 2 years ago

BTW there's also a summary on https://github.com/tomwhite/ga4gh-variant-comparison

jeromekelleher commented 2 years ago

Have you got an example where we don't do so well on compression, @elswob? I'm guessing it's some particular INFO or FORMAT field that's not being dealt with well.

elswob commented 2 years ago

Hi both, thanks for the comments. Have you tried genozip (https://genozip.readthedocs.io/index.html)? I'd be interested to see how that compares.

I can confirm similar numbers using sgkit for the 1000 genomes data used in the comparison. The file I'm seeing strange numbers with is a Beagle imputation output VCF with around 1 million variants and 100 samples, based on 1000 genomes data, and around 40 MB in size. If I run it with the params here I get a 138 MB Zarr directory.

jeromekelleher commented 2 years ago

Hmm, I wonder if it's the genotype probability field that's taking up the space. Can you do a du -sh on the Zarr directory please, so we can see the size breakdown across the columns?

elswob commented 2 years ago

128M    call_DS
1.2M    call_genotype
144K    call_genotype_mask
140K    call_genotype_phased
 12K    sample_id
3.5M    variant_AF
920K    variant_DR2
 76K    variant_IMP
2.7M    variant_allele
 76K    variant_contig
 76K    variant_id
 76K    variant_id_mask
1.2M    variant_position

jeromekelleher commented 2 years ago

There we go - I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

tomwhite commented 2 years ago

> I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.

It will be a 32-bit float if it's a VCF float.

@elswob have you set max_alt_alleles in the call to vcf_to_zarr? We are storing fixed-length arrays in Zarr, so the genotypes dimension can grow rapidly with max_alt_alleles, even if it's mainly filled with empty values. One thing I wondered about is using Zarr ragged arrays for cases like this, although I'm not sure how well they play with Xarray, or indeed how efficient they are to process. (Note there was some discussion about Zarr ragged arrays in #634, but here we want to store floats, so the problem discussed there about storing ragged arrays of strings doesn't arise.)

Could you share the Xarray Dataset repr so we can see the dimensions and dataset attributes (e.g. max_alt_alleles_seen)?

Another thought I had is trying out codecs like Quantize or Bitround. They are lossy though.
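For illustration, a sketch of what those lossy filters could look like in an encoding passed to vcf_to_zarr (the digits/keepbits values and the choice of fields are assumptions; numcodecs spells the second codec BitRound):

from numcodecs import BitRound, Quantize

# Both codecs are lossy: Quantize keeps a fixed number of decimal digits,
# BitRound keeps a fixed number of mantissa bits. Values are illustrative.
encoding = {
    "call_DS": {"filters": [Quantize(digits=3, dtype="f4")]},
    "variant_AF": {"filters": [BitRound(keepbits=8)]},
}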

elswob commented 2 years ago

@tomwhite I have not set max_alt_alleles in the call. Just tried with defaults, and the parameters in the example you provided.

[Screenshot 2022-10-07 at 12:39:48: Xarray Dataset repr, showing max_alt_alleles_seen = 1]

tomwhite commented 2 years ago

@elswob thanks for the screenshot. The fact that max_alt_alleles_seen is 1 means you could set max_alt_alleles to 1 (for this dataset at least). It might be worth re-running to see what effect this has on the size of call_DS in this case. The number of genotypes will go from 10 to 3 if my maths is correct.
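To spell out the arithmetic (a diploid call with n alleles has n * (n + 1) / 2 possible genotypes, and max_alt_alleles=1 means 2 alleles):

# 10 genotypes corresponds to 4 alleles (the current setting); 2 alleles gives 3.
for n_alleles in (4, 2):
    print(n_alleles, "alleles ->", n_alleles * (n_alleles + 1) // 2, "genotypes")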

elswob commented 2 years ago

Thanks, that helps a bit, down to 75 MB using these params:

fields=["INFO/*", "FORMAT/*"], 
max_alt_alleles=1,

(I tried the zstd compressor but that resulted in larger output)

[Screenshot 2022-10-07 at 13:18:49]

 65M    call_DS
628K    call_genotype
436K    call_genotype_mask
436K    call_genotype_phased
 12K    sample_id
3.1M    variant_AF
932K    variant_DR2
436K    variant_IMP
1.8M    variant_allele
436K    variant_contig
436K    variant_id
436K    variant_id_mask
1.4M    variant_position

tomwhite commented 2 years ago

Thanks. Uncompressed, the DS field would take 406 MB, so that's a 6X compression factor with Zarr. I'm still puzzled why gzipped VCF (I assume that's what you are comparing to) is smaller, though. Can you share more info about sparsity, typical values of the DS field, etc.? Or maybe post your script to generate the VCF from 1000 genomes chr22?
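Back-of-the-envelope, assuming ~1 million variants, 100 samples and one float32 dosage value per call:

# Rough uncompressed size of call_DS; the variant count is approximate.
n_variants, n_samples, bytes_per_value = 1_000_000, 100, 4
print(n_variants * n_samples * bytes_per_value / 1e6, "MB")  # ~400 MB, vs 65 MB in Zarr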

jeromekelleher commented 2 years ago

> I'm still puzzled why gzipped VCF is smaller though

I assume these are two digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information. I think they need to be treated with some specific filters to get good performance.

tomwhite commented 2 years ago

> I assume these are two digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information.

Yes, that makes sense.

> I think they need to be treated with some specific filters to get good performance.

Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

I'd be interested in trying out different filters if I could get some representative data to try them on.

jeromekelleher commented 2 years ago

> Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?

Yes, these are the types of things that should help, I would imagine. There's a relatively small number of discrete values being stored (100 or 1000 or so), and filters like this should map them into the right integer range so they can be stored with one byte rather than four.
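As an illustration of the one-byte idea, here is a round-trip through numcodecs' FixedScaleOffset (the filter that ends up being used further down this thread; the dosage values below are made up):

import numpy as np
from numcodecs import FixedScaleOffset

# Hypothetical two-decimal dosage values, like those stored in call_DS.
ds = np.array([0.0, 0.03, 0.97, 1.0, 2.0], dtype="f4")

# Scale by 100 and store as uint8: one byte per value instead of four.
codec = FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")
packed = codec.encode(ds)        # uint8 array, 5 bytes instead of 20
restored = codec.decode(packed)  # float32 again, exact to two decimal places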

I had a quick look around, but can't find any imputed datasets lying around. Any thoughts on where we'd find something representative @elswob?

tomwhite commented 2 years ago

I now have a representative VCF that is 17MB in size (compressed).

Running vcf_to_zarr on it with the default settings:

vcf_to_zarr(input=vcf_file, fields=["INFO/*", "FORMAT/*"], output=output, max_alt_alleles=1)

produced Zarr files totalling 37MB in size, as @elswob reported.

The following reduced the size to 14MB:

from numcodecs import BZ2, Blosc, Delta, FixedScaleOffset

from sgkit.io.vcf import vcf_to_zarr

# zstd-based Blosc compressor, used for every array
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)
encoding = {
    "variant_position": {
        # POS is (mostly) monotonically increasing, so store differences
        "filters": [Delta(dtype="i4", astype="i4")],
        "compressor": compressor,
    },
    "variant_AF": {
        # keep 4 decimal places, stored as uint16
        "filters": [FixedScaleOffset(offset=0, scale=10000, dtype="f4", astype="u2")],
        "compressor": compressor,
    },
    "call_DS": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
    "variant_DR2": {
        # keep 2 decimal places, stored as uint8
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
}
vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=compressor,
    encoding=encoding,
)

There are three things here that helped:

  1. Use FixedScaleOffset filters on float fields, so only a few decimal places are stored. (I didn't try Quantize or Bitround, but the effect would probably be similar.)
  2. Use Delta on POS, which is a monotonically increasing value (except at chromosome boundaries), so it benefits from having this filter to store differences (see the sketch after this list).
  3. Increase the chunk size in the variants dimension from 10,000 to 500,000. This depends a lot on the number of samples, which in this case was very small (~10).
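For illustration, here is what the Delta filter does to positions (the coordinate values below are made up):

import numpy as np
from numcodecs import Delta

# Consecutive positions differ by small amounts, which compress much better
# than the absolute coordinates.
pos = np.array([16050075, 16050115, 16050213, 16050319], dtype="i4")
print(Delta(dtype="i4", astype="i4").encode(pos))  # first value, then diffs: 40, 98, 106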

I also tried using bzip2 compression:

compressor = BZ2(9)

With this (and reducing the number of decimal places for AF to 2), I got the size down to 8.3MB, albeit with a much slower decompression speed. This is about 25% larger than genozip (which also uses bzip2).

I think there are some things we can incorporate into the defaults (e.g. POS Delta filter), and others we should document (e.g. storing float fields).

tomwhite commented 2 years ago

@ravwojdyla pointed out that #80 is related.

tomwhite commented 2 years ago

With #943 I was able to get the Zarr size down to about 16% larger than genozip on the test file (using bzip2).