elswob opened this issue 2 years ago:

Moving and storing large VCF files is a problem. One option is to transform them into a more compact data format that contains enough data to fully restore the VCF at a later date (see https://github.com/pystatgen/sgkit/issues/924). The default vcf_to_zarr reduces the size of the data significantly; however, when extra fields are included, e.g. ["INFO/*", "FORMAT/*"], the resulting Zarr structure can be larger than the original VCF.
Hi @elswob, I did some work on this last year and found that on 1000 genomes chr22 (GT sparsity = 3.7%) I could store the extra fields at 46% of the size of BCF (i.e. about half the size).
You can see the details in slide 10 of this presentation, and the code is here. Using the zstd compressor helped a lot I think.
BTW there's also a summary on https://github.com/tomwhite/ga4gh-variant-comparison
Have you got an example where we don't do so well on compression, @elswob? I'm guessing it's some particular INFO or FORMAT field that's not being dealt with well.
Hi both, thanks for the comments. Have you tried genozip (https://genozip.readthedocs.io/index.html)? I'd be interested to see how that compares.
I can confirm similar numbers using sgkit for the 1000 genomes data used in the comparison. The file I'm seeing strange values with is a beagle imputation output VCF for around 1 million variants and 100 samples based on 1000 genomes data, around 40 MB. If I run it with the params here I get a 138 MB Zarr directory.
Hmm, I wonder if it's the genotype probability field that's taking up the space. Can you do a du -sh on the Zarr directory please, to see what the size breakdown of the columns is?
128M call_DS
1.2M call_genotype
144K call_genotype_mask
140K call_genotype_phased
12K sample_id
3.5M variant_AF
920K variant_DR2
76K variant_IMP
2.7M variant_allele
76K variant_contig
76K variant_id
76K variant_id_mask
1.2M variant_position
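(Aside: if du isn't handy, roughly the same per-array breakdown can be computed with the Python standard library. This is just a sketch that walks a local Zarr directory on disk; "imputed.zarr" is a placeholder for whatever path vcf_to_zarr wrote to.)

from pathlib import Path

# Sketch: sum on-disk bytes for each top-level array in a local Zarr store.
# "imputed.zarr" is a placeholder path, not one used elsewhere in this thread.
store = Path("imputed.zarr")
for array_dir in sorted(p for p in store.iterdir() if p.is_dir()):
    nbytes = sum(f.stat().st_size for f in array_dir.rglob("*") if f.is_file())
    print(f"{nbytes / 1e6:8.1f} MB  {array_dir.name}")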
There we go - I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.
> I bet we're storing call_DS as a 32 or 64 bit float which isn't compressing very well.
It will be a 32-bit float if it's a VCF float.
@elswob have you set max_alt_alleles in the call to vcf_to_zarr? We are storing fixed length arrays in Zarr, so the genotypes dimension can grow rapidly with max_alt_alleles, even if it's mainly filled with empty values. One thing I wondered about is using Zarr ragged arrays for cases like this. I'm not sure how well it plays with Xarray, or indeed how efficient it is when processing though. (Note there was some discussion about Zarr ragged arrays in #634, but here we want to store floats, so the problem discussed there about storing ragged arrays of strings doesn't arise.)
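(For anyone unfamiliar with the ragged-array idea mentioned above, here is a minimal sketch using Zarr v2 with the numcodecs VLenArray object codec, following the pattern in the Zarr tutorial. It only illustrates the concept; it is not how sgkit stores these fields today.)

import numpy as np
import zarr
from numcodecs import VLenArray

# Sketch (Zarr v2 + numcodecs): one variable-length float32 vector per variant,
# instead of a fixed-length array padded out to max_alt_alleles.
z = zarr.empty(3, dtype=object, object_codec=VLenArray("f4"))
z[0] = np.array([0.1], dtype="f4")
z[1] = np.array([0.2, 0.7], dtype="f4")
z[2] = np.array([0.0, 1.5, 2.0], dtype="f4")
print(z[:])  # three rows of different lengths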
Could you share the Xarray Dataset repr so we can see the dimensions and dataset attributes (e.g. max_alt_alleles_seen)?
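(Something like the following should show that repr. It assumes the store was written by vcf_to_zarr, that sgkit.load_dataset is used to open it, and that max_alt_alleles_seen ends up in the dataset attributes; `output` is a placeholder for the path used above.)

import sgkit

# Sketch: open the Zarr store written by vcf_to_zarr and inspect it.
ds = sgkit.load_dataset(output)
print(ds)                                    # dimensions, variables, attrs
print(ds.attrs.get("max_alt_alleles_seen"))  # assuming it is recorded as an attr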
Another thought I had is trying out codecs like Quantize or Bitround. They are lossy though.
@tomwhite I have not set max_alt_alleles in the call. Just tried with defaults, and the parameters in the example you provided.
@elswob thanks for the screenshot. The fact that max_alt_alleles_seen is 1 means you could set max_alt_alleles to 1 (for this dataset at least). Might be worth re-running to see what effect this has on the size of call_DS in this case. The number of genotypes will go from 10 to 3 if my maths is correct.
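(Quick check of that arithmetic as a sketch, assuming the unset default allows 3 alt alleles, which is what the 10 above implies.)

# Unphased diploid genotype count for A alleles (REF + ALTs) is A * (A + 1) / 2.
def n_genotypes(max_alt_alleles):
    n_alleles = max_alt_alleles + 1
    return n_alleles * (n_alleles + 1) // 2

print(n_genotypes(3))  # 10 genotypes when 3 alt alleles are allowed
print(n_genotypes(1))  # 3 genotypes with max_alt_alleles=1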
Thanks, that helps a bit, down to 75 MB using these params:
fields=["INFO/*", "FORMAT/*"],
max_alt_alleles=1,
(I tried the zstd compressor but that resulted in larger output)
65M call_DS
628K call_genotype
436K call_genotype_mask
436K call_genotype_phased
12K sample_id
3.1M variant_AF
932K variant_DR2
436K variant_IMP
1.8M variant_allele
436K variant_contig
436K variant_id
436K variant_id_mask
1.4M variant_position
Thanks. Uncompressed, the DS field would take 406 MB, so that's a 6X compression factor with Zarr. I'm still puzzled why gzipped VCF (I assume that's what you are comparing to) is smaller though. Can you share more info about sparsity, typical values of the DS field, etc? Or maybe post your script to generate the VCF from 1000 genomes chr22?
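(The 406 MB figure is roughly consistent with a back-of-envelope calculation using the approximate numbers quoted earlier in the thread.)

# Back-of-envelope: ~1 million variants x 100 samples x 4 bytes per float32 value.
n_variants, n_samples, bytes_per_value = 1_000_000, 100, 4
uncompressed = n_variants * n_samples * bytes_per_value
print(uncompressed / 1e6)   # ~400 MB of uncompressed call_DS
print(uncompressed / 65e6)  # ~6x compression vs the 65M call_DS on disk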
> I'm still puzzled why gzipped VCF is smaller though
I assume these are two digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information. I think they need to be treated with some specific filters to get good performance.
> I assume these are two digit base-10 values, which compress poorly when converted to float32? This is a common issue with converting these types of per-genotype floating point values, which take up a lot of space and don't contain a great deal of information.
Yes, that makes sense.
> I think they need to be treated with some specific filters to get good performance.
Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?
I'd be interested in trying out different filters if I could get some representative data to try them on.
> Do you think the numcodecs filters like Quantize or Bitround would be suitable, or did you have something else in mind?
Yes, these are the types of things that should help I would imagine. There's a relatively small number of discrete things that are being stored (100 or 1000) and filters like this should map them into the right integer range to store with one byte rather than four.
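(To make that concrete, here is a toy sketch with made-up data, not the real dataset: it compares zstd-compressed sizes of a two-decimal float32 array stored as-is, mapped to a single byte with FixedScaleOffset, and rounded with Quantize.)

import numpy as np
from numcodecs import Blosc, FixedScaleOffset, Quantize

# Toy stand-in for a dosage-like field: two-decimal values in [0, 2].
rng = np.random.default_rng(0)
ds = np.round(rng.uniform(0, 2, size=(100_000, 100)), 2).astype("f4")

compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)

as_float32 = compressor.encode(ds)
as_bytes = compressor.encode(
    FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1").encode(ds)
)
as_quantized = compressor.encode(Quantize(digits=2, dtype="f4").encode(ds))

print(len(as_float32), len(as_bytes), len(as_quantized))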
I had a quick look around, but can't find any imputed datasets lying around. Any thoughts on where we'd find something representative @elswob?
I now have a representative VCF that is 17MB in size (compressed).
Running vcf_to_zarr on it with the default settings:
vcf_to_zarr(input=vcf_file, fields=["INFO/*", "FORMAT/*"], output=output, max_alt_alleles=1)
produced Zarr files totalling 37MB in size, larger than the original VCF, as @elswob reported.
The following reduced the size to 14MB:
from numcodecs import BZ2, Blosc, Delta, FixedScaleOffset
from sgkit.io.vcf import vcf_to_zarr

# zstd via Blosc at a fairly high level (BZ2 is used in the bzip2 variant below)
compressor = Blosc(cname="zstd", clevel=7, shuffle=Blosc.AUTOSHUFFLE)

encoding = {
    "variant_position": {
        # POS is monotonically increasing within a contig, so storing deltas
        # gives small values that compress well
        "filters": [Delta(dtype="i4", astype="i4")],
        "compressor": compressor,
    },
    "variant_AF": {
        # keep 4 decimal places, stored as an unsigned 16-bit integer
        "filters": [FixedScaleOffset(offset=0, scale=10000, dtype="f4", astype="u2")],
        "compressor": compressor,
    },
    "call_DS": {
        # keep 2 decimal places, stored as a single byte
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
    "variant_DR2": {
        # keep 2 decimal places, stored as a single byte
        "filters": [FixedScaleOffset(offset=0, scale=100, dtype="f4", astype="u1")],
        "compressor": compressor,
    },
}

vcf_to_zarr(
    input=vcf_file,
    fields=["INFO/*", "FORMAT/*"],
    output=output,
    chunk_length=500_000,
    max_alt_alleles=1,
    compressor=compressor,
    encoding=encoding,
)
There are a couple of things here that helped:
- FixedScaleOffset filters on float fields, so only a few decimal places are stored. (I didn't try Quantize or Bitround, but the effect would probably be similar.)
- Delta on POS, which is a monotonically increasing value (except at chromosome boundaries), so benefits from having this filter to store differences.

I also tried using bzip2 compression:
compressor = BZ2(9)
With this (and reducing the number of decimal places for AF to 2), I got the size down to 8.3MB, albeit with a much slower decompression speed. This is about 25% larger than genozip (which also uses bzip2).
I think there are some things we can incorporate into the defaults (e.g. POS Delta filter), and others we should document (e.g. storing float fields).
@ravwojdyla pointed out that #80 is related.
With #943 I was able to get the Zarr size down to about 16% larger than genozip on the test file (using bzip2).