statgen / savvy

Interface to various variant calling formats.
Mozilla Public License 2.0

Understanding sav benefits. #11

Open lbergelson opened 3 years ago

lbergelson commented 3 years ago

I'm trying to catch up on what's been going on in the world of alternate VCF representations, and I'm trying to understand what the benefits of savvy are vs. BCF. I've run into a few questions.

  1. It seems like the big difference is the addition of a sparse vector type. The random VCF files I've tried savving haven't seen any appreciable size improvement from running sav import on them, though, so I was wondering if you had some examples of files that benefit from using savvy. I suspect I'm either using files that don't particularly benefit from the sparsity reduction, or I've misconfigured my import.

  2. I don't understand how PBWT is used by sav files and what benefit that gives. Does it only apply to genotype fields? I tried looking in the code, but I couldn't find where it actually computes PBWT. It seems like it's just tagging fields as being PBWT sorted? Is this passing through something processed upstream and just acting as a marker for it? How is this intended to be used? I'm not really a C++ programmer so I may have just missed something obvious.

  3. From what I can tell sav doesn't directly address the problem of encoding GVCF files efficiently. (Although they could probably benefit from the sparse vector type when encoding sparse PLs.) Is that outside of the mandate of the sav format?
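
To make the sparsity point in question 1 concrete, here is a minimal Python sketch of the general sparse-vector idea (an illustration only, not savvy's actual on-disk encoding): for rare variants, almost every sample carries the reference value, so storing only the non-zero offsets and values can be much smaller than a dense array.

```python
# Hypothetical sketch of a sparse genotype vector: keep only (offset, value)
# pairs for non-zero entries. This illustrates the concept, not savvy's format.

def to_sparse(dense):
    """Return (length, [(offset, value), ...]) keeping only non-zero entries."""
    return len(dense), [(i, v) for i, v in enumerate(dense) if v != 0]

def to_dense(length, pairs):
    """Reconstruct the dense vector from its sparse representation."""
    out = [0] * length
    for i, v in pairs:
        out[i] = v
    return out

# A rare variant across 20 haplotypes: only two carry the alternate allele.
gt = [0] * 8 + [1] + [0] * 7 + [1] + [0] * 3
length, sparse = to_sparse(gt)
assert to_dense(length, sparse) == gt
# Dense storage holds 20 entries; the sparse form holds only 2 pairs.
```

This is also why a dataset with mostly common variants (or dense FORMAT fields like PL on every sample) sees little benefit from sparsity alone.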

Thank you. Let me know if there's a better forum for asking general non-code questions about savvy.

jonathonl commented 3 years ago

SAV will show the most improvement for datasets with large sample sizes that contain either WGS genotypes (GT-only) or imputed genotypes stored as DS or HDS. Non-sparse data can see improvements when enabling PBWT for those fields.

  1. Can you tell me a little more about these VCF files you've experimented with:

    • How many samples?
    • Which FMT fields are stored?
    • Were they called from whole genome sequencing, exome sequencing, genotyping array, ...?
  2. PBWT has been shown to be effective on GT, DP, AD, PL, and GQ (the former more so than the latter). It will usually work well for the same types of data that benefit from compressed columnar storage. The transformation occurs immediately before/after serializing/deserializing. Currently, you must specify which fields you want it applied to via --pbwt-fields.

  3. If by GVCF you are referring to single sample VCFs with depth and genotype likelihood information, then no. SAV is not a solution for improving the compression of single sample VCFs. For those, a columnar storage format would work better. With that said, SAV should not be any worse than BCF for this scenario.
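
Since question 2 asked where the PBWT actually happens, here is a minimal Python sketch of the general positional Burrows-Wheeler transform (Durbin's algorithm in spirit; this is an illustration of the technique, not savvy's C++ implementation). At each site, haplotypes are reordered by their allele-prefix history, which tends to group identical alleles into long runs that a general-purpose compressor handles well.

```python
# Minimal PBWT sketch over binary haplotypes: at each site, emit the allele
# column in the current prefix order, then stably partition the order so
# haplotypes with allele 0 come before those with allele 1.

def pbwt_columns(haplotypes):
    """Yield each site's allele column, reordered by PBWT prefix sorting."""
    n_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))  # current prefix ordering
    for k in range(n_sites):
        yield [haplotypes[i][k] for i in order]
        # Stable partition: zeros first, ones second, relative order preserved.
        order = [i for i in order if haplotypes[i][k] == 0] + \
                [i for i in order if haplotypes[i][k] == 1]

# Four toy haplotypes, four sites.
haps = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]
for col in pbwt_columns(haps):
    print(col)
```

Because the transform is just a per-site permutation of sample order, it can be applied right before serialization and inverted right after deserialization, which matches the description above.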

We are working on a manuscript that will provide a more exhaustive comparison with BCF and other file formats, but I can give you a sneak peek with 1000 Genomes data. While SAV does well at compressing 1000g, the improvements compared to BCF are much greater when you scale to hundreds of thousands of samples.

curl ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz > chr20.vcf.gz
bcftools view chr20.vcf.gz -Ob > chr20.bcf
sav import chr20.vcf.gz --phasing full > chr20.sav
sav import chr20.vcf.gz --phasing full --pbwt-fields GT --sparse-threshold 0.01 > chr20.pbwt.sav
sav import chr20.vcf.gz --phasing full --pbwt-fields GT --sparse-threshold 0.01 -19 > chr20.pbwt.c19.sav
Format        File Size (proportion to BCF)
vcf           312M (1.20)
bcf           260M (1.00)
sav           102M (0.39)
sav.pbwt       69M (0.27)
sav.pbwt.c19   56M (0.22)

lbergelson commented 3 years ago

@jonathonl Thank you! That's very helpful.

  1. I tried on a few misc VCFs I had lying around, but the widest one was a 1000G sample. Sav just didn't seem to make any difference compared to BCF with what I tested. I figured you must have some good examples lying around, so I just asked instead of doing much experimentation. The 1000G file I used included these fields: GT:AD:DP:GQ:PL.

  2. I guess I'm confused about how PBWT works. I thought it could only be applied to fields with a haplotype-like structure. Do you sort based on GT and then see some compression improvement from sorting other fields in the same order as the GT?
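
The reordering this question is asking about can be sketched as follows (an illustration of the idea only, not necessarily what savvy does): compute a sample permutation from the GT allele-prefix history, then store another FORMAT field (here a hypothetical DP column) permuted into that same order, hoping that samples with similar haplotypes also have similar values.

```python
# Hypothetical sketch: derive a PBWT-style sample order from GT columns,
# then apply that same permutation to a different field's values.

def prefix_order(gt_columns, n_samples):
    """Order sample indices by their allele history, zeros-first at each site."""
    order = list(range(n_samples))
    for col in gt_columns:
        order = [i for i in order if col[i] == 0] + \
                [i for i in order if col[i] == 1]
    return order

gt_cols = [[0, 1, 0, 1, 0], [1, 1, 0, 0, 1]]  # two sites, five samples
dp = [30, 12, 31, 11, 29]                     # made-up depths for illustration
order = prefix_order(gt_cols, 5)
dp_reordered = [dp[i] for i in order]
print(order, dp_reordered)
```

Note that per the earlier answer, savvy applies PBWT per field via --pbwt-fields rather than necessarily reusing one GT-derived order, so treat this only as a sketch of the question's premise.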

  3. By GVCF I mean both single-sample files with depth and additional fields like likelihoods, and large-scale combined GVCFs which include that information for many samples. It's unclear to me how much sav benefits that use case.