Closed rick-heig closed 2 years ago
For GT, I recommend:
sav import -6 --phasing full <input> <output>
The -6 option raises the zstd compression level from the current default of 3. This default will likely be increased to 6 in a future release. Using -10 or -19 will decrease file size further but slow down compression (decompression speed is unaffected).
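Since the best level is dataset dependent, one way to pick it is to import the same file at a few levels and compare wall-clock time against output size. The Python sketch below does that using only the options shown in this thread. The helper names (`build_import_cmd`, `benchmark_levels`) and the output file naming are hypothetical, and it assumes `sav` is on your PATH.

```python
import os
import subprocess
import time


def build_import_cmd(level, src, dst):
    """Build the `sav import` argument list for a given zstd level."""
    # Only options that appear in this thread are used here.
    return ["sav", "import", f"-{level}", "--phasing", "full", src, dst]


def benchmark_levels(src, levels=(6, 10, 19), out_prefix="bench"):
    """Import `src` once per level and return (level, seconds, bytes) tuples."""
    results = []
    for level in levels:
        dst = f"{out_prefix}.l{level}.sav"  # hypothetical naming scheme
        start = time.perf_counter()
        subprocess.run(build_import_cmd(level, src, dst), check=True)
        elapsed = time.perf_counter() - start
        results.append((level, elapsed, os.path.getsize(dst)))
    return results
```

Calling `benchmark_levels` on a representative subset of your data (a single chromosome, for example) should show roughly where the size gain stops being worth the extra compression time.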
GT data is imported as sparse vectors by default, which provides good [de]serialization speed and does a decent job of reducing file size. I recommend enabling PBWT only when a small file size is your top priority. In that case, you will need to specify a sparse vector threshold: variants with allele frequencies above this threshold are stored as dense vectors with PBWT applied, and variants below it are stored as sparse vectors without PBWT.
sav import -6 --phasing full --sparse-threshold 0.001 --pbwt-fields "GT" <input> <output>
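To make the threshold rule concrete, here is a minimal Python sketch of the per-variant decision described above. It is an illustration, not sav's implementation: the function names are hypothetical, and both the way the allele frequency is computed (non-reference alleles over called alleles) and the handling of a frequency exactly equal to the threshold are assumptions.

```python
def alt_allele_frequency(genotypes):
    """Fraction of non-reference alleles among the called alleles of one variant.

    `genotypes` is a list of per-sample allele tuples such as (0, 1);
    None marks a missing allele and is ignored.
    """
    called = [a for gt in genotypes for a in gt if a is not None]
    if not called:
        return 0.0
    return sum(1 for a in called if a != 0) / len(called)


def storage_for(genotypes, sparse_threshold=0.001):
    """Pick the representation described in the text for one variant."""
    # Assumption: a frequency exactly at the threshold is treated as sparse.
    if alt_allele_frequency(genotypes) > sparse_threshold:
        return "dense+pbwt"
    return "sparse"


# Toy cohort of 1000 diploid samples (2000 alleles per variant).
rare = [(0, 1)] + [(0, 0)] * 999          # 1 alt allele    -> AF 0.0005
common = [(0, 1)] * 200 + [(0, 0)] * 800  # 200 alt alleles -> AF 0.1

print(storage_for(rare))    # sparse
print(storage_for(common))  # dense+pbwt
```

With a threshold of 0.001, only variants carried by more than roughly 1 in 1000 alleles go through the dense PBWT path, so the bulk of rare variants keep the faster sparse representation.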
Thank you for the recommendations and explanation, this is exactly what I was looking for.
Hello. What would be good starting values for the parameters listed in the "Parameter Trade-offs" section of the README?
If I compress GT data with sav, would the following command seem acceptable? Or do I need to specify a block size for it to work effectively?
Could you share settings that gave a good balance between size and speed? (I understand this is dataset dependent, but e.g. as used on TOPMed.) Or do you have any general recommendations?
Thanks! Best regards.