statgen / savvy

Interface to various variant calling formats.
Mozilla Public License 2.0

Parameter trade-off suggestions #16

Closed rick-heig closed 2 years ago

rick-heig commented 2 years ago

Hello. What would be good starting values for the parameters listed in the "Parameter Trade-offs" section of the README?

If I were to compress GT data with sav, would the following command be acceptable?

sav import --phasing full --pbwt-field "GT" <input> <output>

Or do I need to specify a block size for it to work effectively?

Could you share settings that gave a good balance between size and speed? (I understand this is dataset dependent, but, e.g., the settings used on TOPMed.) Or do you have any general recommendations?

Thanks! Best regards.

jonathonl commented 2 years ago

For GT, I recommend:

sav import -6 --phasing full <input> <output>

-6 raises the zstd compression level from the current default of 3. This default will likely be increased to 6 in a future release. Using -10 or -19 will reduce file size further but slow down compression (decompression speed is unaffected).

GT data is imported as sparse vectors by default, which provides good [de]serialization speed and does a decent job of reducing file size. I recommend enabling PBWT only when a small file size is your top priority. In that case, you will need to specify a sparse vector threshold: variants with allele frequencies above this threshold will be stored as dense vectors with PBWT applied, and variants below it will be stored as sparse vectors without PBWT.

sav import -6 --phasing full --sparse-threshold 0.001 --pbwt-fields "GT" <input> <output>
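
To make the threshold rule concrete, here is a minimal Python sketch of the per-variant decision described above. It is an illustration only, not savvy's actual implementation: the names `allele_frequency` and `encode_variant` are hypothetical, allele frequency is simplified to the fraction of non-reference entries, the PBWT step itself is omitted, and treating a frequency exactly equal to the threshold as sparse is an assumption.

```python
from typing import List, Tuple, Union

# A sparse vector keeps only the non-reference entries as (index, allele) pairs.
SparseVector = List[Tuple[int, int]]
DenseVector = List[int]


def allele_frequency(genotypes: List[int]) -> float:
    """Fraction of haplotypes carrying a non-reference allele (simplified)."""
    return sum(1 for g in genotypes if g != 0) / len(genotypes)


def encode_variant(
    genotypes: List[int], sparse_threshold: float = 0.001
) -> Tuple[str, Union[SparseVector, DenseVector]]:
    """Pick a storage layout for one variant based on its allele frequency.

    Above the threshold: keep every entry (dense). This is where PBWT
    would be applied in the scheme described above; it is omitted here.
    At or below the threshold (assumption for the equality case): keep
    only the non-reference entries (sparse).
    """
    if allele_frequency(genotypes) > sparse_threshold:
        return ("dense", list(genotypes))
    return ("sparse", [(i, g) for i, g in enumerate(genotypes) if g != 0])


# Usage: one rare variant (1 carrier in 4000 haplotypes) and one common variant.
rare = [0] * 4000
rare[7] = 1
common = [0, 1] * 2000

print(encode_variant(rare)[0])    # layout chosen for the rare variant
print(encode_variant(common)[0])  # layout chosen for the common variant
```

With a threshold of 0.001, the rare variant (frequency 0.00025) stays sparse and the common variant (frequency 0.5) is stored densely, which is the split the command above asks for.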

rick-heig commented 2 years ago

Thank you for the recommendations and explanation. This is exactly what I was looking for.