The sample order strongly affects the compressibility of the final archive. For example, the 1000 Genomes Project VCF files have samples partially sorted by geographical location, which results in considerably better base-level compression than, for example, the HRC dataset, where samples are simply concatenated by participating study. We would like to provide an additional algorithm for preprocessing input data according to a fixed permutation order that is determined a priori. Finding the optimal permutation is NP-complete but can be approximated in various ways. A crude solution could be to take the final PBWT-based permutation order after processing a file and use that fixed order in a second pass of the data.
Although this additional layer of processing is computationally expensive, it would most likely be performed only once per frozen dataset.
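
The crude two-pass idea can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names (`pbwt_final_order`, `apply_order`, `count_runs`) are hypothetical, genotypes are assumed to be binary haplotype alleles held in memory as one list per site, and the run count is only a rough proxy for compressibility.

```python
from typing import List, Sequence


def pbwt_final_order(sites: Sequence[Sequence[int]]) -> List[int]:
    """First pass: return the sample order left after a PBWT-style sweep.

    Assumption: each site is a sequence of 0/1 alleles, one per sample.
    """
    if not sites:
        return []
    order = list(range(len(sites[0])))
    for alleles in sites:
        # Stable partition of the current order: samples carrying the
        # reference allele (0) come first, then all others, with relative
        # order preserved inside each group.
        zeros = [s for s in order if alleles[s] == 0]
        ones = [s for s in order if alleles[s] != 0]
        order = zeros + ones
    return order


def apply_order(sites: Sequence[Sequence[int]], order: Sequence[int]) -> List[List[int]]:
    """Second pass: rewrite every site using one fixed sample order."""
    return [[alleles[s] for s in order] for alleles in sites]


def count_runs(sites: Sequence[Sequence[int]]) -> int:
    """Count runs of identical alleles per site (fewer runs compress better)."""
    return sum(
        sum(1 for i, a in enumerate(alleles) if i == 0 or a != alleles[i - 1])
        for alleles in sites
    )


# Toy input: six samples from two interleaved groups, four sites.
toy = [
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]

order = pbwt_final_order(toy)
reordered = apply_order(toy, order)
print(order, count_runs(toy), count_runs(reordered))
```

On this toy input the fixed order groups similar samples together, so every site collapses into fewer runs. Real data will not separate this cleanly, and the final PBWT order reflects the last sites processed more strongly than earlier ones, which is one reason the approach is only a crude approximation.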