mklarqvist / djinn

C++ library for analysing and storing large-scale cohorts of sequence variant data
Apache License 2.0

Fixed-order permute input sample order #6

Open mklarqvist opened 5 years ago

mklarqvist commented 5 years ago

Sample order strongly affects the compressibility of the final archive. For example, the 1000 Genomes Project VCF files have samples partially sorted by geographical location, which results in considerably better base-level compression than, for example, the HRC dataset, where samples are simply concatenated according to the participating study. We would like to provide an additional algorithm that preprocesses input data according to a fixed permutation order determined a priori. Finding the optimal permutation is NP-complete, but it can be approximated in various ways. A crude solution would be to take the final PBWT-based permutation order after processing a file and use it as the fixed order in a second pass over the data.
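To make the crude two-pass idea concrete, here is a minimal sketch, not djinn's actual API. It assumes biallelic haploid data held in memory as one `0`/`1` vector per site; `Site`, `final_pbwt_order`, `apply_order`, and `count_runs` are hypothetical names introduced only for illustration. The first pass runs the standard PBWT positional-prefix update (a stable partition by allele at each site) and keeps only the final order; the second pass applies that one fixed order to every site.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical representation: one allele (0 or 1) per sample at a site.
using Site = std::vector<std::uint8_t>;

// Pass 1: run the PBWT update over all sites and return the final order.
// At each site the current order is stably partitioned: samples carrying
// allele 0 come first, then samples carrying allele 1.
std::vector<std::size_t> final_pbwt_order(const std::vector<Site>& sites,
                                          std::size_t n_samples) {
    std::vector<std::size_t> order(n_samples);
    std::iota(order.begin(), order.end(), 0);  // start from the input order

    std::vector<std::size_t> zeros, ones;
    for (const Site& site : sites) {
        assert(site.size() == n_samples);
        zeros.clear();
        ones.clear();
        for (std::size_t idx : order) {
            (site[idx] == 0 ? zeros : ones).push_back(idx);
        }
        order.assign(zeros.begin(), zeros.end());
        order.insert(order.end(), ones.begin(), ones.end());
    }
    return order;
}

// Pass 2: reorder a site with the fixed permutation from pass 1.
Site apply_order(const Site& site, const std::vector<std::size_t>& order) {
    Site out(site.size());
    for (std::size_t i = 0; i < order.size(); ++i) {
        out[i] = site[order[i]];
    }
    return out;
}

// Number of runs of identical alleles; fewer runs suggests the column
// is friendlier to run-length style encoders.
std::size_t count_runs(const Site& site) {
    if (site.empty()) return 0;
    std::size_t runs = 1;
    for (std::size_t i = 1; i < site.size(); ++i) {
        if (site[i] != site[i - 1]) ++runs;
    }
    return runs;
}
```

Note that the final PBWT order sorts samples by their reversed allele prefix ending at the last site, so this fixed order favours sites near the end of the file. Whether it helps across the whole file would need to be measured, for instance by comparing `count_runs` totals before and after permuting.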

Although this additional layer of processing is computationally expensive, it would most likely be performed only once per frozen dataset.