This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
Consolidate temporary files when they become two numerous
Consolidation happens in layers of up to MAX_TMP_FILES_PER_LAYER (32) files. Files in full layers are merged into the next layer with a free slot. This ensures the number of comparisons remains O(n log n) even if lots of merges have to be done. The cost of doing this is the time needed to write and read back the data, and is in the worst case number of layers used * the size of the input. The benefit is to avoid I/O thrashing due top too many files; or completely failing due to running out of file descriptors.
This also turns on compression for the temporary files. There is a time penalty for this, but in testing it reduces the amount of storage needed by a factor of about four.
Make bcftools sort stable, so ties always retain the ordering they had in the input file
Neither qsort() nor heap sort are guaranteed to be stable. To ensure stability, add tie-breakers for qsort() based on the location in the memory buffer, and for heap sort based on the number of the input file records came from.
Pack more records in memory when sorting
As bcf1_t is quite a big structure, it adds quite a lot of overhead if the records being sorted are small (e.g. single sample gVCF). This overhead can be reduced by storing the data in a more compact form. On a test file with approx. 61 characters per VCF line, up to four times as many records could be stored before having to spill them.
Consolidation happens in layers of up to
MAX_TMP_FILES_PER_LAYER
(32) files. Files in full layers are merged into the next layer with a free slot. This ensures the number of comparisons remains O(n log n) even if lots of merges have to be done. The cost of doing this is the time needed to write and read back the data, and is in the worst case number of layers used * the size of the input. The benefit is to avoid I/O thrashing due top too many files; or completely failing due to running out of file descriptors.This also turns on compression for the temporary files. There is a time penalty for this, but in testing it reduces the amount of storage needed by a factor of about four.
Neither
qsort()
nor heap sort are guaranteed to be stable. To ensure stability, add tie-breakers forqsort()
based on the location in the memory buffer, and for heap sort based on the number of the input file records came from.As bcf1_t is quite a big structure, it adds quite a lot of overhead if the records being sorted are small (e.g. single sample gVCF). This overhead can be reduced by storing the data in a more compact form. On a test file with approx. 61 characters per VCF line, up to four times as many records could be stored before having to spill them.
Fixes #2231