samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
663 stars 240 forks source link

Prevent bcftools sort from opening too many temporary files, and make it stable #2252

Closed daviesrob closed 1 month ago

daviesrob commented 1 month ago

Consolidation happens in layers of up to MAX_TMP_FILES_PER_LAYER (32) files. Files in full layers are merged into the next layer with a free slot. This ensures the number of comparisons remains O(n log n) even if lots of merges have to be done. The cost of doing this is the time needed to write and read back the data, and is in the worst case number of layers used * the size of the input. The benefit is to avoid I/O thrashing due top too many files; or completely failing due to running out of file descriptors.

This also turns on compression for the temporary files. There is a time penalty for this, but in testing it reduces the amount of storage needed by a factor of about four.

Neither qsort() nor heap sort are guaranteed to be stable. To ensure stability, add tie-breakers for qsort() based on the location in the memory buffer, and for heap sort based on the number of the input file records came from.

As bcf1_t is quite a big structure, it adds quite a lot of overhead if the records being sorted are small (e.g. single sample gVCF). This overhead can be reduced by storing the data in a more compact form. On a test file with approx. 61 characters per VCF line, up to four times as many records could be stored before having to spill them.

Fixes #2231