samtools / htslib

C library for high-throughput sequencing data formats

Fix n-squared complexity in sample line with many adjacent tabs #1503

Closed daviesrob closed 2 years ago

daviesrob commented 2 years ago

This could be triggered by a #CHROM line ending in something like:

#CHROM\t...\tINFO\t\t\t\t\t\t ... many tabs ... \t\t\tfoo\n

Between each pair of adjacent tabs, bcf_hdr_add_sample_len() was called with len = 0, which it treated as if the call had come from bcf_hdr_add_sample(). It therefore used strlen(s) instead of 0 as the sample name length, adding a bogus sample name with lots of leading tabs. The sample line parser then moved on to the next tab and did the same thing again with one fewer leading tab, so a run of n tabs cost on the order of n² work.
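To make the cost concrete, here is a toy model of the tab-splitting loop. It is not the htslib parser: `toy_scanned_bytes` and its `zero_means_strlen` flag are hypothetical names invented for illustration, and the function only counts how many bytes end up inside sample names under the two interpretations of len = 0.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only (hypothetical, not htslib code): walk a tab-separated
 * sample area and count the bytes that would be treated as sample-name
 * data. With zero_means_strlen set, an empty field is expanded to the
 * whole remainder of the line, mimicking the behaviour described above. */
size_t toy_scanned_bytes(const char *line, int zero_means_strlen)
{
    size_t scanned = 0;
    const char *p = line;

    for (;;) {
        const char *q = strchr(p, '\t');            /* end of this field */
        size_t len = q ? (size_t)(q - p) : strlen(p);

        if (len == 0 && zero_means_strlen)
            scanned += strlen(p);   /* bogus name: everything up to end of line */
        else
            scanned += len;         /* passed-in length is trusted as-is */

        if (!q)
            break;                  /* last field reached */
        p = q + 1;                  /* move past the tab, one fewer leading tab */
    }
    return scanned;
}
```

For the input "\t\t\t\tfoo", the strlen interpretation touches 7 + 6 + 5 + 4 + 3 = 25 bytes, while trusting the zero length touches only the 3 bytes of "foo". The first figure grows quadratically with the number of tabs, the second does not depend on them.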

Fix by making bcf_hdr_add_sample_len() always use the passed-in length, even if 0, allowing the empty sample name trap to do its work. bcf_hdr_add_sample() is updated to call strlen() itself, and to also deal with the backwards-compatibility check where it was permissible to call it with a NULL string.
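The shape of the fix can be sketched roughly as follows. This is a simplified illustration, not the actual patch: `toy_add_sample_len` and `toy_add_sample` are hypothetical stand-ins, and the return values (-1 for an empty name, 0 for a NULL string) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the length-taking function. It always trusts
 * the passed-in length, so len == 0 reaches the empty-name check instead
 * of being replaced by strlen(s). */
int toy_add_sample_len(const char *s, size_t len)
{
    assert(s != NULL);      /* callers of this variant must pass a string */
    if (len == 0)
        return -1;          /* empty sample name trap (assumed error code) */
    /* A real implementation would store s[0..len) in the header here. */
    return 0;
}

/* Hypothetical stand-in for the NUL-terminated wrapper. It now computes
 * the length itself and keeps the legacy NULL call working. */
int toy_add_sample(const char *s)
{
    if (s == NULL)
        return 0;           /* assumed: legacy NULL call is a harmless no-op */
    return toy_add_sample_len(s, strlen(s));
}
```

With this split, a parser that passes len = 0 for the gap between two adjacent tabs gets an immediate error instead of a bogus name, while callers of the string-based wrapper see no change in behaviour.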