samtools / htslib

C library for high-throughput sequencing data formats
Other
800 stars 446 forks source link

Avoid sprintf in cram_to_bam #150

Closed lomereiter closed 9 years ago

lomereiter commented 9 years ago

Performance killer, takes 60% of the total time spent in the function. By the way, there's a reasonable FIXME there about including container numbers in the name, but that's another matter.

callgrind

lomereiter commented 9 years ago

FWIW, here's the improvement using kstring_t

diff --git cram/cram_decode.c cram/cram_decode.c
index f6f6822..02b90e7 100644
--- cram/cram_decode.c
+++ cram/cram_decode.c
@@ -52,6 +52,8 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 #include "cram/os.h"
 #include "cram/md5.h"

+#include "htslib/kstring.h"
+
 //Whether CIGAR has just M or uses = and X to indicate match and mismatch
 //#define USE_X

@@ -2315,6 +2317,7 @@ int cram_to_bam(SAM_hdr *bfd, cram_fd *fd, cram_slice *s,
     int bam_idx, rg_len;
     char name_a[1024], *name;
     int name_len;
+    kstring_t name_s = { 0, 1024, name_a };
     char *aux, *aux_orig;
     char *seq, *qual;

@@ -2325,14 +2328,20 @@ int cram_to_bam(SAM_hdr *bfd, cram_fd *fd, cram_slice *s,
        name_len = cr->name_len;
    } else {
        // FIXME: add prefix, container number, slice number, etc
-       name = name_a;
+       kputs(fd->prefix, &name_s);
+       kputc(':', &name_s);
+       kputl(s->id, &name_s);
+       kputc(':', &name_s);

        if (cr->mate_line >= 0 && cr->mate_line < rec)
-       name_len = sprintf(name_a, "%s:%"PRId64":%d",
-                  fd->prefix, s->id, cr->mate_line);
+           kputw(cr->mate_line, &name_s);
        else
-       name_len = sprintf(name_a, "%s:%"PRId64":%d",
-                  fd->prefix, s->id, rec);
+           kputw(rec, &name_s);
+
+        assert(name_a == name_s.s);
+        name = name_a;
+        name_len = name_s.l;
    }
     } else {
    name = "?";
jmarshall commented 9 years ago

I think that counts as kstring_t abuse: it'll explode in kputs if it needs to reallocate the buffer, and there is no limit on the length of fd->prefix. However something similar can be done...

Avoiding sprintf here could save up to 60% of the time in this function, but what effect does this have on overall execution time? Do you have timing figures for samtools view -bo output.bam hugeinput.cram (or similar) comparing your modified cram_to_bam() to the sprintf-using one?

lomereiter commented 9 years ago

Read names in BAM can't be longer than 255 bytes anyway, so if fd->prefix is too long, it should be properly handled.

I noticed the difference when running samtools view -c (converts everything into BAM internally):

# before
$ time ./samtools view -c $CRAM_FILE 20
3377663

real    0m5.693s
user    0m5.443s
sys     0m0.253s

#after
$ time ./samtools view -c $CRAM_FILE 20
3377663

real    0m4.873s
user    0m4.677s
sys     0m0.203s

I agree that when BAM writing is added upon that (+5 to +45 seconds depending on the chosen compression level), the effect becomes almost invisible. Thus, you are free to close this issue if you don't share my perfectionist attitude :-)

jkbonfield commented 9 years ago

On Wed, Nov 26, 2014 at 06:48:14AM -0800, Artem Tarasov wrote:

I agree that when BAM writing is added upon that (+5 to +45 seconds depending on the chosen compression level), the effect becomes almost invisible. Thus, you are free to close this issue if you don't share my perfectionist attitude :-)

I've gone through elsewhere doing similar things. Printf (& sscanf) are just diabolically slow. Indeed the initial round of using kputw in samtools many years ago was a result of me pointing it out to Heng. :-)

So I'm happy with improving this, but I haven't yet had a chance to look into the patch. I'm befuddled with a cold atm but will check this out soon.

Thanks for looking into it.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jkbonfield commented 9 years ago

This code needs a rewrite anyway. I see it doesn't adhere to the CRAM v2 definition of having the record number start point in the CRAM slice. Rather I just count up slice numbers from the start of the file and generate names like prefix:1:1, prefix:1:2, ... prefix:1: then prefix:2:1, etc.

To allow random access to generate the same names as non-random access we changed the specification such that the starting record number was coded in the slice header itself. The expectation then is just prefix:N to prefix:M. It doesn't address the issue of sprintf, but I'll look into this more. If needs be I'll just have a reusable kstring_t (using malloc as per the klib wishes) in the cram_fd master structure.

jkbonfield commented 9 years ago

I've found another problem with this fix. In CRAMv3 the record counter is a 64-bit quantity. While kputl uses type long, which is sufficient on some platforms, it is not guaranteed. klib has no concept of int32_t or int64_t so we're kind of over a barrel on that front.

We either need to reimplement it ourselves (as was already done in append_uint in cram_io.h) or fix klib and push a patch back upstream. I'm tempted to say just a local function to encode uint64_t (technically it's signed in the specification, but there is no case the counter can ever be negative) to a buffer is the cleaner solution that trying to abuse klib to write to our own buffer. (Or ideally have klib internally allowing number formatting to be externalised a little bit.)

Any thoughts?

jmarshall commented 9 years ago

Indeed. I was toying with giving bam_construct_seq() the qname in two halves (patch available on request) and just writing it out inline as something like

+           uint64_t x;
+           char *s = &suffix_buffer[sizeof suffix_buffer];
+
            // FIXME: add prefix, container number, slice number, etc
-           name = name_a;
+           name = fd->prefix;
+           name_len = strlen(fd->prefix);
+
+           /* write cr->mate_line / rec backwards to s */
+
+           *--s = ':';
+           if (s->id == 0) *--s = '0';
+           else for (x = s->id; x > 0; x /= 10) *--s = x%10 + '0';
+           *--s = ':';

+           suffix = s;
+           suffix_len = &suffix_buffer[sizeof suffix_buffer] - s;

...

     bam_idx = bam_construct_seq(bam, cr->aux_size + rg_len,
-                               name, name_len,
+                               name, name_len, suffix, suffix_len,
                                cr->flags,
                                cr->ref_id,
                                cr->apos,

...if you see what I mean. Mad prematurely-optimizing nutters 'r' us :smile:

jkbonfield commented 9 years ago

On Thu, Nov 27, 2014 at 06:32:20AM -0800, John Marshall wrote:

  • *--s = ':';
  • if (s->id == 0) *--s = '0';
  • else for (x = s->id; x > 0; x /= 10) *--s = x%10 + '0';
  • *--s = ':';

Given s->id was incorrect (it predated having s->hdr->record_counter in the CRAM spec, and that while s->id was int32 (4 billion slices is enough), record_counter is not (4 billion sequences is NOT enough), then this needs to work on a 64-bit quantity.

If we're talking about overly optimising, then note that division of 64 bit quantities is going to be slow, so successive /= 10 and %10 are going to be too time consuming.

An alternative, is to break the 64-bit quantity into 2-3 quantities between 0 and 1000000000 (ie 32-bit) and convert these using the existing integer encoding code. With a little bit of tweaking to handle leading zeros correctly. This means we do lots of 32-bit divisions and only a few 64-bit ones. Overly optimised? Hell yes, but wasn't that the whole point of this? :-)

Eg:

static inline unsigned char append_uint64(unsigned char cp, uint64_t i) { uint64_t j;

if (i <= 0xffffffff)
    return append_uint32(cp, i);

if ((j = i/1000000000) > 1000000000) {
    cp = append_uint32(cp, j/1000000000);
    j %= 1000000000;
    cp = append_sub32(cp, j);
} else {
    cp = append_uint32(cp, i / 1000000000);
}
cp = append_sub32(cp, i % 1000000000);

}

Patch incoming soon for evaluation.

James

James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

jkbonfield commented 9 years ago

Still unsure what's the tidiest way to do this, but rightly or wrongly we alread had a (hideous) integer encoding function so I just ran with it further.

In fixing this though I noticed another range-query+EOF bug to fix. Joy!

jkbonfield commented 9 years ago

It looks like this was fixed with PR #151 above, but failed to correctly report that it fixes this issue to it didn't auto-close.