Closed lomereiter closed 9 years ago
FWIW, here's the improvement using kstring_t
diff --git cram/cram_decode.c cram/cram_decode.c
index f6f6822..02b90e7 100644
--- cram/cram_decode.c
+++ cram/cram_decode.c
@@ -52,6 +52,8 @@ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#include "cram/os.h"
#include "cram/md5.h"
+#include "htslib/kstring.h"
+
//Whether CIGAR has just M or uses = and X to indicate match and mismatch
//#define USE_X
@@ -2315,6 +2317,7 @@ int cram_to_bam(SAM_hdr *bfd, cram_fd *fd, cram_slice *s,
int bam_idx, rg_len;
char name_a[1024], *name;
int name_len;
+ kstring_t name_s = { 0, 1024, name_a };
char *aux, *aux_orig;
char *seq, *qual;
@@ -2325,14 +2328,20 @@ int cram_to_bam(SAM_hdr *bfd, cram_fd *fd, cram_slice *s,
name_len = cr->name_len;
} else {
// FIXME: add prefix, container number, slice number, etc
- name = name_a;
+ kputs(fd->prefix, &name_s);
+ kputc(':', &name_s);
+ kputl(s->id, &name_s);
+ kputc(':', &name_s);
if (cr->mate_line >= 0 && cr->mate_line < rec)
- name_len = sprintf(name_a, "%s:%"PRId64":%d",
- fd->prefix, s->id, cr->mate_line);
+ kputw(cr->mate_line, &name_s);
else
- name_len = sprintf(name_a, "%s:%"PRId64":%d",
- fd->prefix, s->id, rec);
+ kputw(rec, &name_s);
+
+ assert(name_a == name_s.s);
+ name = name_a;
+ name_len = name_s.l;
}
} else {
name = "?";
I think that counts as kstring_t
abuse: it'll explode in kputs
if it needs to reallocate the buffer, and there is no limit on the length of fd->prefix
. However something similar can be done...
Avoiding sprintf
here could save up to 60% of the time in this function, but what effect does this have on overall execution time? Do you have timing figures for samtools view -bo output.bam hugeinput.cram
(or similar) comparing your modified cram_to_bam()
to the sprintf-using one?
Read names in BAM can't be longer than 255 bytes anyway, so if fd->prefix
is too long, it should be properly handled.
I noticed the difference when running samtools view -c
(converts everything into BAM internally):
# before
$ time ./samtools view -c $CRAM_FILE 20
3377663
real 0m5.693s
user 0m5.443s
sys 0m0.253s
#after
$ time ./samtools view -c $CRAM_FILE 20
3377663
real 0m4.873s
user 0m4.677s
sys 0m0.203s
I agree that when BAM writing is added upon that (+5 to +45 seconds depending on the chosen compression level), the effect becomes almost invisible. Thus, you are free to close this issue if you don't share my perfectionist attitude :-)
On Wed, Nov 26, 2014 at 06:48:14AM -0800, Artem Tarasov wrote:
I agree that when BAM writing is added upon that (+5 to +45 seconds depending on the chosen compression level), the effect becomes almost invisible. Thus, you are free to close this issue if you don't share my perfectionist attitude :-)
I've gone through elsewhere doing similar things. Printf (& sscanf) are just diabolically slow. Indeed the initial round of using kputw in samtools many years ago was a result of me pointing it out to Heng. :-)
So I'm happy with improving this, but I haven't yet had a chance to look into the patch. I'm befuddled with a cold atm but will check this out soon.
Thanks for looking into it.
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
This code needs a rewrite anyway. I see it doesn't adhere to the CRAM v2 definition of having the record number start point in the CRAM slice. Rather I just count up slice numbers from the start of the file and generate names like prefix:1:1, prefix:1:2, ... prefix:1:
To allow random access to generate the same names as non-random access we changed the specification such that the starting record number was coded in the slice header itself. The expectation then is just prefix:N to prefix:M. It doesn't address the issue of sprintf, but I'll look into this more. If needs be I'll just have a reusable kstring_t (using malloc as per the klib wishes) in the cram_fd master structure.
I've found another problem with this fix. In CRAMv3 the record counter is a 64-bit quantity. While kputl uses type long, which is sufficient on some platforms, it is not guaranteed. klib has no concept of int32_t or int64_t so we're kind of over a barrel on that front.
We either need to reimplement it ourselves (as was already done in append_uint in cram_io.h) or fix klib and push a patch back upstream. I'm tempted to say just a local function to encode uint64_t (technically it's signed in the specification, but there is no case the counter can ever be negative) to a buffer is the cleaner solution that trying to abuse klib to write to our own buffer. (Or ideally have klib internally allowing number formatting to be externalised a little bit.)
Any thoughts?
Indeed. I was toying with giving bam_construct_seq()
the qname in two halves (patch available on request) and just writing it out inline as something like
+ uint64_t x;
+ char *s = &suffix_buffer[sizeof suffix_buffer];
+
// FIXME: add prefix, container number, slice number, etc
- name = name_a;
+ name = fd->prefix;
+ name_len = strlen(fd->prefix);
+
+ /* write cr->mate_line / rec backwards to s */
+
+ *--s = ':';
+ if (s->id == 0) *--s = '0';
+ else for (x = s->id; x > 0; x /= 10) *--s = x%10 + '0';
+ *--s = ':';
+ suffix = s;
+ suffix_len = &suffix_buffer[sizeof suffix_buffer] - s;
...
bam_idx = bam_construct_seq(bam, cr->aux_size + rg_len,
- name, name_len,
+ name, name_len, suffix, suffix_len,
cr->flags,
cr->ref_id,
cr->apos,
...if you see what I mean. Mad prematurely-optimizing nutters 'r' us :smile:
On Thu, Nov 27, 2014 at 06:32:20AM -0800, John Marshall wrote:
- *--s = ':';
- if (s->id == 0) *--s = '0';
- else for (x = s->id; x > 0; x /= 10) *--s = x%10 + '0';
- *--s = ':';
Given s->id was incorrect (it predated having s->hdr->record_counter in the CRAM spec, and that while s->id was int32 (4 billion slices is enough), record_counter is not (4 billion sequences is NOT enough), then this needs to work on a 64-bit quantity.
If we're talking about overly optimising, then note that division of 64 bit quantities is going to be slow, so successive /= 10 and %10 are going to be too time consuming.
An alternative, is to break the 64-bit quantity into 2-3 quantities between 0 and 1000000000 (ie 32-bit) and convert these using the existing integer encoding code. With a little bit of tweaking to handle leading zeros correctly. This means we do lots of 32-bit divisions and only a few 64-bit ones. Overly optimised? Hell yes, but wasn't that the whole point of this? :-)
Eg:
static inline unsigned char append_uint64(unsigned char cp, uint64_t i) { uint64_t j;
if (i <= 0xffffffff)
return append_uint32(cp, i);
if ((j = i/1000000000) > 1000000000) {
cp = append_uint32(cp, j/1000000000);
j %= 1000000000;
cp = append_sub32(cp, j);
} else {
cp = append_uint32(cp, i / 1000000000);
}
cp = append_sub32(cp, i % 1000000000);
}
Patch incoming soon for evaluation.
James
James Bonfield (jkb@sanger.ac.uk) | Hora aderat briligi. Nunc et Slythia Tova | Plurima gyrabant gymbolitare vabo; A Staden Package developer: | Et Borogovorum mimzebant undique formae, https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
Still unsure what's the tidiest way to do this, but rightly or wrongly we alread had a (hideous) integer encoding function so I just ran with it further.
In fixing this though I noticed another range-query+EOF bug to fix. Joy!
It looks like this was fixed with PR #151 above, but failed to correctly report that it fixes this issue to it didn't auto-close.
Performance killer, takes 60% of the total time spent in the function. By the way, there's a reasonable FIXME there about including container numbers in the name, but that's another matter.