phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0
25 stars 7 forks source link

Faster reverse complement using translation table #118

Closed peterk87 closed 4 years ago

peterk87 commented 4 years ago

This change should make revcomp much quicker (37X; 231us vs 6us):

# create random nt sequence
arr = np.array(list('A'*1000+'G'*1000+'C'*1000+'T'*1000))
np.random.shuffle(arr)
seq = ''.join(list(arr))
# `seq` is a randomly shuffled 4k nt sequence

# original revcomp function in biohansel
NT_SUB = {x: y for x, y in zip('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')}
def revcomp(s):
    return ''.join([NT_SUB[c] for c in s[::-1]])

# new proposed revcomp function
nt_comp_trans = str.maketrans('acgtrymkswhbvdnxACGTRYMKSWHBVDNX', 'tgcayrkmswdvbhnxTGCAYRKMSWDVBHNX')
def revcomp_trans(s):
    return s.translate(nt_comp_trans)[::-1]

# benchmark:

%%timeit 
revcomp(seq)
# 231 µs ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
revcomp_trans(seq)
# 6.19 µs ± 79.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# new revcomp function produces the same output
revcomp(seq) == revcomp_trans(seq)
# True
glabbe commented 4 years ago

Sounds good to me, thanks @peterk87 !