Small efficiency improvement in the rbit24() utility function for CRC

mdxs commented 6 months ago

I noticed in the commit history that https://github.com/nccgroup/Sniffle/commit/503e794362e8069ee6905b8eef64ced2ceecb0a0 introduced the file python_cli/crc_ble.py in which there is a utility function rbit24() to support reversing the bits in a 24 bit CRC.

It has a comment/remark that it might not be the most efficient implementation: https://github.com/nccgroup/Sniffle/blob/5da809b960ab2d420b5280cb17ed9fd9ce5f6278/python_cli/sniffle/crc_ble.py#L41

However, I couldn't find any bitwise approaches that are faster. I tested for example:

def rbit24_bw(c):
    r, power = 0, 23
    while c:
        r += (c & 1) << power
        c = c >> 1
        power -= 1
    return r

def rbit24_bw2(c):
    r = 0
    for _ in range(24):
        r = (r << 1) | (c & 1)
        c >>= 1
    return r

But they turned out to be much slower.

While looking for alternative approaches, the following is (at least in my testing) consistently faster:

def rbit24_alt(c):
    return int(bin(c)[2:].rjust(24, '0')[::-1], 2)

When tested with the following standalone script (using Python 3.10.12 in XUbuntu guest VM):

# compare implementations of reverse bits function

def rbit24(c):
    return int(f'{c:024b}'[::-1], 2)

def rbit24_alt(c):
    return int(bin(c)[2:].rjust(24, '0')[::-1], 2)

def rbit24_bw(c):
    r, power = 0, 23
    while c:
        r += (c & 1) << power
        c = c >> 1
        power -= 1
    return r

def rbit24_bw2(c):
    r = 0
    for _ in range(24):
        r = (r << 1) | (c & 1)
        c >>= 1
    return r

print('input = 0x010203')
print('{:024b} {}'.format(0x010203, 0x010203))
print()

print('rbit24()')
r = rbit24(0x010203)
print('{:024b} {}'.format(r, r))
print('rbit24_alt()')
r = rbit24_alt(0x010203)
print('{:024b} {}'.format(r, r))
print('rbit24_bw()')
r = rbit24_bw(0x010203)
print('{:024b} {}'.format(r, r))
print('rbit24_bw2()')
r = rbit24_bw2(0x010203)
print('{:024b} {}'.format(r, r))
print()

import timeit
# perform each option 1 million times
n = 1000000
print('n={}', n)
print('rbit24()')
print(timeit.timeit('rbit24(0x010203)', number=n, globals=globals()))
print('rbit24_alt()')
print(timeit.timeit('rbit24_alt(0x010203)', number=n, globals=globals()))
print('rbit24_bw()')
print(timeit.timeit('rbit24_bw(0x010203)', number=n, globals=globals()))
print('rbit24_bw2()')
print(timeit.timeit('rbit24_bw2(0x010203)', number=n, globals=globals()))
print()

print('some tests...')
for x in [0x010203, 0x000000, 0xffffff, 0x111111, 0x123456, 0x00cafe]:
    assert rbit24(x) == rbit24_alt(x)
print('done tests')

I get the following results:

input = 0x010203
000000010000001000000011 66051

rbit24()
110000000100000010000000 12599424
rbit24_alt()
110000000100000010000000 12599424
rbit24_bw()
110000000100000010000000 12599424
rbit24_bw2()
110000000100000010000000 12599424

n={} 1000000
rbit24()
0.439537909001956
rbit24_alt()
0.37848864999978105
rbit24_bw()
1.84280500100067
rbit24_bw2()
2.2926251130011224

some tests...
done tests

Shows that the alternative rbit24_alt() function is slightly faster than the current approach.

Note that the number of tests with asserts in the above is very limited, but hopefully this will help you in assessing the efficiency of the current approach (which IMO is already pretty good) and the alternative (which is slightly less readable, but a bit faster).

sultanqasim commented 6 months ago

Thanks for looking into this. This is mainly a matter of looping and mathematical or bitwise operations being very slow/inefficent in Python, to the extent that using builtins that convert to strings end up being faster. Thus, there are unfortunately no easy ways to make this fast other than rewriting it in a compiled language like C (which I have no intention of doing since it would just make everything more difficult to set up and install).

While the ‘rbit24’ implementation is not particularly efficient, it’s readable as you say, and calls to it are few enough for it to not really matter. Originally the code was calling ‘rbit24’ multiple times for every single packet received, which I didn’t like and made me feel embarrassed about its inefficiency, so I changed it to calculate the CRC in a bit reversed form, and only perform a single ‘rbit24’ call when writing the packet to a PCAP file. This is “good enough” given the speed of modern computers and the data rate of Bluetooth LE; even an old PC from the 90s should be able to keep up without issue.

I’ll probably just leave ‘rbit24’ as is, since the speed up of your implementation is minor at the expense of making it less readable, and the function is now called infrequently enough for its inefficiency to be fine. Eventually this code will go away, as one of these days I plan to add support for receiving packets with invalid CRCs, and directly sending the CRC received over the air from the firmware to the host. Nonetheless, thanks for taking the time to investigate this and trying out different optimization strategies.

mdxs commented 6 months ago

Agree with your assessment; and like the plan to include the CRC as seen over-the-air to the host (for #53).

nccgroup / Sniffle

Small efficiency improvement in the rbit24() utility function for CRC #79