Open wrobell opened 4 years ago
string vs bytes, hex vs bin, I know this day will come :smiley:
I will think about this. Perhaps this would be a major change for pyModeS 3.x in the future, since many dependencies may break...
@xoolive, any thought?
Indeed, it would be great to clarify the use of data structures as it would be beneficial at the performance level as well.
We tried some performance improvements with Cython in the speedup branch and saw that choosing the most relevant structures to support was the central question.
The lessons to take from this attempt, imho, are:
There are a lot of projects using Cython (see uvloop or asyncpg for reference). I am not aware of any unsupported architectures. Personally, I use it in a lot of my projects, on an x86_64 laptop and multiple ARM-based Raspberry Pis (A, Zero, 3, 4). IMHO, I would stick to Cython for just certain functions and avoid code duplication.
(however, if you decide to provide two implementations, please let me know. I might have some optimization for cprNL in common.py)
For the moment, I merged only the Cython code for the common package (from the speedup branch). Those are the functions that are called all the time. It is still more than twice as fast as pure Python in the benchmark.
In the current pyModeS version 2.x, even with the optional Cython c_common module, all the inputs are still strings.
provide a vanilla Python API to the core decoding functions supporting bytes, str, hex, binascii, just anything... no questions asked.
I think this might be the most logical way to move forward. Maybe using a decorator function to check the inputs? I am not sure how much slower this is going to be...
On compatibility and supported inputs: looking at the API, the first parameter of every function is a message. Therefore, singledispatch could be used to overload on the first parameter, i.e.
```python
In [1]: from binascii import unhexlify

In [2]: from functools import singledispatch

In [3]: @singledispatch
   ...: def decode(msg):
   ...:     raise ValueError('Message type {} not supported'.format(type(msg)))
   ...:

In [4]: @decode.register(bytes)
   ...: @decode.register(bytearray)
   ...: def _decode_bin(msg):
   ...:     print('decoding type {}'.format(type(msg)))
   ...:

In [5]: @decode.register(str)
   ...: def _decode_str(msg):
   ...:     decode(unhexlify(msg))
   ...:

In [6]: decode(b'abcd')
decoding type <class 'bytes'>

In [7]: decode('abcd')  # will call _decode_str, then _decode_bin
decoding type <class 'bytes'>

In [8]: decode(bytearray(b'abcd'))
decoding type <class 'bytearray'>
```
This should allow backward compatibility as well.
Indeed, singledispatch would be a reasonable way to address this, though I guess the biggest chunk of the work would be to clarify the types in the core decoding functions.
Also, I believe it is always reasonable to keep a pure Python implementation as a fallback option. Even if it is always possible to compile the whole package for any architecture/platform/Python version, a person in charge of an open source side project (i.e. with a real day job and not so much time to devote to support) may feel iffy about opening Pandora's box and trying to provide a compiled version of the library for every possible combination.
Even now, the API can support both hex strings and bytes. Please see https://github.com/junzis/pyModeS/pull/67.
If possible, I would like to restart the discussion about this topic.
Another argument in favour of binary message processing:
```python
# ensure the optimized version of the `df` function is imported
from pyModeS.c_common import df as pymodes_df
from binascii import unhexlify

def df_bin(msg):
    return min(24, (msg[0] & 0xf8) >> 3)

# string and binary versions of the same message
msg = '904ca3a33219741c85465ae1b4c3'
msg_b = unhexlify(msg)
```

```
%timeit pymodes_df(msg)
718 ns ± 5.98 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit df_bin(msg_b)
202 ns ± 2.79 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
Over 3 times faster.
I am approaching 1 billion messages in my database. This would be almost 12 minutes vs 3.3 minutes to extract downlink format information.
@wrobell don't give up! I agree too... 🤣
This has to be added as a decorator to all current functions if we are going down this path.
I wonder whether the unhexlify call would slow down the execution?
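If we go the decorator route, a minimal sketch could look like this (the `accept_hex` name and the standalone `df` function below are my own illustration, not pyModeS API); the per-call overhead is essentially one `isinstance` check plus, for string inputs, one `unhexlify`:

```python
from binascii import unhexlify
from functools import wraps

def accept_hex(func):
    # Hypothetical decorator: let a bytes-native decoder also accept
    # hex strings by converting them with unhexlify first.
    @wraps(func)
    def wrapper(msg, *args, **kwargs):
        if isinstance(msg, str):
            msg = unhexlify(msg)
        return func(msg, *args, **kwargs)
    return wrapper

@accept_hex
def df(msg):
    # downlink format: top five bits of the first byte, capped at 24
    return min(24, (msg[0] & 0xF8) >> 3)

# both spellings of the same message now work
df('904ca3a33219741c85465ae1b4c3')             # hex string
df(unhexlify('904ca3a33219741c85465ae1b4c3'))  # raw bytes
```

Measuring the wrapper with `%timeit` against the bare bytes-native function would answer the overhead question directly.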
Another comment:
min(24, (msg[0] & 0xf8) >> 3)
is very hard to understand for people not familiar with bitwise operations.
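For readability, the expression could be unpacked into named steps with comments (a sketch of the same logic, not current pyModeS code; the cap at 24 mirrors pyModeS's behaviour of reporting all values above 24 as DF 24):

```python
def df(msg: bytes) -> int:
    """Extract the downlink format (DF) from a raw Mode S message."""
    first = msg[0]               # first byte of the message, e.g. 0x8D
    top5 = (first & 0xF8) >> 3   # keep the 5 most significant bits, shift them down
    # values above 24 are not separate formats, so they are reported as DF 24
    return min(24, top5)

df(bytes.fromhex('8d4840d6202cc371c32ce0576098'))  # 17 (ADS-B extended squitter)
```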
This also means we have to change almost every function in the decoder. That's a very large project :fearful:
Thanks to function overloading we could always create optimized versions for different types of inputs - strings, bytes... and arrays of these, see below. Indeed, there is a lot of work, but we can always fall back on the slower implementation until a new version is provided. The real challenge is to design a nice API.
I was going through the pyModeS code and realized that the most efficient implementation should use numpy anyway. pyModeS depends on the library but does not utilize its potential, IMHO.
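As an illustration of the idea (my own sketch, not the VModeS or pyModeS API): storing N raw 14-byte messages as an `(N, 14)` uint8 array lets NumPy extract the downlink format of the whole batch in a single vectorised expression, instead of one Python call per message:

```python
import numpy as np

def df_vec(msgs: np.ndarray) -> np.ndarray:
    # msgs is an (N, 14) uint8 array of raw 112-bit Mode S messages;
    # take the top 5 bits of each first byte, capped at 24, for all rows at once
    return np.minimum(24, (msgs[:, 0] & 0xF8) >> 3)

raw = bytes.fromhex('8d4840d6202cc371c32ce0576098') \
    + bytes.fromhex('904ca3a33219741c85465ae1b4c3')
msgs = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 14)

df_vec(msgs)  # -> [17, 18]
```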
I did some research, which led me to implement this:
https://gitlab.com/wrobell/vmodes/-/blob/master/vmodes/parser.py
The project includes a simplistic performance test script:
```
$ PYTHONPATH=. ./scripts/perf-test
processing 6.0 million messages
vmodes: df: 0.02 sec, 2.66 ns per message
vmodes: typecode: 0.04 sec, 7.03 ns per message
vmodes: icao: 0.22 sec, 36.71 ns per message
vmodes: category: 0.04 sec, 5.93 ns per message
vmodes: callsign: 2.20 sec, 365.95 ns per message
vmodes total: 2.51 sec, 418.33 ns per message
pyModeS: df: 4.52 sec, 752.87 ns per message
pyModeS: typecode: 7.83 sec, 1305.10 ns per message
pyModeS: icao: 11.42 sec, 1903.81 ns per message
pyModeS: category: 8.81 sec, 1468.20 ns per message
pyModeS: callsign: 15.47 sec, 2578.46 ns per message
pyModeS total: 48.05 sec, 8008.47 ns per message
```
The difference in processing real-life data is not as impressive, but VModeS is still 13 times faster (I will add the appropriate scripts to the above repo later).
I will close this ticket now. I am planning to explore ideas for high performance processing via VModeS project. If, at any stage, you decide that improving performance of pyModeS is an option, please let me know and I will try to contribute back (IMHO #67 is a starting point).
@wrobell, the performance looks quite promising. Would you be interested to work on a new pyModeS branch directly? This way, @xoolive and I can also help with integrating it with the rest of the code.
I will try to create an initial pull request in a few days.
I am still exploring a few options around NumPy's API.
When I receive an ADS-B message from dump1090, I convert it to binary form with `binascii.unhexlify`. This is much more efficient to store than saving the string version of the message.

To decode the messages, I read them from a database in binary form, convert them to strings using `binascii.hexlify`, and then use pyModeS. But, internally, pyModeS converts a message to binary form again (one way or another). It would be great if the pyModeS API allowed passing the binary version of a message.

Taking a quick look at the API, it seems that single dispatch with separate type registration for `str` and `bytes` could do the trick?