kaitai-io / kaitai_struct_python_runtime

Kaitai Struct: runtime for Python
MIT License
93 stars 31 forks source link

XOR benchmarks #8

Open dgladkov opened 8 years ago

dgladkov commented 8 years ago

Followup of discussion in 04289cc13d6501153cab61e875d9c4e2ccdaa5d8

Python 2 bytearray mutations

== run_benchmark_process_xor.py
max = 2147483543
34.8218421936
0.414654970169

Python 3 bytearray mutations

== run_benchmark_process_xor.py
max = 2147483543
33.529802142000335
0.39478560099996685

Python 3 iterating bytes

== run_benchmark_process_xor.py
max = 2147483543
32.42236607199993
0.39769275300022855

Python 2 numpy bitwise_xor

== run_benchmark_process_xor.py
max = 2147483543
27.3349659443
0.424569129944

Python 3 numpy bitwise_xor

== run_benchmark_process_xor.py
max = 2147483543
24.56664780200026
0.4790448710000419

Conclusions:

As numpy contains C extensions and may require compilation, I suggest adding numpy as an optional dependency and falling back to native Python bytearray mutations + xor if numpy is not installed.

GreyCat commented 8 years ago

Great comparison, thanks!

I have yet another humble suggestion - is it possible to submit something like bitwise_xor_cyclic to numpy, and then add here? That would be probably make it work even faster?

There's also this large battle for XOR at SO. I can't estimate how hard is it to include & maintain such code in our library and/or if it's worth that.

dgladkov commented 8 years ago

A lot of answers suggested in that SO post are impractical (writing manual SSE2 instructions, depending on lots of third party libraries), but the general idea is that the only way to gain meaningful performance improvement is to write native code. The best solutions for Python are cffi and Cython. cffi is a better solution for small functions, while Cython is better when you need complex C<->Python interop, so I picked cffi for my test. My current C implementation looks like this (I know only basic C so this might be suboptimal):

void xor_one(const char *data, const unsigned char key, char *result)
{
    unsigned long data_len = strlen(data);
    unsigned long idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key;
    }
}

void xor_many(const char *data, const char *key, char *result)
{
    unsigned long data_len = strlen(data);
    unsigned long key_len = strlen(key);
    unsigned long idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key[idx % key_len];
    }
}

Benchmark results:

Python 2

== run_benchmark_process_xor.py
max = 2144703287
26.3223340511
0.44394993782

Python 3

== run_benchmark_process_xor.py
max = 2144703287
23.45923235600094
0.4029908089996752

Pros:

Cons:

Submitting bitwise_xor_cyclic to numpy doesn't look realistic, their bitwise_xor is is just one of their supported array operations, so it requires arrays of the same size in any case.

In my opinion performance improvements of cffi code are not worth the build complexity it adds.

arekbulski commented 6 years ago

Adding Cython as either dependency or runtime is being debated in https://github.com/kaitai-io/kaitai_struct/issues/311 . If it gets added, reimplementing xor will be just a few-liner.

EDIT: Cython runtime was withdrawn.