XOR benchmarks - Githubissues

dgladkov commented 8 years ago

Followup of discussion in 04289cc13d6501153cab61e875d9c4e2ccdaa5d8

Python 2 bytearray mutations

== run_benchmark_process_xor.py
max = 2147483543
34.8218421936
0.414654970169

Python 3 bytearray mutations

== run_benchmark_process_xor.py
max = 2147483543
33.529802142000335
0.39478560099996685

Python 3 iterating bytes

== run_benchmark_process_xor.py
max = 2147483543
32.42236607199993
0.39769275300022855

Python 2 numpy bitwise_xor

== run_benchmark_process_xor.py
max = 2147483543
27.3349659443
0.424569129944

Python 3 numpy bitwise_xor

== run_benchmark_process_xor.py
max = 2147483543
24.56664780200026
0.4790448710000419

Conclusions:

On real world example difference between bytearray mutations and map over bytes for Python 3 is not worth maintaining the separate code path.
numpy's bitwise_xor gives significant performance increase, however it requires manual key wrapping as bitwise_xor expects arrays of the same size

As numpy contains C extensions and may require compilation, I suggest adding numpy as an optional dependency and falling back to native Python bytearray mutations + xor if numpy is not installed.

GreyCat commented 8 years ago

Great comparison, thanks!

I have yet another humble suggestion - is it possible to submit something like bitwise_xor_cyclic to numpy, and then add here? That would be probably make it work even faster?

There's also this large battle for XOR at SO. I can't estimate how hard is it to include & maintain such code in our library and/or if it's worth that.

dgladkov commented 8 years ago

A lot of answers suggested in that SO post are impractical (writing manual SSE2 instructions, depending on lots of third party libraries), but the general idea is that the only way to gain meaningful performance improvement is to write native code. The best solutions for Python are cffi and Cython. cffi is a better solution for small functions, while Cython is better when you need complex C<->Python interop, so I picked cffi for my test. My current C implementation looks like this (I know only basic C so this might be suboptimal):

void xor_one(const char *data, const unsigned char key, char *result)
{
    unsigned long data_len = strlen(data);
    unsigned long idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key;
    }
}

void xor_many(const char *data, const char *key, char *result)
{
    unsigned long data_len = strlen(data);
    unsigned long key_len = strlen(key);
    unsigned long idx;
    for (idx = 0; idx < data_len; idx++)
    {
        result[idx] = data[idx] ^ key[idx % key_len];
    }
}

Benchmark results:

Python 2

== run_benchmark_process_xor.py
max = 2144703287
26.3223340511
0.44394993782

Python 3

== run_benchmark_process_xor.py
max = 2144703287
23.45923235600094
0.4029908089996752

Pros:

User doesn't need to install full numpy package
Slightly faster than numpy

Cons:

Needs compilation and binary distributions for different platforms and architectures

Submitting bitwise_xor_cyclic to numpy doesn't look realistic, their bitwise_xor is is just one of their supported array operations, so it requires arrays of the same size in any case.

In my opinion performance improvements of cffi code are not worth the build complexity it adds.

arekbulski commented 6 years ago

Adding Cython as either dependency or runtime is being debated in https://github.com/kaitai-io/kaitai_struct/issues/311 . If it gets added, reimplementing xor will be just a few-liner.

EDIT: Cython runtime was withdrawn.

kaitai-io / kaitai_struct_python_runtime

XOR benchmarks #8