python / cpython

The Python programming language
https://www.python.org

Accelerate DJBX33A hash algorithm by "unoptimizing" it #115704

Open · PeterYang12 opened this issue 7 months ago

PeterYang12 commented 7 months ago

Feature or enhancement

Proposal:

Hello CPython team! 👋 I saw Daniel Lemire's blog post: https://lemire.me/blog/2016/07/21/accelerating-php-hashing-by-unoptimizing-it/. By "unoptimizing" the DJBX33A hash algorithm, it can gain some performance. The idea has already been implemented in the PHP interpreter. Would it be possible to port this optimization to Python as well?
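For context, DJBX33A is Bernstein's classic multiplicative hash: start from 5381 and fold in one byte per step with `h = h*33 + byte`. The "unoptimization" in Lemire's post is to drop the hand-unrolled Duff's device in favor of a plain loop, which compilers and superscalar hardware handle better; the post also notes the recurrence can be stepped two bytes at a time (`h = 33*33*h + 33*c1 + c2`) to shorten the dependency chain. A minimal Python sketch of the core recurrence, for illustration only (CPython's real implementation is C code in Python/pyhash.c, and its variant also mixes in the string length and a randomized hash secret):

```python
MASK64 = (1 << 64) - 1  # emulate the 64-bit wraparound of the C code

def djbx33a(data: bytes) -> int:
    """Plain DJBX33A core: h = h*33 + byte, seeded with 5381."""
    h = 5381
    for b in data:
        h = ((h << 5) + h + b) & MASK64  # h*33 + b, as the C code spells it
    return h

print(hex(djbx33a(b"hello")))
```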

Has this already been discussed elsewhere?

No response

Links to previous discussion of this feature:

No response

Linked PRs

#115705

serhiy-storchaka commented 7 months ago

Do you have any benchmarks that show the effect of this optimization in Python?

PeterYang12 commented 7 months ago

> Do you have any benchmarks that show the effect of this optimization in Python?

Hi, I used tomli_loads, one of the benchmarks in pyperformance. It shows a ~1.2% perf gain. My SUT (system under test) is a 4th Gen Intel Xeon running Ubuntu 22.04.

PeterYang12 commented 7 months ago

> Do you have any benchmarks that show the effect of this optimization in Python?

Hi, you can reproduce it with the following steps:

```shell
# install pyperformance
python3 -m pip install pyperformance

# run the tomli_loads benchmark
pyperformance run -p "$PYTHON_BINARY" -r -b tomli_loads
```

serhiy-storchaka commented 7 months ago

If it shows a stable difference in macrobenchmarks, it should show a much larger difference (perhaps tens of percent) in microbenchmarks. Could you please create such a microbenchmark to demonstrate this? Something that hashes strings in a tight loop.
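A minimal sketch of the kind of microbenchmark being asked for, assuming a build where short strings actually take the DJBX33A path (i.e. configured with `-DPy_HASH_CUTOFF=7`, see below). Fresh, non-interned strings are built for every run so the hash cached in each str object does not hide the work; note that the interpreter-level loop overhead will still dilute any difference:

```python
import time

def bench_hash(n=1_000_000, length=6):
    # fresh string objects every call, so each hash() actually computes
    strings = ["%0*d" % (length, i) for i in range(n)]
    t0 = time.perf_counter()
    for s in strings:
        hash(s)
    return time.perf_counter() - t0

# run on a patched and an unpatched interpreter on the same machine
best = min(bench_hash() for _ in range(5))
print(f"{best:.3f}s for 1M short-string hashes")
```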

PeterYang12 commented 7 months ago

> If it shows a stable difference in macrobenchmarks, it should show a much larger difference (perhaps tens of percent) in microbenchmarks. Could you please create such a microbenchmark to demonstrate this? Something that hashes strings in a tight loop.

Of course. Thank you for the quick review. I'll put together a demo after the weekend.

PeterYang12 commented 7 months ago

Thank you for your patience. I've attached a microbenchmark I wrote myself. With the patch it shows a ~30% improvement on my SUT.

djbx33a_optimization.zip

serhiy-storchaka commented 7 months ago

Could you show the Python code?

jxu commented 1 month ago

If you are getting rid of Duff's device (which I support), why not get rid of unrolling the polynomial and let the compiler generate SIMD instructions for it?

PeterYang12 commented 1 month ago

> If you are getting rid of Duff's device (which I support), why not get rid of unrolling the polynomial and let the compiler generate SIMD instructions for it?

Thank you for your question. I tried SIMD intrinsics (SSE, AVX2, AVX-512) before, but saw no obvious perf improvement. AVX-512 actually made it worse, due to the CPU frequency drop it triggers. I haven't tried letting GCC auto-vectorize it, though. Do you have any suggestions?

serhiy-storchaka commented 1 month ago

Please provide a Python demo to show the effect of this optimization for different examples.

My intuition says that the difference should be insignificant and hardly noticeable at the Python level, especially taking into account that the hash is cached in the string object.
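The caching Serhiy mentions is easy to observe from Python: a str object stores its hash after the first call, so repeated `hash()` on the same object is far cheaper than hashing fresh strings. A rough sketch (the fresh case also pays string-creation overhead, so treat the numbers as directional only):

```python
import timeit

# same object every call: the hash is computed once, then read from the cache
cached = timeit.timeit("hash(s)", setup="s = 'abcdef'", number=1_000_000)

# a fresh, non-interned string every call: the hash is recomputed each time
fresh = timeit.timeit("hash('%06d' % i); i += 1", setup="i = 0",
                      number=1_000_000)

print(f"cached: {cached:.3f}s  fresh: {fresh:.3f}s")
```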

mdboom commented 4 weeks ago

I ran the linked PR (#115705) against the full pyperformance suite, and the change is basically below the threshold of noise, except maybe a small improvement on x86_64 Windows. That's not an argument for or against (it still might make sense to merge the PR given a clear microbenchmark result), but it's quite clear it doesn't have a measurable impact overall.

PeterYang12 commented 4 weeks ago

> I ran the linked PR (#115705) against the full pyperformance suite, and the change is basically below the threshold of noise, except maybe a small improvement on x86_64 Windows. That's not an argument for or against (it still might make sense to merge the PR given a clear microbenchmark result), but it's quite clear it doesn't have a measurable impact overall.

Did you turn on the macro? Like: `CFLAGS="-g -DPy_HASH_CUTOFF=7" ../configure`
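As an aside, one way to check that a rebuilt interpreter actually picked up the flag is `sys.hash_info`, which exposes both the string-hash algorithm and the cutoff (if I read Python/pyhash.c correctly, only strings shorter than the cutoff take the DJBX33A path, and the default build has a cutoff of 0, i.e. disabled):

```python
import sys

info = sys.hash_info
print(info.algorithm)  # e.g. 'siphash13' for the default string hash
print(info.cutoff)     # 7 when built with -DPy_HASH_CUTOFF=7, else 0
```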

mdboom commented 4 weeks ago

> Did you turn on the macro? Like: `CFLAGS="-g -DPy_HASH_CUTOFF=7" ../configure`

No, I missed that it was required. I will run again.

mdboom commented 3 weeks ago

With this flag, the overall results are still within the noise; however, it does (as mentioned above) get between 3% and 7% faster on the tomli_loads benchmark. That seems like a worthwhile improvement (but only if Py_HASH_CUTOFF is set to 7 by default).

gpshead commented 3 weeks ago

If we're going to accept this as useful, we should figure out how to avoid requiring anyone to set that manually at build time. That is, ask and answer the pertinent questions: why does the conditional compilation path even exist? Do we still need it? If so, for what and why?

serhiy-storchaka commented 3 weeks ago

tomli_loads is too large. We need a simpler example for which the difference is far larger than the noise. Without that, we have no reason to change the code.

Code is always a compromise between readability/maintainability and efficiency, and there is also a cost to changing the status quo. Note also that performance depends on the platform, so if we have an example that demonstrates the difference, it should be tested on different platforms (preferably on processors with different architectures, but even a different cache size, generation, or manufacturer can make a difference).
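For anyone posting numbers in this thread, a small standard-library snippet to record the platform details asked for above alongside each result:

```python
import platform
import sys

print(sys.version)
print(platform.platform())
print(platform.machine(), platform.processor())
print(sys.hash_info.algorithm, sys.hash_info.cutoff)
```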