PeterYang12 opened this issue 7 months ago
Do you have any benchmarks that show the effect of this optimization in Python?
Hi, I used the tomli_loads benchmark, one of the benchmarks in pyperformance. It shows a ~1.2% perf gain. My SUT is a 4th Gen Intel Xeon and the OS is Ubuntu 22.04.
Hi, you can follow these steps:

```shell
# install pyperformance
python3 -m pip install pyperformance
# run and test
pyperformance run -p "$PYTHON_BINARY" -r -b tomli_loads
```
If it shows a stable difference in macrobenchmarks, it should show a much larger difference (perhaps tens of percent) in microbenchmarks. Could you please create such a microbenchmark to demonstrate this? Something that hashes strings in a tight loop.
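(For concreteness, a microbenchmark of that shape might look something like the sketch below. The string lengths and iteration counts are illustrative, not anything from the attached patch. Note that CPython caches a string's hash inside the object, so a fresh string object must be built on every iteration to actually exercise the hash function.)

```python
import time

def bench_hash(length, iterations=100_000):
    """Time hashing of freshly created strings of a given length.

    A new string object is built each iteration because CPython caches
    the hash in the object after the first hash() call; hashing the
    same object repeatedly would only measure the cache lookup.
    """
    base = "a" * (length - 1)
    start = time.perf_counter()
    for _ in range(iterations):
        # Concatenation at runtime produces a new string object
        # whose hash has not been computed yet.
        hash(base + "b")
    return time.perf_counter() - start

for n in (4, 8, 16, 64, 256):
    print(f"len={n:4d}: {bench_hash(n):.4f}s")
```

The concatenation overhead is included in the timing, so this compares relative changes between builds rather than absolute hash cost.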
Of course. Thank you for your quick review. Let me show a demo after the weekend.
Thank you for your patience. I've attached a microbenchmark I wrote. The patch shows ~30% improvement on my SUT.
Could you show the Python code?
If you are getting rid of Duff's device (which I support), why not get rid of unrolling the polynomial and let the compiler generate SIMD instructions for it?
Thank you for your question. I tried SIMD intrinsics (SSE, AVX2, AVX-512) before, but saw no obvious perf improvement. With AVX-512 it even got worse, due to the CPU frequency drop. I haven't tried GCC options for auto-vectorization. Do you have any good ideas?
Please provide a Python demo to show the effect of this optimization for different examples.
My intuition says that the difference should be insignificant and hardly noticeable at the Python level, especially taking into account that the hash is cached in the string object.
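(The caching is easy to observe: a str object stores its hash after the first hash() call, so repeated calls on the same object never re-run the C hash loop. A small illustration, with timings that will vary by machine:)

```python
import timeit

s = "x" * 1_000_000  # one large string object; hash not yet computed

# The first hash() call runs the hash function over all bytes
# and caches the result inside the string object.
first = timeit.timeit(lambda: hash(s), number=1)

# Subsequent calls just return the cached value.
cached = timeit.timeit(lambda: hash(s), number=1)

print(f"first call:  {first:.6f}s")
print(f"cached call: {cached:.6f}s")  # typically far cheaper than the first
```

So any hash-loop speedup only helps workloads that keep hashing new string objects, such as parsers building many fresh keys.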
I ran the linked PR (#115705) against the full pyperformance suite, and the change is basically below the threshold of noise, except maybe a small improvement on x86_64 Windows. That's not an argument for or against (it still might make sense to merge the PR with a clear microbenchmark result), but it's quite clear this doesn't have an overall impact in general.
Did you define the macro? Like: CFLAGS="-g -DPy_HASH_CUTOFF=7" ../configure
No, I missed that that was required. I will run again.
With this flag, the overall results are still "within the noise"; however, as mentioned above, it does get between 3% and 7% faster on the tomli_loads benchmark. Seems like a worthwhile improvement (but only with Py_HASH_CUTOFF set to 7 by default).
If we're going to accept this as useful, we should figure out how not to require anyone to set that manually at build time. I.e., ask and answer the perennial questions: why does the conditional compilation path even exist? Do we still need it? If so, for what and why?
tomli_loads is too large. We need a simpler example for which the difference is far larger than the noise. Without this, we have no reason to change the code.
Code is always a compromise between readability/maintainability and efficiency. There is also a cost to changing the status quo. Note also that performance depends on the platform, so if we have an example that demonstrates the difference, it should be tested on different platforms (preferably on processors with different architectures, but even a different cache size, generation, or manufacturer can make a difference).
Feature or enhancement
Proposal:
Hello CPython team! 👋 I saw Daniel Lemire's blog post: https://lemire.me/blog/2016/07/21/accelerating-php-hashing-by-unoptimizing-it/ By "unoptimizing" the DJBX33A hash algorithm, it can get a performance gain. The idea has been implemented in the PHP interpreter. Would it be possible to port this optimization to Python as well?
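(For reference, the DJBX33A recurrence itself is tiny; the blog post's "unoptimization" is about replacing a hand-unrolled C loop over it with a plain loop the compiler can vectorize. A pure-Python sketch of the scalar loop, just to illustrate the algorithm, not CPython's default SipHash:)

```python
def djbx33a(data: bytes) -> int:
    """Scalar DJBX33A: h = h * 33 + byte, starting from 5381.

    This is the classic Bernstein string hash. CPython only uses a
    DJBX33A-style hash for short strings when built with a nonzero
    Py_HASH_CUTOFF; longer strings go through SipHash.
    """
    h = 5381
    for b in data:
        # Mask emulates fixed-width unsigned wraparound, as in C.
        h = (h * 33 + b) & 0xFFFFFFFFFFFFFFFF
    return h

print(hex(djbx33a(b"hello")))
```

The multiply-by-33 step has a loop-carried dependency, which is exactly why the unrolled C version resists naive SIMD and why the post argues for letting the compiler handle it.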
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs