Open PeterYang12 opened 9 months ago
Today the chosen hash algorithm is no good for 64-bit performance either. I think you could start there: switch to XXH3_64bits, ideally make your toolchain pick the most optimal version available, then profile further.
I agree that the chosen hashing algorithm could be a part of this issue. I.e. the current djb33x algorithm is not the fastest on large data and also can lead to a worse hash bucket distribution than say xxhash. I know there is the simhash repo iirc that has a huge comparison list of hash function speed and quality. Also if the mov for memory load takes so much time the issue is probably related to caching too as you're hinting at. I knew this was hot code for wordpress and also tried some simple optimization tricks in the past to no avail unfortunately.
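For readers unfamiliar with the "times 33" family mentioned here, this is a minimal sketch of the additive variant. The function name times33_hash and the 64-bit width are assumptions for illustration; this is not the php-src implementation, which may differ in details such as loop unrolling or post-processing of the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; a plain "times 33" (DJB-style) hash over a byte string.
 * Each step depends on the previous one, so the loop is one serial chain of
 * multiply-adds, which is one reason it scales poorly on long inputs. */
static uint64_t times33_hash(const char *s, size_t len)
{
    assert(s != NULL || len == 0);

    uint64_t h = 5381;                      /* conventional DJB seed */
    for (size_t i = 0; i < len; i++) {
        h = h * 33 + (unsigned char)s[i];   /* hash = hash * 33 + byte */
    }
    return h;
}
```

With this sketch the empty string hashes to the seed 5381, and the one-byte string "a" hashes to 5381 * 33 + 97 = 177670.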
I just tried replacing djb33x with XXH3_64bits when ZEND_ENABLE_ZVAL_LONG64 is defined, since zend_ulong is then compatible with XXH64_hash_t. Everything works, except that all the htmlentities tests fail because something seems to go awry with resolve_named_entity_html. Probably a bug.
I had a similar issue when trying to replace djb with fnv1a. Maybe some hash is hardcoded. Did you make a performance measurement?
Not until the test suite passes. :-) But I will try to.
... I knew this was hot code for wordpress and also tried some simple optimization tricks in the past to no avail unfortunately.
Hi @nielsdos, can you kindly share which methods you have tried? I see two main hotspots in zend_hash_find_bucket:
hotspot 1
idx = HT_HASH_EX(arData, nIndex);
25.03 │ mov (%r12,%rax,4),%ebx
For this hotspot, it looks like the memory load consumes too much CPU time. So I tried some prefetch methods, but saw no gain. I also suspected that the "unsigned to signed cast" might hurt performance and switched to pointer operations, but again saw no gain.
hotspot 2
if (EXPECTED(p->key == key)) { /* check for the same interned string */
return p;
}
15.20 │ cmp 0x18(%rbx),%rbp
12.97 │ ↓ jne 68
For this hotspot, I am not sure if macro fusion works well?
I also tried prefetching and aligning the data to cache line size. And tried a fast path for packed arrays (but that penalized the hash arrays).
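As an illustration of the cache-line alignment idea, here is a minimal sketch that allocates a slot array on a 64-byte boundary with C11 aligned_alloc. The helper name alloc_aligned_slots and the 64-byte line size are assumptions; the actual php-src hash table lays out its hash part and bucket array differently, so this only shows the general technique.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines, common on x86-64 */

/* Hypothetical helper: allocate n uint32_t slots starting on a cache-line
 * boundary. The caller releases the memory with free(). */
static uint32_t *alloc_aligned_slots(size_t n)
{
    assert(n > 0);

    size_t bytes = n * sizeof(uint32_t);
    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
     * so round up to the next full cache line. */
    bytes = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, bytes);
}
```

Alignment only guarantees that the first slot starts a line. A lookup still lands on whichever line the hash selects, which may be why it made little difference in the measurements reported in this thread.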
Same operation. We also did data alignment, see another issue: https://github.com/php/php-src/issues/12872 Did you see any performance gain?
I didn't see a noteworthy performance gain in my case, but it can of course be very hardware dependent. I did do a test too where the working set fit completely in the cache. I noticed there's still the 2 hotspots you also had. This suggests that even with a near perfect cache hit rate it's still hot.
btw..which cpu is this..
I had a similar issue when trying to replace djb with fnv1a. Maybe some hash is hardcoded. Did you make a performance measurement?
html_table_gen.php relies on the fact the hash function is DJB "times 33" hash.
btw..which cpu is this..
I tested on my i7-4790.
html_table_gen.php relies on the fact the hash function is DJB "times 33" hash.
Great find! Makes sense.
I didn't see a noteworthy performance gain in my case, but it can of course be very hardware dependent. I did do a test too where the working set fit completely in the cache. I noticed there's still the 2 hotspots you also had. This suggests that even with a near perfect cache hit rate it's still hot.
What happens is a pipeline stall. The CPU fetches instructions and executes them out of order, but in cases where further execution depends on memory reads it has to stop until the read is done. In the assembly above, local variables are stored in registers so only 2 places cause stalls in the best case:
- Loading idx from memory, its comparison with -1 and conditional jump
- Loading p->key from memory, its comparison with key and conditional jump
Unless there is a tricky way to replace 2 conditional jumps that depend on reads with 1, the question might be reformulated as "how to reduce the number of HashMap lookups?"
For unordered hashmaps, I think best case scenario might be 1 stall, however having two types of hashes instead of one sounds like a world of pain.
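The two dependent loads can be seen in a simplified lookup like the one below. This is a toy table written for illustration: toy_table, toy_bucket, toy_find and INVALID_IDX are hypothetical names, and the layout is not the real php-src one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INVALID_IDX UINT32_MAX   /* plays the role of the -1 sentinel */

typedef struct {
    uint64_t    h;      /* cached hash of the key */
    const char *key;    /* equal pointers mean the same interned string */
    int         value;
    uint32_t    next;   /* next bucket in the collision chain */
} toy_bucket;

typedef struct {
    uint32_t   *slots;    /* hash slot -> index of first bucket */
    toy_bucket *buckets;
    uint32_t    mask;     /* slot count - 1, slot count is a power of two */
} toy_table;

static toy_bucket *toy_find(const toy_table *t, const char *key, uint64_t h)
{
    assert(t != NULL && key != NULL);

    /* Load 1: the bucket index comes from memory; the loop test on it
     * cannot resolve until the load completes. */
    uint32_t idx = t->slots[h & t->mask];
    while (idx != INVALID_IDX) {
        toy_bucket *p = &t->buckets[idx];
        /* Load 2: p->key can only be fetched once idx is known, and the
         * branch below waits on it. */
        if (p->key == key) {
            return p;               /* same interned string */
        }
        if (p->h == h && strcmp(p->key, key) == 0) {
            return p;               /* equal contents, different pointer */
        }
        idx = p->next;
    }
    return NULL;
}
```

Both loads sit on the critical path: the address of the second depends on the result of the first, so out-of-order execution cannot overlap them.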
@MaxSem neat!
This aligns with what perf reports: CPU stalls due to memory reads. A good hash algorithm distributes keys evenly (scattered), which is not cache-friendly. So do we always suffer more cache misses on hash lookups than in other cases?
Software prefetch may mitigate this, but:
zend_hash_find_bucket. However, it is called from so many places that it is hard to add the prefetch (and the address calculation it requires).
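To make the difficulty concrete, here is a minimal sketch of software prefetching in a batched lookup, the one situation where the next address is known early enough. It uses the GCC/Clang builtin __builtin_prefetch; the helper toy_sum_slots and the TOY_PREFETCH macro are hypothetical and not taken from php-src.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch (0) with high temporal locality (3). */
# define TOY_PREFETCH(p) __builtin_prefetch((p), 0, 3)
#else
# define TOY_PREFETCH(p) ((void)(p))   /* no-op on other compilers */
#endif

/* Hypothetical batch helper: reads one slot per precomputed hash and sums
 * the values, prefetching the slot for the next hash one iteration ahead. */
static uint64_t toy_sum_slots(const uint32_t *slots, uint32_t mask,
                              const uint64_t *hashes, size_t n)
{
    assert(slots != NULL && (hashes != NULL || n == 0));

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            /* The address for the next lookup is already computable here. */
            TOY_PREFETCH(&slots[hashes[i + 1] & mask]);
        }
        sum += slots[hashes[i] & mask];
    }
    return sum;
}
```

A single isolated lookup has no such lead time: the hash and the slot address become known only right before the load, so a prefetch at that point has little latency left to hide.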
Description
Dear maintainers: I am using PHP-FPM and have profiled performance data for a WordPress workload. Within this benchmark, I've identified certain hotspots. The analysis from 'perf' shows that functions such as zend_hash_find and zend_hash_find_known_hash consume a considerable amount of time. Upon investigation, I suspect that the core issue lies within the zend_hash_find_bucket function.
Upon inspecting and annotating the assembly code, I noticed that certain 'mov' instructions cause significant delays. To address this, I attempted to optimize the zend_hash_find_bucket function. I experimented with prefetch instructions and struct alignment. However, these optimizations had no discernible impact on performance.
Is this a known performance issue within the PHP community? I'm seeking suggestions or insights on potential avenues for optimization. Your feedback and suggestions would be greatly appreciated.
PHP Version
master
Operating System
No response