php / php-src


Function `zend_hash_find_bucket` consumes much time in WordPress workload #12873

Open PeterYang12 opened 9 months ago

PeterYang12 commented 9 months ago

Description

Dear maintainers: I am using PHP-FPM and have profiled performance data for a WordPress workload. Within this benchmark I've identified certain hotspots. The `perf` profile highlights that functions such as zend_hash_find and zend_hash_find_known_hash consume a considerable amount of time. Upon investigation, I suspect that the core cost lies within the zend_hash_find_bucket function. [perf profile screenshot]

Upon inspecting and annotating the assembly code, I noticed that certain 'mov' instructions are causing significant delays. To address this, I attempted to optimize the zend_hash_find_bucket function, experimenting with prefetch instructions and struct alignment. However, these optimizations did not have a discernible impact on performance.

Percent│
       │
       │
       │    Disassembly of section .text:
       │
       │    0000000000532ae0 <zend_hash_find>:
       │    zend_hash_find():
  0.84 │      push %r12
  0.04 │      push %rbp
  0.01 │      mov  %rsi,%rbp
  2.52 │      push %rbx
       │    zend_string_hash_val():
  8.01 │      mov  0x8(%rsi),%rax
       │    zend_hash_find():
  0.02 │      mov  %rdi,%rbx
       │    zend_string_hash_val():
  0.12 │      test %rax,%rax
  0.14 │    ↓ je   78
       │    zend_hash_find_bucket():
  5.63 │13:   or   0xc(%rbx),%eax
  0.28 │      mov  0x10(%rbx),%r12
  0.16 │      cltq
 25.03 │      mov  (%r12,%rax,4),%ebx
  1.98 │      cmp  $0xffffffff,%ebx
  0.56 │    ↓ je   a0
  0.85 │      shl  $0x5,%rbx
  0.50 │      add  %r12,%rbx
 15.20 │      cmp  0x18(%rbx),%rbp
 12.97 │    ↓ jne  68
       │    zend_hash_find():
  1.41 │32:   mov  %rbx,%rax
  0.60 │      pop  %rbx
  0.03 │      pop  %rbp
  1.25 │      pop  %r12
  1.83 │    ← ret
       │      nop
       │    zend_hash_find_bucket():
  0.26 │40:   mov  0x18(%rbx),%rdi
  0.04 │      test %rdi,%rdi
       │    ↓ je   53
       │    zend_string_equal_content():
  0.01 │      mov  0x10(%rbp),%rax
  0.01 │      cmp  %rax,0x10(%rdi)
  5.03 │    ↓ je   90
       │    zend_hash_find_bucket():
  1.06 │53:   mov  0xc(%rbx),%ebx
  0.11 │      cmp  $0xffffffff,%ebx
  0.15 │    ↓ je   a0
  0.33 │      shl  $0x5,%rbx
  0.24 │      add  %r12,%rbx
  1.90 │      cmp  0x18(%rbx),%rbp
  2.69 │    ↑ je   32
  2.00 │68:   mov  0x8(%rbp),%rax
  0.24 │      cmp  %rax,0x10(%rbx)

Is this a known performance issue within the PHP community? I'm seeking suggestions or insights on potential avenues for optimization. Your feedback and suggestions would be greatly appreciated.

PHP Version

master

Operating System

No response

crrodriguez commented 9 months ago

Today the chosen hash algorithm is not good for 64-bit performance either. I think you could start there: switch to XXH3_64bits, ideally letting your toolchain pick the most optimal implementation available, then profile further.

nielsdos commented 9 months ago

I agree that the chosen hashing algorithm could be part of this issue. I.e. the current djb33x algorithm is not the fastest on large data and can also lead to a worse hash bucket distribution than, say, xxhash. IIRC the smhasher repo has a huge comparison list of hash function speed and quality. Also, if the mov for the memory load takes so much time, the issue is probably related to caching too, as you're hinting at. I knew this was hot code for WordPress and also tried some simple optimization tricks in the past, unfortunately to no avail.

crrodriguez commented 9 months ago

I just tried replacing djb33x with XXH3_64bits when ZEND_ENABLE_ZVAL_LONG64 is defined (since then zend_ulong is compatible with XXH64_hash_t). Everything works except that all htmlentities tests fail; something seems to go awry with resolve_named_entity_html. Probably a bug.

nielsdos commented 9 months ago

I had a similar issue when trying to replace djb with fnv1a. Maybe some hash is hardcoded. Did you make a performance measurement?

crrodriguez commented 9 months ago

> I had a similar issue when trying to replace djb with fnv1a. Maybe some hash is hardcoded. Did you make a performance measurement?

Not until the test suites passes. :-) but will try to.

chen-hu-97 commented 9 months ago

> ... I knew this was hot code for wordpress and also tried some simple optimization tricks in the past to no avail unfortunately.

Hi @nielsdos, can you kindly share which methods you have tried? I see two main hotspots in zend_hash_find_bucket:

  1. hotspot 1

    idx = HT_HASH_EX(arData, nIndex);
    25.03 │      mov  (%r12,%rax,4),%ebx

    For this hotspot, it looks like the memory load consumes too much CPU time. I tried some prefetch approaches but saw no gain. I also suspected that the unsigned-to-signed cast may hurt performance and switched to pointer arithmetic, but again no gain.

  2. hotspot 2

    if (EXPECTED(p->key == key)) { /* check for the same interned string */
        return p;
    }
    15.20 │      cmp  0x18(%rbx),%rbp
    12.97 │    ↓ jne  68

    For this hotspot, I am not sure whether macro fusion is working well here?

nielsdos commented 9 months ago

I also tried prefetching and aligning the data to cache line size. And tried a fast path for packed arrays (but that penalized the hash arrays).

PeterYang12 commented 9 months ago

> I also tried prefetching and aligning the data to cache line size. And tried a fast path for packed arrays (but that penalized the hash arrays).

We did the same. We also did data alignment; see this other issue: https://github.com/php/php-src/issues/12872. Did you see any performance gain?

nielsdos commented 9 months ago

I didn't see a noteworthy performance gain in my case, but it can of course be very hardware dependent. I also did a test where the working set fit completely in the cache, and the 2 hotspots you saw are still there. This suggests that even with a near-perfect cache hit rate this code is still hot.

crrodriguez commented 9 months ago

btw..which cpu is this..

> I had a similar issue when trying to replace djb with fnv1a. Maybe some hash is hardcoded. Did you make a performance measurement?

html_table_gen.php relies on the fact the hash function is DJB "times 33" hash.

nielsdos commented 9 months ago

> btw..which cpu is this..

I tested on my i7-4790.

> html_table_gen.php relies on the fact the hash function is DJB "times 33" hash.

Great find! Makes sense.

MaxSem commented 9 months ago

> I didn't see a noteworthy performance gain in my case, but it can of course be very hardware dependent. I did do a test too where the working set fit completely in the cache. I noticed there's still the 2 hotspots you also had. This suggests that even with a near perfect cache hit rate it's still hot.

What happens is a pipeline stall. The CPU fetches instructions and executes them out of order, but in cases where further execution depends on memory reads it has to stop until the read is done. In the assembly above, local variables are stored in registers, so only 2 places cause stalls in the best case:

  • Loading idx from memory, its comparison with -1 and conditional jump
  • Loading p->key from memory, its comparison with key and conditional jump

Unless there is a tricky way to replace 2 conditional jumps that depend on reads with 1, the question might be reformulated as "how to reduce the number of HashMap lookups?"

For unordered hashmaps, I think the best-case scenario might be 1 stall; however, having two types of hashes instead of one sounds like a world of pain.

chen-hu-97 commented 8 months ago

> • Loading idx from memory, its comparison with -1 and conditional jump
> • Loading p->key from memory, its comparison with key and conditional jump

@MaxSem neat! This aligns with what perf reports: the CPU stalls on memory reads. A good hash algorithm distributes keys evenly (scattered), which is not cache-friendly. So do we always suffer more cache misses on hash lookups than in other cases? Software prefetch may mitigate this, but: