google-code-export / lusca-cache

Automatically exported from code.google.com/p/lusca-cache

Performance bottleneck? stmemRef() #74

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have Lusca (the latest snapshot) running on a server.
This server has some "specific" configuration required by the environment.
After running for some time it becomes slow. This is strange, because the
server is powerful enough: an Intel(R) Xeon(R) CPU E5440 @ 2.83GHz. The only
caveat and possible weak point is that the system is 32-bit, but since I
don't run a real disk cache, that shouldn't matter.
Also, if I restart Lusca it will work for 1-2 hours without any problem.

Proxy-Karam-Main ~ # cat /proc/2334/status
Name:   squid
State:  R (running)
Tgid:   2334
Pid:    2334
PPid:   2075
TracerPid:      0
Uid:    101     101     0       101
Gid:    101     101     101     101
FDSize: 65536
Groups: 101
VmPeak:  2041924 kB
VmSize:  2041924 kB
VmLck:         0 kB
VmHWM:   1890132 kB
VmRSS:   1890132 kB
VmData:  2038812 kB
VmStk:        84 kB
VmExe:       936 kB
VmLib:      1984 kB
VmPTE:      3716 kB
...
Cpus_allowed:   03
Cpus_allowed_list:      0-1

Here are more details about the configuration:
Squid Cache: Version LUSCA_HEAD
configure options:  '--prefix=/usr' '--host=i686-pc-linux-gnu'
'--mandir=/usr/share/man' '--infodir=/usr/share/info'
'--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib'
'--sysconfdir=/etc/squid' '--libexecdir=/usr/libexec/squid'
'--localstatedir=/var' '--datadir=/usr/share/squid'
'--enable-removal-policies=lru,heap' '--disable-delay-pools'
'--with-pthreads' '--with-large-files' '--disable-ssl'
'--enable-storeio=coss,aufs,null' '--enable-async-io'
'--disable-linux-netfilter' '--build=i686-pc-linux-gnu'
'build_alias=i686-pc-linux-gnu' 'host_alias=i686-pc-linux-gnu'
'--disable-ident-lookups' '--with-maxfd=131000' '--disable-wccp'
'--disable-wccp2' '--without-snmp' '--enable-htcp' 'CFLAGS=-O3
-mtune=pentium3 -msse -msse2 -mfpmath=sse -fno-delete-null-pointer-checks
-g' '--disable-poll' '--disable-select' '--enable-epoll'

Snippets from the portions of the config that matter:
cache_mem 512 MB
maximum_object_size 128 MB
...
cache_replacement_policy heap GDSF

cache_dir null /tmp

cache_store_log none
...
client_persistent_connections on
server_persistent_connections on
pipeline_prefetch off
half_closed_clients off
#Tried to change this, doesn't matter
tcp_recv_bufsize 4 KB
#Tried to change this, doesn't matter
memory_pools_limit 128 MB
store_avg_object_size 4 KB
store_objects_per_bucket 5
load_check_stopen off

Some config options are, as far as I know, useless with my null storage, but
maybe they matter anyway. If required I can send the whole file to a private
e-mail, but I think that, for example, refresh_pattern doesn't matter.

Server load:
16074 sockets opened

17:53:05          PID    %usr %system  %guest    %CPU   CPU  Command
17:53:09         2334   16.83    7.92    0.00   24.75     1  squid
17:53:10         2334   17.00    8.00    0.00   25.00     1  squid
17:53:11         2334   15.00    9.00    0.00   24.00     1  squid

(squid is pinned to a CPU via smp_affinity, so migration will not cause
additional cache misses)

Here is httping run directly from the host where this squid is installed:
Proxy-Karam-Main ~ # httping -g http://194.146.153.2/
PING 194.146.153.2:80 (http://194.146.153.2/):
connected to 194.146.153.2:80, seq=0 time=59.89 ms
connected to 194.146.153.2:80, seq=1 time=59.93 ms
connected to 194.146.153.2:80, seq=2 time=60.55 ms
connected to 194.146.153.2:80, seq=3 time=58.09 ms
connected to 194.146.153.2:80, seq=4 time=61.24 ms
connected to 194.146.153.2:80, seq=5 time=58.03 ms
connected to 194.146.153.2:80, seq=6 time=57.91 ms
connected to 194.146.153.2:80, seq=7 time=58.10 ms
connected to 194.146.153.2:80, seq=8 time=60.46 ms
connected to 194.146.153.2:80, seq=9 time=57.01 ms
connected to 194.146.153.2:80, seq=10 time=61.06 ms
^C--- http://194.146.153.2/ ping statistics ---
11 connects, 11 ok, 0.00% failed
round-trip min/avg/max = 57.0/59.3/61.2 ms

And through the proxy. Right now it is more or less OK, but at some moments
the reply time jumps to 1-3 seconds.

Proxy-Karam-Main ~ # httping -g http://194.146.153.2/ -x 127.0.0.1:8080
Using proxyserver: 127.0.0.1:8080
PING 194.146.153.2:80 (http://194.146.153.2/):
connected to 194.146.153.2:80, seq=0 time=249.94 ms
connected to 194.146.153.2:80, seq=1 time=200.11 ms
connected to 194.146.153.2:80, seq=2 time=88.07 ms
connected to 194.146.153.2:80, seq=3 time=202.67 ms
connected to 194.146.153.2:80, seq=4 time=166.77 ms
connected to 194.146.153.2:80, seq=5 time=67.67 ms
connected to 194.146.153.2:80, seq=6 time=232.63 ms
^C--- http://194.146.153.2/ ping statistics ---
7 connects, 7 ok, 0.00% failed
round-trip min/avg/max = 67.7/172.5/249.9 ms

Here are the CPU stats:
Proxy-Karam-Main ~ # mpstat -P 1 1
Linux 2.6.31.5-build-0049 (Proxy-Karam-Main)    11/01/09        _i686_  (8 CPU)

17:56:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
17:56:51       1   52.00    0.00   28.00    0.00    0.00   20.00    0.00    0.00    0.00
17:56:52       1   62.50    0.00   16.67    0.00    0.00   12.50    0.00    0.00    8.33
17:56:53       1   52.00    0.00   20.00    0.00    0.00    8.00    0.00    0.00   20.00
17:56:54       1   68.00    0.00   20.00    0.00    0.00   12.00    0.00    0.00    0.00
17:56:55       1   68.00    0.00   20.00    0.00    0.00   12.00    0.00    0.00    0.00
17:56:56       1   66.67    0.00   20.83    0.00    0.00   12.50    0.00    0.00    0.00

(there is no one else on this CPU)
The only thing that worries me is that pidstat gives an inconsistent result:
it shows low CPU usage for the process, while mpstat shows the CPU as "full"
(almost completely taken by the user process).

If I profile the squid process, I see this top function:

    14.32%    squid  /usr/sbin/squid            [.] stmemRef
                |
                |--99.95%-- storeClientCopy2
                |          |
                |          |--59.71%-- clientWriteComplete
                |          |          commHandleWrite
                |          |          comm_select
                |          |          main
                |          |          __libc_start_main
                |          |          0x804af01
                |          |
                |           --40.29%-- InvokeHandlers
                |                     |
                |                     |--99.97%-- storeBufferFlush
                |                     |          |
                |                     |          |--99.97%-- httpAppendBody
                |                     |          |          httpReadReply
                |                     |          |          comm_select
                |                     |          |          main
                |                     |          |          __libc_start_main
                |                     |          |          0x804af01
                |                     |           --0.03%-- [...]
                |                      --0.03%-- [...]
                 --0.05%-- [...]

In annotated stmemRef (on this system I don't have the sources, so I had to
compare against the binary on the host where I compiled it) I got a "hot" spot:

    0.00 :  80d0223:  8b 9b 08 10 00 00    mov    0x1008(%ebx),%ebx
    1.26 :  80d0229:  8b 55 e4             mov    -0x1c(%ebp),%edx
    0.00 :  80d022c:  01 f0                add    %esi,%eax
    0.09 :  80d022e:  89 45 e0             mov    %eax,-0x20(%ebp)
    0.00 :  80d0231:  11 fa                adc    %edi,%edx
    0.10 :  80d0233:  85 db                test   %ebx,%ebx
    0.17 :  80d0235:  89 55 e4             mov    %edx,-0x1c(%ebp)
    0.01 :  80d0238:  74 2e                je     80d0268 <stmemRef+0x158>
    0.09 :  80d023a:  8b b3 00 10 00 00    mov    0x1000(%ebx),%esi
here>>>>  95.57 :  80d0240:  8b 45 e0      mov    -0x20(%ebp),%eax
    0.01 :  80d0243:  8b 55 e4             mov    -0x1c(%ebp),%edx
    0.00 :  80d0246:  89 f7                mov    %esi,%edi
    0.29 :  80d0248:  c1 ff 1f             sar    $0x1f,%edi
    0.33 :  80d024b:  01 f0                add    %esi,%eax
    0.01 :  80d024d:  11 fa                adc    %edi,%edx
    0.34 :  80d024f:  39 55 d4             cmp    %edx,-0x2c(%ebp)
    0.17 :  80d0252:  7f cc                jg     80d0220 <stmemRef+0x110>
    0.24 :  80d0254:  0f 8c 26 ff ff ff    jl     80d0180 <stmemRef+0x70>
    0.10 :  80d025a:  39 45 d0             cmp    %eax,-0x30(%ebp)
    0.00 :  80d025d:  73 c1                jae    80d0220 <stmemRef+0x110>
    0.00 :  80d025f:  90                   nop

According to the sources, it is:

 80d0233:       85 db                   test   %ebx,%ebx
    debug(19, 3) ("stmemRef: offset %" PRINTF_OFF_T "; initial offset in
memory %d\n", offset, (int) mem->origin_offset);
    if (p == NULL)
        return 0;
    /* Seek our way into store */
    while ((t_off + p->len) <= offset) {
        t_off += p->len;
 80d0235:       89 55 e4                mov    %edx,-0x1c(%ebp)
        if (!p->next) {
 80d0238:       74 2e                   je     80d0268 <stmemRef+0x158>

    debug(19, 3) ("stmemRef: offset %" PRINTF_OFF_T "; initial offset in
memory %d\n", offset, (int) mem->origin_offset);
    if (p == NULL)
        return 0;
    /* Seek our way into store */
    while ((t_off + p->len) <= offset) {
 80d023a:       8b b3 00 10 00 00       mov    0x1000(%ebx),%esi
 80d0240:       8b 45 e0                mov    -0x20(%ebp),%eax
 80d0243:       8b 55 e4                mov    -0x1c(%ebp),%edx
 80d0246:       89 f7                   mov    %esi,%edi
 80d0248:       c1 ff 1f                sar    $0x1f,%edi
 80d024b:       01 f0                   add    %esi,%eax
 80d024d:       11 fa                   adc    %edi,%edx
 80d024f:       39 55 d4                cmp    %edx,-0x2c(%ebp)
 80d0252:       7f cc                   jg     80d0220 <stmemRef+0x110>
 80d0254:       0f 8c 26 ff ff ff       jl     80d0180 <stmemRef+0x70>
 80d025a:       39 45 d0                cmp    %eax,-0x30(%ebp)
 80d025d:       73 c1                   jae    80d0220 <stmemRef+0x110>
 80d025f:       90                      nop
 80d0260:       e9 1b ff ff ff          jmp    80d0180 <stmemRef+0x70>
 80d0265:       8d 76 00                lea    0x0(%esi),%esi
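
Putting the disassembly and the source together: the hot loop is the linear
seek through the in-memory page list, and the 95% sample lands on the
instruction right after p->len is loaded from a freshly touched node (perf
skid), which is consistent with one cache miss per node. Here is a minimal C
sketch of what this loop is doing, assuming the mem_node layout implied by
the disassembly (data fills the first 0x1000 bytes, len at 0x1000, next near
0x1008) rather than the exact Lusca source:

#include <stddef.h>      /* size_t */
#include <sys/types.h>   /* off_t */

#define SM_PAGE_SIZE 4096

typedef struct mem_node {
    char data[SM_PAGE_SIZE];  /* one page of cached object body */
    size_t len;               /* bytes used in data[] */
    struct mem_node *next;    /* singly linked list, no index over it */
} mem_node;

/* Seek to the node containing 'offset'.  The walk restarts from the
 * head on every call, so each storeClientCopy2() on a large object
 * re-touches every earlier node, and reading p->len pulls one cold
 * cache line per ~4 KB node. */
static mem_node *
stmem_seek(mem_node *head, off_t offset, off_t *t_off_out)
{
    off_t t_off = 0;
    mem_node *p = head;

    while (p != NULL && (t_off + (off_t) p->len) <= offset) {
        t_off += p->len;
        p = p->next;          /* O(offset / SM_PAGE_SIZE) steps */
    }
    *t_off_out = t_off;
    return p;
}

With maximum_object_size at 128 MB, a fully buffered object is about 32,768
such nodes; restarting the walk for every 4 KB copy means on the order of
32,768^2 / 2, roughly 5*10^8, node touches to deliver one object, which would
explain both the CPU burn and the cache-misses profile below.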

I also tried profiling with other hardware counters, and noticed a matching hotspot in cache-misses:

perf record -f -g -p 2334 -c 10000 -e cache-misses
perf report
    31.64%    squid  /usr/sbin/squid            [.] stmemRef
                |
                |--69.71%-- clientWriteComplete
                |          commHandleWrite
                |          comm_select
                |          main
                |          __libc_start_main
                |          0x804af01
                |
                 --30.29%-- InvokeHandlers
                           storeBufferFlush
                           httpAppendBody
                           httpReadReply
                           comm_select
                           main
                           __libc_start_main
                           0x804af01

For example, branch-misses shows a completely different hotspot, in the ACL code.

I also ran latencytop to make sure squid is not blocked in some system call,
and everything seems fine there:
Process squid (2334)                       Total:  84.1 msec
Scheduler: waiting for cpu                          0.9 msec         90.3 %
Waiting for event (epoll)                           0.9 msec          9.7 %

Maybe this gives some hints for optimizing Lusca?
If you need more information, let me know.

Original issue reported on code.google.com by nuclear...@gmail.com on 1 Nov 2009 at 4:17

GoogleCodeExporter commented 9 years ago
That's a good set of profiling you've done there. Yes, the big problem is the
list walking and having to first touch that page. Are you still set up to
profile? We can do some further tweaks and see what happens.

Original comment by adrian.c...@gmail.com on 19 Jan 2010 at 9:07

GoogleCodeExporter commented 9 years ago
Again - the problem here is the large in-memory list used when fetching large
in-memory objects, and the linear walking over it which is going on. I'll
repair that brain damage later. A hypothetical sketch of the shape that
repair could take follows below.
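
This sketch is illustrative only, not the actual patch (stmem_cursor and
stmem_seek_cached are invented names): keep a per-store-client cursor holding
the last node reached and the offset at which it starts, so the common
sequential read resumes in O(1) instead of re-walking the list from the head.
It reuses the mem_node layout sketched above and assumes nodes at or before
the cursor are not freed while a client holds it; otherwise the cursor must
be invalidated.

/* Hypothetical cursor cache over the mem_node list sketched earlier. */
typedef struct stmem_cursor {
    struct mem_node *last_node;  /* node the previous seek ended in */
    off_t last_off;              /* object offset where that node begins */
} stmem_cursor;

static struct mem_node *
stmem_seek_cached(struct mem_node *head, off_t offset,
                  stmem_cursor *cur, off_t *t_off_out)
{
    off_t t_off = 0;
    struct mem_node *p = head;

    /* Resume from the cursor when the request is at or past it. */
    if (cur->last_node != NULL && cur->last_off <= offset) {
        p = cur->last_node;
        t_off = cur->last_off;
    }
    while (p != NULL && (t_off + (off_t) p->len) <= offset) {
        t_off += p->len;
        p = p->next;
    }
    cur->last_node = p;          /* remember where this seek ended */
    cur->last_off = t_off;
    *t_off_out = t_off;
    return p;
}

An offset-indexed structure would achieve the same end (Squid-3, for
instance, keeps its in-memory nodes in a splay tree), but a cursor keeps the
change local to stmem.c.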

Original comment by adrian.c...@gmail.com on 10 Feb 2010 at 12:18