google-code-export / lusca-cache

Automatically exported from code.google.com/p/lusca-cache

Performance bottleneck? stmemref() #73

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have Lusca (latest snapshot) running on a server.
This server has a somewhat "specific" configuration required by the environment.
After some time running it becomes slow, which is strange, because the server is
powerful enough: Intel(R) Xeon(R) CPU E5440 @ 2.83GHz. The only note, and a
possible weak point, is that the system is 32-bit; but since I don't run a real
disk cache, that should not matter.
Also, if I restart Lusca it works for 1-2 hours without any problem.

Proxy-Karam-Main ~ # cat /proc/2334/status
Name:   squid
State:  R (running)
Tgid:   2334
Pid:    2334
PPid:   2075
TracerPid:      0
Uid:    101     101     0       101
Gid:    101     101     101     101
FDSize: 65536
Groups: 101
VmPeak:  2041924 kB
VmSize:  2041924 kB
VmLck:         0 kB
VmHWM:   1890132 kB
VmRSS:   1890132 kB
VmData:  2038812 kB
VmStk:        84 kB
VmExe:       936 kB
VmLib:      1984 kB
VmPTE:      3716 kB
...
Cpus_allowed:   03
Cpus_allowed_list:      0-1

Here are more details about the configuration:
Squid Cache: Version LUSCA_HEAD
configure options:  '--prefix=/usr' '--host=i686-pc-linux-gnu'
'--mandir=/usr/share/man' '--infodir=/usr/share/info'
'--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib'
'--sysconfdir=/etc/squid' '--libexecdir=/usr/libexec/squid'
'--localstatedir=/var' '--datadir=/usr/share/squid'
'--enable-removal-policies=lru,heap' '--disable-delay-pools'
'--with-pthreads' '--with-large-files' '--disable-ssl'
'--enable-storeio=coss,aufs,null' '--enable-async-io'
'--disable-linux-netfilter' '--build=i686-pc-linux-gnu'
'build_alias=i686-pc-linux-gnu' 'host_alias=i686-pc-linux-gnu'
'--disable-ident-lookups' '--with-maxfd=131000' '--disable-wccp'
'--disable-wccp2' '--without-snmp' '--enable-htcp' 'CFLAGS=-O3
-mtune=pentium3 -msse -msse2 -mfpmath=sse -fno-delete-null-pointer-checks
-g' '--disable-poll' '--disable-select' '--enable-epoll'

Snippets from the portions of the config that matter:
cache_mem 512 MB
maximum_object_size 128 MB
...
cache_replacement_policy heap GDSF

cache_dir null /tmp

cache_store_log none
...
client_persistent_connections on
server_persistent_connections on
pipeline_prefetch off
half_closed_clients off
#Tried to change this, doesnt matter
tcp_recv_bufsize 4 KB
#Tried to change this, doesnt matter
memory_pools_limit 128 MB
store_avg_object_size 4 KB
store_objects_per_bucket 5
load_check_stopen off

Some of these options are, as far as I know, useless with null storage, but
maybe they matter. If required I can send the whole file by private e-mail, but
I think options such as refresh_pattern don't matter here.

Server load:
16074 sockets opened

17:53:05          PID    %usr %system  %guest    %CPU   CPU  Command
17:53:09         2334   16.83    7.92    0.00   24.75     1  squid
17:53:10         2334   17.00    8.00    0.00   25.00     1  squid
17:53:11         2334   15.00    9.00    0.00   24.00     1  squid

(squid is pinned to a CPU via smp_affinity, so migration will not cause
additional cache misses)

Here is httping run directly from the host where this squid is installed:
Proxy-Karam-Main ~ # httping -g http://194.146.153.2/
PING 194.146.153.2:80 (http://194.146.153.2/):
connected to 194.146.153.2:80, seq=0 time=59.89 ms
connected to 194.146.153.2:80, seq=1 time=59.93 ms
connected to 194.146.153.2:80, seq=2 time=60.55 ms
connected to 194.146.153.2:80, seq=3 time=58.09 ms
connected to 194.146.153.2:80, seq=4 time=61.24 ms
connected to 194.146.153.2:80, seq=5 time=58.03 ms
connected to 194.146.153.2:80, seq=6 time=57.91 ms
connected to 194.146.153.2:80, seq=7 time=58.10 ms
connected to 194.146.153.2:80, seq=8 time=60.46 ms
connected to 194.146.153.2:80, seq=9 time=57.01 ms
connected to 194.146.153.2:80, seq=10 time=61.06 ms
^C--- http://194.146.153.2/ ping statistics ---
11 connects, 11 ok, 0.00% failed
round-trip min/avg/max = 57.0/59.3/61.2 ms

And through the proxy. Right now it is more or less OK, but at some moments the
reply time jumps to 1-3 seconds.

Proxy-Karam-Main ~ # httping -g http://194.146.153.2/ -x 127.0.0.1:8080
Using proxyserver: 127.0.0.1:8080
PING 194.146.153.2:80 (http://194.146.153.2/):
connected to 194.146.153.2:80, seq=0 time=249.94 ms
connected to 194.146.153.2:80, seq=1 time=200.11 ms
connected to 194.146.153.2:80, seq=2 time=88.07 ms
connected to 194.146.153.2:80, seq=3 time=202.67 ms
connected to 194.146.153.2:80, seq=4 time=166.77 ms
connected to 194.146.153.2:80, seq=5 time=67.67 ms
connected to 194.146.153.2:80, seq=6 time=232.63 ms
^C--- http://194.146.153.2/ ping statistics ---
7 connects, 7 ok, 0.00% failed
round-trip min/avg/max = 67.7/172.5/249.9 ms

Here are the stats for the CPU:
Proxy-Karam-Main ~ # mpstat -P 1 1
Linux 2.6.31.5-build-0049 (Proxy-Karam-Main)    11/01/09        _i686_  (8 CPU)

17:56:50     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
17:56:51       1   52.00    0.00   28.00    0.00    0.00   20.00    0.00    0.00    0.00
17:56:52       1   62.50    0.00   16.67    0.00    0.00   12.50    0.00    0.00    8.33
17:56:53       1   52.00    0.00   20.00    0.00    0.00    8.00    0.00    0.00   20.00
17:56:54       1   68.00    0.00   20.00    0.00    0.00   12.00    0.00    0.00    0.00
17:56:55       1   68.00    0.00   20.00    0.00    0.00   12.00    0.00    0.00    0.00
17:56:56       1   66.67    0.00   20.83    0.00    0.00   12.50    0.00    0.00    0.00

(there is no one else on this CPU)
The only thing that worries me is that pidstat gives an inconsistent result: it
shows low CPU usage, while mpstat shows the CPU as "full" (almost completely
taken by the user process).

If I profile the squid process, the top function is:

    14.32%    squid  /usr/sbin/squid            [.] stmemRef
                |
                |--99.95%-- storeClientCopy2
                |          |
                |          |--59.71%-- clientWriteComplete
                |          |          commHandleWrite
                |          |          comm_select
                |          |          main
                |          |          __libc_start_main
                |          |          0x804af01
                |          |
                |           --40.29%-- InvokeHandlers
                |                     |
                |                     |--99.97%-- storeBufferFlush
                |                     |          |
                |                     |          |--99.97%-- httpAppendBody
                |                     |          |          httpReadReply
                |                     |          |          comm_select
                |                     |          |          main
                |                     |          |          __libc_start_main
                |                     |          |          0x804af01
                |                     |           --0.03%-- [...]
                |                      --0.03%-- [...]
                 --0.05%-- [...]

In the annotated stmemRef (on this system I don't have the sources, so I had to
compare against the binary on the host where I compiled it) I get a "hot" spot:

   0.00 :         80d0223:        8b 9b 08 10 00 00            mov    0x1008(%ebx),%ebx
   1.26 :         80d0229:        8b 55 e4                     mov    -0x1c(%ebp),%edx
   0.00 :         80d022c:        01 f0                        add    %esi,%eax
   0.09 :         80d022e:        89 45 e0                     mov    %eax,-0x20(%ebp)
   0.00 :         80d0231:        11 fa                        adc    %edi,%edx
   0.10 :         80d0233:        85 db                        test   %ebx,%ebx
   0.17 :         80d0235:        89 55 e4                     mov    %edx,-0x1c(%ebp)
   0.01 :         80d0238:        74 2e                        je     80d0268 <stmemRef+0x158>
   0.09 :         80d023a:        8b b3 00 10 00 00            mov    0x1000(%ebx),%esi
here>>>> 95.57 :  80d0240:        8b 45 e0                     mov    -0x20(%ebp),%eax
   0.01 :         80d0243:        8b 55 e4                     mov    -0x1c(%ebp),%edx
   0.00 :         80d0246:        89 f7                        mov    %esi,%edi
   0.29 :         80d0248:        c1 ff 1f                     sar    $0x1f,%edi
   0.33 :         80d024b:        01 f0                        add    %esi,%eax
   0.01 :         80d024d:        11 fa                        adc    %edi,%edx
   0.34 :         80d024f:        39 55 d4                     cmp    %edx,-0x2c(%ebp)
   0.17 :         80d0252:        7f cc                        jg     80d0220 <stmemRef+0x110>
   0.24 :         80d0254:        0f 8c 26 ff ff ff            jl     80d0180 <stmemRef+0x70>
   0.10 :         80d025a:        39 45 d0                     cmp    %eax,-0x30(%ebp)
   0.00 :         80d025d:        73 c1                        jae    80d0220 <stmemRef+0x110>
   0.00 :         80d025f:        90                           nop

According to the sources this is:

 80d0233:       85 db                   test   %ebx,%ebx
    debug(19, 3) ("stmemRef: offset %" PRINTF_OFF_T "; initial offset in
memory %d\n", offset, (int) mem->origin_offset);
    if (p == NULL)
        return 0;
    /* Seek our way into store */
    while ((t_off + p->len) <= offset) {
        t_off += p->len;
 80d0235:       89 55 e4                mov    %edx,-0x1c(%ebp)
        if (!p->next) {
 80d0238:       74 2e                   je     80d0268 <stmemRef+0x158>

    debug(19, 3) ("stmemRef: offset %" PRINTF_OFF_T "; initial offset in
memory %d\n", offset, (int) mem->origin_offset);
    if (p == NULL)
        return 0;
    /* Seek our way into store */
    while ((t_off + p->len) <= offset) {
 80d023a:       8b b3 00 10 00 00       mov    0x1000(%ebx),%esi
 80d0240:       8b 45 e0                mov    -0x20(%ebp),%eax
 80d0243:       8b 55 e4                mov    -0x1c(%ebp),%edx
 80d0246:       89 f7                   mov    %esi,%edi
 80d0248:       c1 ff 1f                sar    $0x1f,%edi
 80d024b:       01 f0                   add    %esi,%eax
 80d024d:       11 fa                   adc    %edi,%edx
 80d024f:       39 55 d4                cmp    %edx,-0x2c(%ebp)
 80d0252:       7f cc                   jg     80d0220 <stmemRef+0x110>
 80d0254:       0f 8c 26 ff ff ff       jl     80d0180 <stmemRef+0x70>
 80d025a:       39 45 d0                cmp    %eax,-0x30(%ebp)
 80d025d:       73 c1                   jae    80d0220 <stmemRef+0x110>
 80d025f:       90                      nop
 80d0260:       e9 1b ff ff ff          jmp    80d0180 <stmemRef+0x70>
 80d0265:       8d 76 00                lea    0x0(%esi),%esi
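
For reference, the loop the profile points at is the linear seek over the
in-memory node list. A minimal sketch of that pattern is below, using simplified,
illustrative types rather than the real Lusca structures; every 4 KB node between
the start of the object and the requested offset is touched on every call, so the
cost per call grows with the amount of data held in memory:

#include <stddef.h>

typedef long long squid_off_t;

/* Simplified stand-ins for Lusca's mem_node list; the field
   layout here is illustrative, not the real structure. */
typedef struct mem_node {
    char data[4096];
    size_t len;                 /* bytes used in data[] */
    struct mem_node *next;
} mem_node;

/* Walk the node list from the head until we reach the node that
   contains 'offset'.  This is the loop the profile attributes
   ~95% of stmemRef()'s samples to: every call starts from the
   head, and each step chases a pointer into a different 4 KB
   block, which also matches the cache-miss profile below. */
static const mem_node *
seek_to_offset(const mem_node *head, squid_off_t t_off, squid_off_t offset,
               squid_off_t *node_start)
{
    const mem_node *p = head;

    while (p != NULL && t_off + (squid_off_t) p->len <= offset) {
        t_off += p->len;
        p = p->next;
    }
    *node_start = t_off;
    return p;                   /* NULL if offset is past the in-memory data */
}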

I also tried profiling with other hardware counters, and noticed a match in
cache-misses:

perf record -f -g -p 2334 -c 10000 -e cache-misses
perf report
    31.64%    squid  /usr/sbin/squid            [.] stmemRef
                |
                |--69.71%-- clientWriteComplete
                |          commHandleWrite
                |          comm_select
                |          main
                |          __libc_start_main
                |          0x804af01
                |
                 --30.29%-- InvokeHandlers
                           storeBufferFlush
                           httpAppendBody
                           httpReadReply
                           comm_select
                           main
                           __libc_start_main
                           0x804af01

For example, branch-misses shows a completely different hotspot, in the ACL code.

I also ran latencytop, to make sure squid is not blocked in some system call,
and everything seems fine there:
Process squid (2334)                       Total:  84.1 msec
Scheduler: waiting for cpu                          0.9 msec         90.3 %
Waiting for event (epoll)                           0.9 msec          9.7 %

Maybe this gives some hints for optimizing Lusca?
If you need more information, let me know.

Original issue reported on code.google.com by nuclear...@gmail.com on 1 Nov 2009 at 4:16

GoogleCodeExporter commented 9 years ago
Hm, how big are your objects in memory? That is the key thing.

Lusca and Squid-2 have poor scalability when large objects are cached inside
Lusca/Squid-2 itself. Squid-3 uses a tree instead, which is slower than a list
but makes lookups quicker.
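
As an illustration of the lookup-cost difference only (this is not Squid-3's
actual code): if the node start offsets are also kept in a sorted index, the
node covering a given offset can be found in O(log n) steps instead of the O(n)
list walk shown in the profile above.

#include <stddef.h>

typedef long long squid_off_t;

/* idx[i] is the starting offset of in-memory node i; offsets are
   strictly increasing and n >= 1.  Returns the index of the node
   whose range contains 'offset' (assuming idx[0] <= offset). */
static size_t
node_index_for_offset(const squid_off_t *idx, size_t n, squid_off_t offset)
{
    size_t lo = 0, hi = n;              /* answer lies in [lo, hi) */

    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid] <= offset)
            lo = mid;
        else
            hi = mid;
    }
    return lo;
}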

I know roughly what I have to do to make Lusca perform well in that case, but it
is a question of time. :) (Not much time would be needed; just -some- time..)

Original comment by adrian.c...@gmail.com on 2 Nov 2009 at 12:23

GoogleCodeExporter commented 9 years ago
Well, it will sound strange, but I didn't expect things to be cached on this
server. I didn't block caching intentionally, but I certainly will if it helps.

It's the default, I guess: "maximum_object_size_in_memory 8 KB".
But probably a lot of objects are cached, since 512 MB is a large value. Would
decreasing cache_mem to a much lower value help?

Original comment by nuclear...@gmail.com on 2 Nov 2009 at 12:33

GoogleCodeExporter commented 9 years ago
No, the size of cache_mem won't make a difference, only the object size.

Very interesting though; let me have a think about it and I'll get back to you.

Original comment by adrian.c...@gmail.com on 2 Nov 2009 at 1:31

GoogleCodeExporter commented 9 years ago
If required, I can give you access to this host if you need more info. I can
certainly provide anything else you want; for example, I can try to generate a
core dump, etc.

Just one note: it is "semi-embedded" (busybox+glibc, with almost everything in
an initramfs, just enough to run things from a USB flash drive), so not many
tools are available, but I can compile most of them if needed.

I can solve my personal case by restarting Lusca (I will lose almost nothing,
since I can run a redundant host/port for that moment), but I guess it would be
nice to solve this issue properly, since not everybody can do the same with
their proxy :-)

Original comment by nuclear...@gmail.com on 2 Nov 2009 at 1:50

GoogleCodeExporter commented 9 years ago
int
stmemRef(const mem_hdr * mem, squid_off_t offset, mem_node_ref * r)
{
    mem_node *p = mem->head;
    volatile squid_off_t t_off = mem->origin_offset;

Is volatile really required there? It probably defeats the optimizations in
this part of the code.
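
A rough illustration of the concern, again with simplified, illustrative types
rather than the actual Lusca code: with volatile, the accumulator has to be
stored to and reloaded from its stack slot on every iteration of the hot seek
loop, whereas without the qualifier it can stay in registers for the whole walk.

#include <stddef.h>

typedef long long squid_off_t;
typedef struct mem_node { char data[4096]; size_t len; struct mem_node *next; } mem_node;

/* Same seek loop as above, but with the accumulator declared
   volatile: every 't_off += p->len' now forces a memory store
   and a later reload instead of letting t_off live in a register. */
static squid_off_t
seek_with_volatile(const mem_node *p, squid_off_t start, squid_off_t offset)
{
    volatile squid_off_t t_off = start;

    while (p != NULL && t_off + (squid_off_t) p->len <= offset) {
        t_off += p->len;
        p = p->next;
    }
    return t_off;
}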

Original comment by nuclear...@gmail.com on 11 Nov 2009 at 9:25

GoogleCodeExporter commented 9 years ago
Who knows; I've committed something to disable it. Try just removing volatile,
recompiling, and seeing what it does for performance.

Original comment by adrian.c...@gmail.com on 12 Nov 2009 at 4:18

GoogleCodeExporter commented 9 years ago
It seems the performance gain from changing volatile is minor, but it is not
crashing, and there is maybe no reason to keep it if it is not used.

Another interesting thing I found:

If I have null disk storage,
cache_dir null /tmp

cache_mem 4 MB
maximum_object_size 128 MB

and I don't define "maximum_object_size_in_memory", in the config it is defined
as:
#Default:
# maximum_object_size_in_memory 8 KB

It seems that when Lusca fetches a large object, it keeps the whole object in
memory (even though it cannot be used for caching in any way after it is
retrieved). So after a while, "slow" clients fetching big objects kill the whole
proxy's performance (because the memory that stmemRef walks becomes extremely
large), and VSZ (the process size?) reaches 2 GB.

As soon as I change to "maximum_object_size 1 MB" the situation changes
significantly: the process size is now 351m and performance is MUCH better.
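
For reference, the cache-less setup described in this thread boils down to
something like the following squid.conf fragment (the values are the ones quoted
above; treat them as a starting point, not a recommendation):

cache_dir null /tmp
cache_mem 4 MB
maximum_object_size 1 MB
# Default value; per this thread it limits what may be cached in
# memory, but does not bound how much of an in-flight object is
# held there:
# maximum_object_size_in_memory 8 KB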

Original comment by nuclear...@gmail.com on 12 Nov 2009 at 6:46

GoogleCodeExporter commented 9 years ago
That is a much, much more likely candidate for there being performance issues
with stmemRef().

I'll look into it. Thanks!

Original comment by adrian.c...@gmail.com on 13 Nov 2009 at 1:59

GoogleCodeExporter commented 9 years ago
Aha! Reproduced!

KEY 8AD0F1F875004EE3313322F7E0302E5F
        GET http://192.168.11.2/test.64m
        STORE_PENDING NOT_IN_MEMORY SWAPOUT_NONE PING_DONE   
        CACHABLE,DISPATCHED,VALIDATED
        LV:1265803922 LU:1265803947 LM:1246070610 EX:-1       
        4 locks, 1 clients, 2 refs
        Swap Dir -1, File 0XFFFFFFFF
        inmem_lo: 0
        inmem_hi: 11600487
        swapout: 0 bytes queued
        Client #0, 0x0
                copy_offset: 11583488
                seen_offset: 11583488
                copy_size: 4096
                flags:

Using wget --limit-rate=200k on a <64 MB test file> with the above storage
config, through a localhost proxy, I see Lusca slowly grow its memory usage;
I wonder why it's doing that.

Original comment by adrian.c...@gmail.com on 10 Feb 2010 at 12:14

GoogleCodeExporter commented 9 years ago
Closing this ticket - the initial issue was sorted out.

Original comment by adrian.c...@gmail.com on 10 Feb 2010 at 12:18