openziti / ziti-tunnel-sdk-c

Apache License 2.0
43 stars 16 forks source link

High CPU usage; epoll_wait is called repeatedly without blocking #722

Closed mrschyte closed 9 months ago

mrschyte commented 1 year ago

I'm trying to stream video through a Ziti service and after a couple of seconds from starting the stream, CPU usage jumps to 100% in the tunneler. I'm using ziti-edge-tunnel v0.22.7 on Linux and I'm able to reliably reproduce the issue.

Attaching strace to the tunneler process shows that most of the time is spent in epoll_wait which is repeatedly called a huge number of times from libuv. It seems that epoll_wait is called on an empty descriptor list with a 0 timeout so it returns immediately. Unfortunately my libuv knowledge is not the best, but I don't think epoll_wait should run in a hot loop without sleeping at all.

Although CPU utilization is high, everything else seems to be working fine and I don't see any errors in the tunneler logs.

$ strace -c -f -p 280524
strace: Process 280524 attached with 5 threads
^Cstrace: Process 280524 detached
strace: Process 280531 detached
strace: Process 280532 detached
strace: Process 280533 detached
strace: Process 280534 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.19    2.386891      477378         5         1 futex
 31.65    1.214673           2    530457           epoll_wait
  3.85    0.147700           9     15403       443 read
  1.89    0.072708           6     11955           write
  0.36    0.013649           8      1706           brk
  0.05    0.001818         303         6           poll
  0.01    0.000321          45         7           socket
  0.00    0.000120          20         6         2 connect

$ strace -p 280524
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, [], 1024, 0)              = 0
epoll_wait(3, ^Cstrace: Process 280524 detached

$ gdb -p 280524
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 280524
[New LWP 280531]
[New LWP 280532]
[New LWP 280533]
[New LWP 280534]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fa408d11666 in epoll_wait () from /usr/lib/libc.so.6
(gdb) bt
#0  0x00007fa408d11666 in epoll_wait () from /usr/lib/libc.so.6
#1  0x000055c9bc48e835 in uv.io_poll ()
#2  0x000055c9bc47bc96 in uv_run ()
#3  0x000055c9bc418265 in run_tunneler_loop (ziti_loop=0x55c9bc7ccba0 <default_loop_struct>)
    at /__w/ziti-tunnel-sdk-c/ziti-tunnel-sdk-c/programs/ziti-edge-tunnel/ziti-edge-tunnel.c:1717
#4  0x000055c9bc418641 in run_tunnel (ziti_loop=ziti_loop@entry=0x55c9bc7ccba0 <default_loop_struct>,
    tun_ip=tun_ip@entry=16793700, dns_ip=dns_ip@entry=33570916, ip_range=<optimized out>,
    dns_upstream=<optimized out>)
    at /__w/ziti-tunnel-sdk-c/ziti-tunnel-sdk-c/programs/ziti-edge-tunnel/ziti-edge-tunnel.c:1582
#5  0x000055c9bc4188e9 in run (argc=<optimized out>, argv=<optimized out>)
    at /__w/ziti-tunnel-sdk-c/ziti-tunnel-sdk-c/programs/ziti-edge-tunnel/ziti-edge-tunnel.c:2031
#6  0x000055c9bc4164ce in main (argc=4, argv=0x7fffd9745ef8)
    at /__w/ziti-tunnel-sdk-c/ziti-tunnel-sdk-c/programs/ziti-edge-tunnel/ziti-edge-tunnel.c:3205
mrschyte commented 1 year ago

The high cpu utilization only seems to be happening if the stream is played with a media client which buffers only up to a certain point. Downloading the stream using e.g. wget doesn't trigger the issue.

mrschyte commented 1 year ago

Adding a small amount of sleep to the on_ziti_data function cuts the CPU usage, however I'm not sure if this is the best way to resolve this.

static ssize_t on_ziti_data(ziti_connection conn, const uint8_t *data, ssize_t len) {
    struct io_ctx_s *io = ziti_conn_data(conn);
    ZITI_LOG(TRACE, "got %zd bytes from ziti", len);
    if (io == NULL) {
        ZITI_LOG(WARN, "null io. underlay connection possibly leaked. ziti_conn[%p] len[%zd]", conn, len);
        ziti_close(conn, NULL);
        return UV_ECONNABORTED;
    }
    ziti_io_context *ziti_io_ctx = io->ziti_io;
    if (len > 0) {
        ssize_t accepted = ziti_tunneler_write(io->tnlr_io, data, len);

        // sleep 0.1s if the client is not accepting data to avoid tight loop on epoll_wait
        if (accepted == 0) {
            usleep(100000);
        }

        if (accepted < 0) {
            ZITI_LOG(ERROR, "failed to write to client");
            ziti_sdk_c_close(io->ziti_io);
        }
        return accepted;
    } else if (len == ZITI_EOF) {
        ZITI_LOG(DEBUG, "ziti connection sent EOF (ziti_eof=%d, tnlr_eof=%d)", ziti_io_ctx->ziti_eof, ziti_io_ctx->tnlr_eof);
        ziti_io_ctx->ziti_eof = true; /* no more data will come from this connection */
        if (ziti_io_ctx->tnlr_eof) /* both sides are done sending now, so close both */ {
            ziti_close(conn, ziti_conn_close_cb);
        } else {
            // this ziti conn can still receive but it will not send any more, so
            // we will not write to the client any more. send FIN to the client.
            // eventually the client will send FIN and the tsdk will call ziti_sdk_c_close_write.
            ziti_tunneler_close_write(io->tnlr_io);
        }
    } else if (len < 0) {
        ZITI_LOG(DEBUG, "ziti connection is closed due to [%zd](%s)", len, ziti_errorstr(len));
        ziti_close(conn, ziti_conn_close_cb);
    }
    return len;
}
ekoby commented 1 year ago

It's a known issue. The problem is the lack of back pressure mechanism in the underlying SDK. It is being worked on and should be available in the tunneler relatively soon.

See https://github.com/openziti/tlsuv/issues/171

mrschyte commented 1 year ago

Thanks, that's great news! I think we should keep this issue open until openziti/tlsuv#171 is fixed since it wasn't obvious that this problem is already tracked.

qrkourier commented 9 months ago

The tlsuv issue was closed. Please reopen if reproducible with latest.