skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0

TCP-Lite: crash when running flimflam builtin workload #1301

Open kgiusti opened 11 months ago

kgiusti commented 11 months ago

Running:

    flimflam run -w builtin -p tcp -r skrouterd --cpu-limit 3

consistently results in a SIGSEGV crash on the downstream router (skrouterd-tcp-2.config):

(gdb) bt
#0  __memmove_evex_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:664
#1  0x00007f805221b778 in pn_buffer_append (buf=0x7f80380bc970, bytes=0x7f802c0b61cf 'x' <repeats 200 times>..., size=16353) at /home/kgiusti/work/proton/qpid-proton/c/src/core/buffer.c:154
#2  0x00007f805222d2cc in pn_do_transfer (transport=0x7f802c061510, frame_type=<optimized out>, channel=<optimized out>, payload=...)
    at /home/kgiusti/work/proton/qpid-proton/c/src/core/transport.c:1471
#3  0x00007f8052221c2f in pni_dispatch_action (frame_payload=..., channel=0, frame_type=<optimized out>, lcode=<optimized out>, transport=0x7f802c061510)
    at /home/kgiusti/work/proton/qpid-proton/c/src/core/dispatcher.c:76
#4  pni_dispatch_frame (transport=0x7f802c061510, logger=0x7f802c061510, frame=...) at /home/kgiusti/work/proton/qpid-proton/c/src/core/dispatcher.c:100
#5  pn_dispatcher_input (transport=0x7f802c061510, bytes=0x7f802c0b61b0 "", available=0, batch=true, halt=0x7f802c06168a) at /home/kgiusti/work/proton/qpid-proton/c/src/core/dispatcher.c:117
#6  0x00007f805222de5c in pn_input_read_amqp (transport=0x7f802c061510, layer=<optimized out>, bytes=<optimized out>, available=<optimized out>)
    at /home/kgiusti/work/proton/qpid-proton/c/src/core/transport.c:2617
#7  0x00007f80522283ca in transport_consume (transport=0x7f802c061510) at /home/kgiusti/work/proton/qpid-proton/c/src/core/transport.c:1836
#8  0x00007f805222ebf2 in pn_transport_process (transport=0x7f802c061510, size=<optimized out>) at /home/kgiusti/work/proton/qpid-proton/c/src/core/transport.c:3012
#9  0x00007f8052271563 in pconnection_process (pc=pc@entry=0x7f802c060d90, events=<optimized out>, events@entry=0, sched_ready=sched_ready@entry=false, topup=topup@entry=true)
    at /home/kgiusti/work/proton/qpid-proton/c/src/proactor/epoll.c:1227
#10 0x00007f8052271821 in pconnection_batch_next (batch=0x7f802c060ec8) at /home/kgiusti/work/proton/qpid-proton/c/src/proactor/epoll.c:948
#11 0x00000000004c0068 in thread_run (arg=0xfff7f0) at /home/kgiusti/work/skupper/skupper-router/src/server.c:1150
#12 0x00007f8051aae19d in start_thread (arg=<optimized out>) at pthread_create.c:442
#13 0x00007f8051b2fc60 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

kgiusti commented 11 months ago

See https://issues.apache.org/jira/browse/PROTON-2775 for a description of the pn_buffer_t overflow issue.

I'm not sure how the router can prevent this particular crash, as it appears to be caused by the downstream proton code attempting to buffer too much data off the incoming network connection. But that "too much data" is over a GByte, and the router shouldn't have that much outstanding data in flight for a single connection.

Ideally the upstream router would place a limit on the amount of outstanding data written to a connection. IIUC this can be done by enforcing a reasonable session window, but we'd have to consider head-of-line blocking and the hard limit of 32K sessions per connection when going down that path.
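
To make the session window idea concrete, here is a minimal, self-contained sketch of a sender that caps its in-flight data by a frame-based window. This is not skupper-router or Proton code: the send_window_t type and the send_window_* functions are hypothetical names invented for illustration, and the 1 MiB capacity and 16384-byte frame size used in the note below are assumed values, not settings taken from the router.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side window. Illustration only: this is not a
 * skupper-router or Proton type. */
typedef struct {
    uint32_t window_frames;  /* max frames allowed in flight */
    uint32_t in_flight;      /* frames sent but not yet acknowledged */
    size_t   max_frame_size; /* bytes carried by one full frame */
} send_window_t;

/* Derive the frame window from a byte capacity (assumed tunable). */
static void send_window_init(send_window_t *w, size_t capacity_bytes,
                             size_t max_frame_size)
{
    assert(max_frame_size > 0);
    w->max_frame_size = max_frame_size;
    w->window_frames  = (uint32_t)(capacity_bytes / max_frame_size);
    w->in_flight      = 0;
}

/* Try to send len bytes. Returns the number of bytes accepted, which is
 * less than len (possibly 0) once the window is full. */
static size_t send_window_write(send_window_t *w, size_t len)
{
    size_t accepted = 0;
    while (len > 0 && w->in_flight < w->window_frames) {
        size_t chunk = len < w->max_frame_size ? len : w->max_frame_size;
        w->in_flight++;      /* each chunk occupies one frame slot */
        accepted += chunk;
        len -= chunk;
    }
    return accepted;
}

/* The peer has consumed some frames, which reopens the window. */
static void send_window_ack(send_window_t *w, uint32_t frames)
{
    w->in_flight -= frames > w->in_flight ? w->in_flight : frames;
}

/* Upper bound on the bytes the peer may ever have to buffer. */
static size_t send_window_max_outstanding(const send_window_t *w)
{
    return (size_t)w->window_frames * w->max_frame_size;
}
```

With the assumed 1 MiB capacity and 16384-byte frames, the window is 64 frames, so the peer never has to buffer more than 1 MiB for that session no matter how much the sender has queued. The zero return from send_window_write is also where head-of-line blocking shows up: every stream multiplexed over the same session stalls until the peer acknowledges frames.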