haproxy / haproxy

HAProxy Load Balancer's development branch (mirror of git.haproxy.org)
https://git.haproxy.org/
Other
4.98k stars 799 forks source link

Lua Channel.data may still return empty string #2716

Open vmspike opened 1 month ago

vmspike commented 1 month ago

Detailed Description of the Problem

Based on https://discourse.haproxy.org/t/clarify-channel-data-behavior/10190/2 Related to #1993

Noticed that this function still returns empty string from time to time. Usually (always?) when the channel finally got closed (if explicitly checked by txn.req:may_recv() when empty string returned).

Hope that this function is not the cause of connection closure...

Expected Behavior

Always return either nil or some data (wait for some data if none yet or less than length), not empty string. Or modify documentation accordingly to behavior.

Steps to Reproduce the Behavior

Use a lot of requests with some tcp frontend action like this:

local function f_fe(txn)
    local some_offset = 2
    c = txn.req:data(some_offset)
    if c == nil then
    -- if c == nil or not txn.req:may_recv() then
        core.log(core.info, 'Failed to read from buffer case 1')
        txn:done()
        return
    elseif #c < 3 then
        -- Here strangely 8 00-bytes printed by txn.sc:be2hex when #c is 0
        core.log(core.info, 'Intermediate data: '..txn.sc:be2hex(c,''))

        c = txn.req:data(some_offset,3)
        if c == nil then
        -- if c == nil or not txn.req:may_recv() then
            core.log(core.info, 'Failed to read from buffer case 2')
            txn:done()
            return
        elseif #c < 3 then
            -- Empty string here ocasionally
            core.log(core.info, 'Failed to read from buffer case 3, data: '..txn.sc:be2hex(c,''))
            txn:done()
            return
        end
    end
end
core.register_action('f_fe', { 'tcp-req' }, f_fe, 0)

Originally tested on socks5 protocol handler...

Do you have any idea what may have caused this?

Suspect that might be related to a race after channel_may_recv(chn).

Do you have an idea how to solve the issue?

Update the docs. For user's scripts test for both nil and empty string equally.

What is your configuration?

Debian 12
haproxy 2.8.2 or 2.8.10 from bookworm-backports

Config is way too heavy and private to submit it here.

Output of haproxy -vv

HAProxy version 2.8.10-1~bpo12+1 2024/06/16 - https://haproxy.org/
Status: long-term supported branch - will stop receiving fixes around Q2 2028.
Known bugs: http://www.haproxy.org/bugs/bugs-2.8.10.html
Running on: Linux 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-1 (2023-07-14) x86_64
Build options :
  TARGET  = linux-glibc
  CPU     = generic
  CC      = cc
  CFLAGS  = -O2 -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wall -Wextra -Wundef -Wdeclaration-after-statement -Wfatal-errors -Wtype-limits -Wshift-negative-value -Wshift-overflow=2 -Wduplicated-cond -Wnull-dereference -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wno-cast-function-type -Wno-string-plus-int -Wno-atomic-alignment
  OPTIONS = USE_OPENSSL=1 USE_LUA=1 USE_SLZ=1 USE_SYSTEMD=1 USE_OT=1 USE_QUIC=1 USE_PROMEX=1 USE_PCRE2=1 USE_PCRE2_JIT=1 USE_QUIC_OPENSSL_COMPAT=1
  DEBUG   = -DDEBUG_STRICT -DDEBUG_MEMORY_POOLS

Feature list : -51DEGREES +ACCEPT4 +BACKTRACE -CLOSEFROM +CPU_AFFINITY +CRYPT_H -DEVICEATLAS +DL -ENGINE +EPOLL -EVPORTS +GETADDRINFO -KQUEUE -LIBATOMIC +LIBCRYPT +LINUX_CAP +LINUX_SPLICE +LINUX_TPROXY +LUA +MATH -MEMORY_PROFILING +NETFILTER +NS -OBSOLETE_LINKER +OPENSSL -OPENSSL_WOLFSSL +OT -PCRE +PCRE2 +PCRE2_JIT -PCRE_JIT +POLL +PRCTL -PROCCTL +PROMEX -PTHREAD_EMULATION +QUIC +QUIC_OPENSSL_COMPAT +RT +SHM_OPEN +SLZ +SSL -STATIC_PCRE -STATIC_PCRE2 +SYSTEMD +TFO +THREAD +THREAD_DUMP +TPROXY -WURFL -ZLIB

Default settings :
  bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=12).
Built with OpenSSL version : OpenSSL 3.0.11 19 Sep 2023
Running on OpenSSL version : OpenSSL 3.0.14 4 Jun 2024
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
OpenSSL providers loaded : default
Built with Lua version : Lua 5.3.6
Built with the Prometheus exporter as a service
Built with network namespace support.
Built with OpenTracing support.
Built with libslz for stateless compression.
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE2 version : 10.42 2022-12-11
PCRE2 library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with gcc compiler version 12.2.0

Available polling systems :
      epoll : pref=300,  test result OK
       poll : pref=200,  test result OK
     select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
       quic : mode=HTTP  side=FE     mux=QUIC  flags=HTX|NO_UPG|FRAMED
         h2 : mode=HTTP  side=FE|BE  mux=H2    flags=HTX|HOL_RISK|NO_UPG
       fcgi : mode=HTTP  side=BE     mux=FCGI  flags=HTX|HOL_RISK|NO_UPG
  <default> : mode=HTTP  side=FE|BE  mux=H1    flags=HTX
         h1 : mode=HTTP  side=FE|BE  mux=H1    flags=HTX|NO_UPG
  <default> : mode=TCP   side=FE|BE  mux=PASS  flags=
       none : mode=TCP   side=FE|BE  mux=PASS  flags=NO_UPG

Available services : prometheus-exporter
Available filters :
        [BWLIM] bwlim-in
        [BWLIM] bwlim-out
        [CACHE] cache
        [COMP] compression
        [FCGI] fcgi-app
        [  OT] opentracing
        [SPOE] spoe
        [TRACE] trace

Last Outputs and Backtraces

No response

Additional Information

No response

capflam commented 1 month ago

@vmspike, thanks for report. Could you at least share the part of your config calling your lua script ? But a common error in TCP is to call it from a tcp-request content rule without inspect delay configured.

By reading the code quickly, it seems to do what is documented. I have only one doubt about the offset. If it is smaller than the available data in the channel buffer, an error out of range is returned. It seems a bit strange and rude. But I have no longer the context in mind.

However, at several places, you call txn:done(). When this happens the session is closed. It don't know if it what you want to do.

capflam commented 1 month ago

However, I can confirm there is a way to have an empty string. It happens when offset 0 is set and a length is provided. So txn.req:data(0, 10) for instance.

vmspike commented 1 month ago

Thank you for taking a look to the issue! As for documentation I understood that data() should return either data, or wait for data till some timeout hits (or lua processing itself timed out), or return nill if no data was received during the wait and unable to receive more (e.g. if connection is terminated)

inspect delay is set. txn:done() is desired because such request does not pass certain checks.

This is the basic truncated config:

global
    daemon
    cpu-map 1/1- 0-
    lua-load socks.lua
    tune.lua.maxmem 2048
    tune.lua.session-timeout 4s
    <...>

defaults fe
    mode tcp
    tcp-request inspect-delay 4s
    timeout client 4s
    timeout client-fin 3s
    option tcp-smart-accept
    <...>

# Few hundreds of such frontends and related backends
backend some_tls_wrap_be from be-tls
    server some_srv abns@some_socket send-proxy-v2
backend some_http_be from be-http
    server <...>
backend some_socks_be from be-tcp
    <...>
    tcp-response content lua.socks_be
    server <...>
frontend some_frontend from fe
    bind "ipv4@${SOME_BIND_IP}:${SOME_BIND_PORT}" id 1
    bind abns@some_socket accept-proxy ssl crt "${TLS_CERT}" alpn h2,http/1.1 id 2
    acl TLS req.ssl_hello_type -m int 1
    acl SO_PUBLIC so_id -m int 1
    acl <...>
    tcp-request connection <...>
    tcp-request session <...>
    tcp-request content <...>
    tcp-request content accept if SO_PUBLIC TLS
    tcp-request content <...>
    tcp-request content switch-mode http if HTTP
    tcp-request content lua.socks_fe
    <...>
    http-request <...>
    <...>
    use_backend some_tls_wrap_be if SO_PUBLIC TLS
    use_backend some_http_be if HTTP
    default_backend some_socks_be

It's hybrid tcp/http frontend for http/socks proxies traffic with optional tls-wrap.

As a side note it's a log-format issue for hybrid backends, but it's offtopic .

Channel.data used in different combinations, e.g.: txn.req:data() - ? txn.req:data(0, 3) - ? txn.req:data(3) - may return empty string txn.req:data(3, 10) - the most common case to get empty string as a result but offset seems always related to already present data in buffer (e.g. the end of already handled data).

Also for no-lenth data (usually with offset set) when empty string returned (so #c is 0), txn.sc:be2hex(c,'') strangely returns 8-bytes zeroes:

Intermediate data (of size 0): 0000000000000000

Not sure it's related.

I suspect such empty string cases related mostly to quite distant src location requests (inter continents).

capflam commented 1 month ago

inspect delay is set. txn:done() is desired because such request does not pass certain checks.

Copy that. That was to be sure because it is a common mistake. However, it may be an oversight, but there is no inspect-delay on the response.

Channel.data used in different combinations, e.g.: txn.req:data() - ? txn.req:data(0, 3) - ? txn.req:data(3) - may return empty string txn.req:data(3, 10) - the most common case to get empty string as a result but offset seems always related to already present data in buffer (e.g. the end of already handled data).

For the offset, at first glance, it seems a bit strange. The code is written this way. But I don't know if it is expected or not. I must be careful because these functions may also be called from a lua filter. So, I must check first before breaking other things. But there is indeed a bug about the empty string and I will fix it.

Also for no-lenth data (usually with offset set) when empty string returned (so #c is 0), txn.sc:be2hex(c,'') strangely returns 8-bytes zeroes:

I must check, it seems to be a bug.

vmspike commented 1 month ago

it may be an oversight, but there is no inspect-delay on the response

It's hidden in truncated part! Though it's set, but does not related to this report (I experience it purely on request/frontend part).

capflam commented 1 month ago

So, I have some answers.

About your strange output when you call txn.sc:be2hex(), it happens when you call it with nil and it seems expected. Only integers, booleans and strings are converted from lua to the corresponding internal sample type. All other lua types are interpreted as false internally . Because be2hex expects a binary on input, nil, interpreted as false, is converted to the binary value of 0 on 64 bits,

Then there is probably an issue with the offset usage when Channel.data() is called. As said, I must try to remember why it was done this way. But it seems strange to return an error when some data may still be received and there is not enough data to respect the offset. It means the data length must be checked before, and it seems a bit painful. I suspect it is related to the lua filters.

About the documentation, depending how you read it, the function does what is documented. nil is returned when there is no incoming data at all and no more data can be received anymore. Otherwise length bytes of incoming data are returned, starting at the offset offseŧ, or all available incoming data if the length argument cannot be respected. So, this may be an empty string if there are exactly offset bytes of incoming data. So txn.sc:data(3,10) will return an empty string if only 3 bytes are received. The function does not return nil because there are incoming data.

vmspike commented 1 month ago

About your strange output when you call txn.sc:be2hex(), it happens when you call it with nil and it seems expected.

Think it's different. I've seen such behavior in a place (see the code example) when c is first checked for being nil, then checked for #c <3, then got that output (specifically printed tostring(#c) on the same line which was printed as 0), so c was:

  1. not nil
  2. c == 0

Suspect it's empty string then, if c is nil #c would lead to exception at least. But this happened only for the case when length is not provided, e.g. c = txn.req:data(3), not for c = txn.req:data(3,10).

About the documentation, depending how you read it, the function does what is documented. nil is returned when there is no incoming data at all and no more data can be received anymore. Otherwise length bytes of incoming data are returned, starting at the offset offseŧ, or all available incoming data if the length argument cannot be respected. So, this may be an empty string if there are exactly offset bytes of incoming data. So txn.sc:data(3,10) will return an empty string if only 3 bytes are received. The function does not return nil because there are incoming data.

Clear, thanks! So the function may return nil only for 0 offset, when the channel has no data at all. When it returns empty string (or less than length data) it also means that the channel cannot receive more data anymore (as with nil), otherwise it waits for more data right? Better to clarify this in the docs then.

Think would be more comfortable to return string always then (even empty) instead of nil (if not keeping nil for historic reasons) - less checks for nil, uniformed return datatype.

capflam commented 1 month ago

About txn.sc:be2hex(), I'm pretty sure. I've performed following tests:

      core.log(core.info, 'TEST 1 : '..txn.sc:be2hex(true,'')) -- result: "TEST 1 : 0000000000000001
      core.log(core.info, 'TEST 2 : '..txn.sc:be2hex(false,'')) -- result: "TEST 2 : 0000000000000000 
      core.log(core.info, 'TEST 3 : '..txn.sc:be2hex(nil,''))  -- result: "TEST 3 : 0000000000000000
      core.log(core.info, 'TEST 4 : '..txn.sc:be2hex('',''))  -- result "TEST 4 : " 
      core.log(core.info, 'TEST 5 : '..txn.sc:be2hex('1234','')) -- result "TEST 5 : 1234"

About the documentation, indeed, it may be improved. nil value is returned when there is no data and it is a special value to know nothing was received, at the opposite of the empty string that just mean no data was extracted but something was received. And indeed, it happens when no more data can be received. But It is not the same than no data at all. Because at the end, the data returned may be smaller than expected, I guess you should already have a test on the string length if it is a prerequisite in your script. So an empty string is a smaller string. Not really a special case.

The API is probably not perfect but it cannot easily be modified. Some scripts may rely on the actual behavior. The backward compatibility is really important in that case. From time to time, there are breakage. But it must be truly necessary.

Note, that since the beginning we said nil is returned when no more data can be received. But it may also be because the lua action can no longer wait for more data. For instance because the inspect delay was reached.

I will try to take a look to the offset parameter. I'm just busy on stable releases for now.