vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

vector and tcp sink are unable to recover from killed tcp session, try to tcp retransmit forever #4933

Open user318 opened 4 years ago

user318 commented 4 years ago

Vector Version

vector 0.10.0 (g0f0311a x86_64-unknown-linux-musl 2020-07-22)

Vector Configuration File

[sources.collectd]
type = "socket"
address = "0.0.0.0:2003"
mode = "tcp"

[transforms.update_host]
inputs = ["collectd"]
type = "lua"
version = "2"
hooks.init = "init"
hooks.process = "process"
timers = [ {interval_seconds = 5, handler = "timer_handler"} ]
search_dirs = ["/etc/vector/lua/"]
source = """
require('update_host')
"""

[sinks.out1]
inputs = ["update_host"]
type = "socket"
mode = "tcp"
encoding.codec = "text"
address = "10.0.0.1:2003"
healthcheck = false
buffer.type = "memory"
buffer.max = 10000
buffer.when_full = "drop_newest"

[sinks.out2]
inputs = ["update_host"]
type = "socket"
mode = "tcp"
encoding.codec = "text"
address = "10.0.0.1:2004"
healthcheck = false
buffer.type = "memory"
buffer.max = 10000
buffer.when_full = "drop_newest"

[sinks.out3]
inputs = ["update_host"]
type = "socket"
mode = "tcp"
encoding.codec = "text"
address = "10.0.0.1:2005"
healthcheck = false
buffer.type = "memory"
buffer.max = 10000
buffer.when_full = "drop_newest"
Excerpt of the Lua script loaded via require('update_host'):

function process(event, emit)
        -- Replace the leading host component of the metric line with the mapped name.
        local host = hostmap[event.log.host]
        if host == nil then return end
        event.log.message = event.log.message:gsub("^[^. ]*", host)
        emit(event)
end
...
function timer_handler(emit)
        -- Re-read the host map when its modification timestamp changes
        -- (hostmap, get_ts, filemod and read_map are defined in the elided part).
        local ts = get_ts()
        if ts == filemod then return end
        read_map()
end

Debug Output

I do not have relevant debug output. With debug enabled there are too many logs and they get rotated, and the recent logs contain no traces of the problem.

Expected Behavior

I receive strings over a TCP socket (Graphite metrics specifically, but that does not matter), alter them with a Lua script, and send them out to three TCP sockets. I expect Vector to send data to all three sinks (TCP sockets), or at least to try to reconnect if there are problems.

Actual Behavior

After several days of running, Vector just stops sending to some of the sinks. I do not see an established connection in netstat and do not see any attempts to re-establish it in tcpdump. I have seen this behaviour at least three times already: the first time it stopped sending to out2 and out3, and twice to out2 only. I tried sending USR1, but it did not help.

I did not find anything useful in the logs; Vector just does not report any errors about it. With LOG=debug, the last time it showed writes to port 2003 (out1) and port 2005 (out3) and nothing at all about port 2004 (out2).

ypid-geberit commented 3 years ago

I consider this a serious bug, and seeing that it was reported a year ago makes me wonder about the prioritization of bugs vs. features. Could this be prioritized? It makes the socket sink, which is marked as "stable", unusable (ref: https://vector.dev/docs/reference/configuration/sources/socket/).

I just reproduced it with both the latest stable and nightly:

vector 0.17.0 (x86_64-unknown-linux-gnu 06ef4d7 2021-09-10)
vector 0.16.1 (x86_64-unknown-linux-gnu 739e878 2021-08-26)
  1. Configure the socket sink in TCP mode on host agent, minimal working example (MWE):
---
# vim: foldmarker=[[[,]]]:foldmethod=marker

# SPDX-FileCopyrightText: 2021 Robin Schneider <robin.schneider@geberit.com>
#
# SPDX-License-Identifier: CC0-1.0

data_dir: '/tmp'

sources:

  source_stdin:
    type: 'stdin'

sinks:

  sink_socket:
    type: 'socket'
    inputs: ['source_stdin']
    address: 'logs.example.net:5149'
    mode: 'tcp'
    encoding:
      codec: 'json'
    healthcheck:
      enabled: true
  2. Run a TCP server on host logs, with a stateful host or network firewall (for example iptables) in between. Ncat command: ncat -l -k 5149
  3. Start the Vector agent with the MWE config on host agent: vector --config issue_4933.yaml
  4. Type something into the Vector agent and press Enter. A JSON object should be received and written to STDOUT by ncat.
  5. Kill the existing TCP session in the firewall.
  6. Repeat step 4. Now nothing is received by ncat. I waited 12 minutes.
  7. Restart Vector and repeat step 4. Now ncat receives something again.

There can be many reasons why TCP sessions get killed or dropped like this. Vector should be able to recover from it.

StephenWakely commented 2 years ago

@ypid-geberit I'm struggling to reproduce this issue. Are you able to provide more details, in particular this stage:

  5. Kill the existing TCP session in the firewall.

How exactly are you killing the session?

jszwedko commented 2 years ago

Closing due to inactivity, but if anyone else is able to reproduce this, please re-open with reproduction instructions!

user318 commented 2 years ago

And has anyone of the developers ever tried my example from the initial message?

jszwedko commented 2 years ago

Hi @user318!

We did, but weren't able to reproduce which led to this comment: https://github.com/vectordotdev/vector/issues/4933#issuecomment-1076545727

I think we need more details about how you were configuring the firewall and terminating the session.

ypid-work commented 2 years ago

Sorry for the delay, I switched jobs and it took a while until I could start deploying Vector. Can you reopen this issue and rename it to "vector and tcp sink are unable to recover from killed tcp session, try to tcp retransmit forever"?

How serious is this?/Executive summary

If an organization has (stateful, i.e. normal) firewalls deployed and the connection state table on the firewall is ever cleared, for example by a restart of the firewall (if not clustered) or by an admin flushing the table, then no more logs are received by the Vector server: the Vector agent keeps retransmitting on a TCP session the firewall no longer knows anything about, so the firewall either drops the packets or sends a reject. In either case, the Vector agent fails to recover from this condition.

How to reproduce

Here is how you can reproduce this (step 5) on Linux, with either a Linux router/firewall in between or the rules running on the aggregator/server side. Killing the session on the host where the agent runs is also possible, but blocking outgoing connections is less common.
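One way to kill the tracked session on the host enforcing the firewall rules is with conntrack from conntrack-tools. This is only a sketch of the idea, using port 5515 as in the firewall rules further below, and it assumes conntrack-tools is installed; it is not necessarily the exact procedure from this report:

# Delete the tracked entry for the agent's connection to the log port ...
conntrack -D -p tcp --orig-port-dst 5515
# ... or flush the entire connection tracking table.
conntrack -F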

In Wireshark it looks like this:

(Wireshark screenshot from the original comment.)

Packets 131 up to and including 137 were with the drop rule, 138 to 141 with the reject rule, and 142 was the shutdown of the Vector agent.

Depending on the firewall config, the TCP session could be continued even after its state was killed, at least with rulesets like https://wiki.nftables.org/wiki-nftables/index.php/Simple_ruleset_for_a_server. But a lot of "enterprise" firewalls are configured like the nftables config I posted above: they only accept new sessions that start with a SYN, which Vector never attempts. Therefore it tries to retransmit forever.
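For illustration, a minimal nftables ruleset of the kind described here might look roughly as follows. This is a sketch based on the description and on the iptables rules further below, not the exact ruleset from this report:

table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    # New sessions on the log port are only accepted if they start with a SYN.
    tcp dport 5515 ct state new tcp flags & (fin|syn|rst|ack) == syn accept
  }
}

With such a ruleset, once the state entry is gone, the agent's retransmitted non-SYN segments match neither rule and are dropped.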

I did this test with two Vector instances, vector 0.22.3 (x86_64-unknown-linux-gnu). Vector logs this while it does the TCP retransmits:

2022-07-15T14:08:48.771398Z  WARN sink{component_kind="sink" component_id=sink_socket component_type=vector component_name=sink_socket}:request{request_id=3}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.

I did the same test with Logstash as the agent, using this config:

input { stdin { } }

output {
  tcp {
    host => "logs.example.net"
    port => 5515
  }
}

Logstash has the exact same behavior, but I only waited a few minutes and have not researched this in depth.

Then I got curious whether the Vector source and sink are also affected. They are!

Agent:

---
# vim: foldmarker=[[[,]]]:foldmethod=marker

data_dir: '/tmp'

sources:

  source_stdin:
    type: 'stdin'

sinks:

  sink_socket:
    type: 'vector'
    inputs: ['source_stdin']
    address: 'logs.example.net:5515'

Aggregator/server:

---
# vim: foldmarker=[[[,]]]:foldmethod=marker

data_dir: '/tmp'

sources:

  source_vector:
    type: 'vector'
    address: '0.0.0.0:5515'

sinks:

  sink_stdout:
    type: 'console'
    inputs: ['source_vector']
    encoding:
      codec: 'json'

Same behavior. I would have expected this more complex protocol to be able to recover from this.

Solution

Try a TCP SYN to establish a new connection when running into a timeout.

Note: This is the kind of bug that is normally worked around by a restart. Restarts do help (here) but do not fix the root cause. The root cause is difficult to understand and troubleshoot, but easy to fix once diagnosed. Please fix it :)

Reproducing the firewall rules with legacy iptables (you can ignore this)

I first did this with iptables. Since iptables is legacy, the rules are only included here for reference; iptables is still widely used:

iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp --dport 5515 --syn -m conntrack --ctstate NEW -j ACCEPT
iptables -A INPUT -p tcp --dport 5515 -j DROP
jszwedko commented 2 years ago

Thanks for this detailed reproduction case @ypid-work ! I'll reopen and we can try to dig into this soon.

tobz commented 2 years ago

Alright, I'm going to be a slight party pooper: Vector is doing the best it reasonably can here, and this isn't really a bug. Let me explain why.

As you've pointed out, stateful firewalls, and even common load balancers like IPVS, can either lose their connection tracking tables or have them cleared. This part is very common and I agree that systems must cope with that fact. Ultimately, Vector (or any other application with an established TCP connection) cannot reasonably know if the remote peer is simply taking a long time to respond or if the connection is dead: a RST packet was never sent, so as far as Vector knows, the connection is still established.

The correct solution here (IMO) is to configure TCP keepalives so that the OS starts trying to probe the remote peer when it has been idle long enough, and will eventually detect the peer is dead and forcefully close the connection on the Vector side, triggering Vector to open a new TCP connection. Otherwise, we need to implement sink-specific heartbeats which will often not be possible with things like the socket sink.

We do theoretically have support for configuring TCP keepalives for the socket sink in TCP mode, as well as for version 1 of the vector sink, but I still need to do local testing to make sure that Vector is actually configuring TCP keepalives correctly.
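For anyone who wants to experiment in the meantime, this is roughly what that would look like for the socket sink MWE from earlier in this thread. Treat it as a sketch rather than a verified config: keepalive.time_secs is the option name as documented for the socket sink, and 60 seconds is an arbitrary value, so check both against your Vector version.

sinks:
  sink_socket:
    type: 'socket'
    inputs: ['source_stdin']
    address: 'logs.example.net:5149'
    mode: 'tcp'
    encoding:
      codec: 'json'
    keepalive:
      # Idle time before the OS starts sending TCP keepalive probes on this connection.
      time_secs: 60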

I'll report back shortly with any potential issues that we file as a result of changes/fixes that need to occur to allow using TCP keepalives correctly.

AleksanderDushku commented 4 months ago

Hi @jszwedko @tobz, is there any update for any fixes or workaround for this issue?

jszwedko commented 4 months ago

Hi @jszwedko @tobz, is there any update for any fixes or workaround for this issue?

No updates yet unfortunately.