user318 opened this issue 4 years ago
I consider this a serious bug, and seeing that this was reported a year ago makes me wonder about the prioritization of bugs vs. features. Could this be prioritized? It makes the socket sink, which is marked as "stable", unusable (ref: https://vector.dev/docs/reference/configuration/sources/socket/).
I just reproduced it with both the latest stable and nightly:
vector 0.17.0 (x86_64-unknown-linux-gnu 06ef4d7 2021-09-10)
vector 0.16.1 (x86_64-unknown-linux-gnu 739e878 2021-08-26)
Agent, minimal working example (MWE):
---
# vim: foldmarker=[[[,]]]:foldmethod=marker
# SPDX-FileCopyrightText: 2021 Robin Schneider <robin.schneider@geberit.com>
#
# SPDX-License-Identifier: CC0-1.0
data_dir: '/tmp'
sources:
  source_stdin:
    type: 'stdin'
sinks:
  sink_socket:
    type: 'socket'
    inputs: ['source_stdin']
    address: 'logs.example.net:5149'
    mode: 'tcp'
    encoding:
      codec: 'json'
    healthcheck:
      enabled: true
Send some logs with a host or network stateful firewall in between, for example iptables and this Ncat command on the receiving side: ncat -l -k 5149
Run the agent: vector --config issue_4933.yaml
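(If you want to keep the receiving side in Vector instead of Ncat, a minimal aggregator config like the following should also work. This is just a sketch; the listen address is an assumption.)
---
data_dir: '/tmp'
sources:
  source_socket:
    type: 'socket'           # TCP listener standing in for ncat -l -k 5149
    address: '0.0.0.0:5149'
    mode: 'tcp'
sinks:
  sink_stdout:
    type: 'console'          # print received events so you can see when they stop arriving
    inputs: ['source_socket']
    encoding:
      codec: 'json'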
There can be many reasons why TCP sessions might get killed/dropped like that. Vector should be able to recover from that.
@ypid-geberit I'm struggling to reproduce this issue. Are you able to provide more details, in particular this stage:
- Kill the existing TCP session in the Firewall.
How exactly are you killing the session?
Closing due to inactivity, but if anyone else is able to reproduce this, please re-open with reproduction instructions!
And have any of the developers ever tried my example from the initial message?
Hi @user318!
We did, but weren't able to reproduce which led to this comment: https://github.com/vectordotdev/vector/issues/4933#issuecomment-1076545727
I think we need more details about how you were configuring the firewall and terminating the session.
Sorry for the delay, I switched jobs and it took a while until I could start deploying Vector. Can you reopen and rename this to "vector and tcp sink are unable to recover from killed tcp session, try to tcp retransmit forever"?
If an organization has (stateful/normal) firewalls deployed and the connection state table on the firewall is ever cleared, for example by a restart of the firewall (if not clustered) or by an admin flushing the table, then no more logs will be received by the Vector server. The Vector agent just keeps retransmitting on a TCP session the firewall no longer knows anything about, so the firewall either drops the packet or sends a reject. In either case, the Vector agent fails to recover from this condition.
Here is how you can reproduce this (step 5.) on Linux, with either a Linux router/firewall in between or nftables running on the aggregator/server side. Killing the session on the host where the agent runs is also possible, but blocking outgoing connections is less common.
Save the following as nft.conf
#!/usr/sbin/nft -f
table inet vector_bug
delete table inet vector_bug
table inet vector_bug {
  chain input {
    type filter hook input priority 0;
    ct state vmap { established : accept, related : accept, invalid : drop }
    tcp dport 5515 tcp flags syn ct state new counter accept
    # tcp dport 5515 counter drop
    tcp dport 5515 counter reject
  }
}
Load the table with nft -f nft.conf
Kill the session with iptstate by pressing x on the matching entry in the conntrack table.
In Wireshark it looks like this:
Packets 131 through 137 (inclusive) were captured with the drop rule, 138 through 141 with the reject rule, and 142 was the shutdown of the Vector agent.
Depending on the firewall config, the TCP session could be continued even after the state was killed, at least with rulesets like https://wiki.nftables.org/wiki-nftables/index.php/Simple_ruleset_for_a_server. But a lot of "enterprise" firewalls are configured like the nftables config I posted above: they only accept new sessions that start with a SYN, which Vector does not attempt, so it will try to retransmit forever.
I did this test with two Vector instances, vector 0.22.3 (x86_64-unknown-linux-gnu). Vector logs this as it does the TCP retransmits:
2022-07-15T14:08:48.771398Z WARN sink{component_kind="sink" component_id=sink_socket component_type=vector component_name=sink_socket}:request{request_id=3}: vector::sinks::util::retries: Request timed out. If this happens often while the events are actually reaching their destination, try decreasing `batch.max_bytes` and/or using `compression` if applicable. Alternatively `request.timeout_secs` can be increased.
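For reference, the options named in that warning are sink-level settings. A rough sketch of where they would sit (the values are placeholders/assumptions; as the rest of this thread argues, they only tune batching and timeouts and do not re-establish the stuck TCP session):
sinks:
  sink_socket:
    type: 'vector'            # matches component_type=vector in the log line above
    inputs: ['source_stdin']
    address: 'logs.example.net:5515'
    batch:
      max_bytes: 100000       # smaller batches, as the warning suggests (value is an assumption)
    request:
      timeout_secs: 120       # longer per-request timeout (value is an assumption)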
I did the same test with Logstash as agent with this config:
input { stdin { } }
output {
  tcp {
    host => "logs.example.net"
    port => 5515
  }
}
Logstash has the exact same behavior. But I have only waited a few minutes and have not researched this in depth.
Then I got curious whether the Vector source and sink are also affected. They are!
Agent:
---
# vim: foldmarker=[[[,]]]:foldmethod=marker
data_dir: '/tmp'
sources:
  source_stdin:
    type: 'stdin'
sinks:
  sink_socket:
    type: 'vector'
    inputs: ['source_stdin']
    address: 'logs.example.net:5515'
Aggregator/server:
---
# vim: foldmarker=[[[,]]]:foldmethod=marker
data_dir: '/tmp'
sources:
  source_vector:
    type: 'vector'
    address: '0.0.0.0:5515'
sinks:
  sink_stdout:
    type: 'console'
    inputs: ['source_vector']
    encoding:
      codec: 'json'
Same behavior. I would have expected that this more complex protocol could recover from this.
The fix I would suggest: try a TCP SYN to establish a new connection when running into a timeout.
Note: This is the kind of bug that is normally worked around by a restart. Restarts do help (here) but don't fix the root cause. The root cause is difficult to understand/troubleshoot but easy to fix once understood. Please fix it :)
I first did this with iptables. iptables is legacy, so these rules are only included for reference, but iptables is still widely used:
iptables -A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp --dport 5515 --syn -m conntrack --ctstate NEW -j ACCEPT
iptables -A INPUT -p tcp --dport 5515 -j DROP
Thanks for this detailed reproduction case @ypid-work ! I'll reopen and we can try to dig into this soon.
Alright, I'm going to be a slight party pooper here: Vector is doing the best it reasonably can here, and this isn't really a bug. Let me explain why.
As you've pointed out, stateful firewalls, and even common load balancers like IPVS, can either lose their connection tracking tables or have them cleared. This part is very common and I agree that systems must cope with that fact. Ultimately, Vector (or any other application with an established TCP connection) cannot reasonably know if the remote peer is simply taking a long time to respond or if the connection is dead: a RST packet was never sent, so as far as Vector knows, the connection is still established.
The correct solution here (IMO) is to configure TCP keepalives so that the OS starts trying to probe the remote peer when it has been idle long enough, and will eventually detect the peer is dead and forcefully close the connection on the Vector side, triggering Vector to open a new TCP connection. Otherwise, we need to implement sink-specific heartbeats, which will often not be possible with things like the socket sink.
We do theoretically have support for configuring TCP keepalives for the socket sink in TCP mode, as well as for version 1 of the vector sink, but I still need to do local testing to make sure that Vector is actually configuring TCP keepalives correctly.
I'll report back shortly with any potential issues that we file as a result of changes/fixes that need to occur to allow using TCP keepalives correctly.
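To make that concrete, here is a sketch of what enabling keepalives could look like on the socket sink from the first reproduction. The keepalive section and the 60-second idle time are illustrative assumptions, pending the testing mentioned above:
sinks:
  sink_socket:
    type: 'socket'
    inputs: ['source_stdin']
    address: 'logs.example.net:5149'
    mode: 'tcp'
    keepalive:
      time_secs: 60   # assumed setting: idle time before the OS starts sending TCP keepalive probes
    encoding:
      codec: 'json'
If the probes go unanswered (or are rejected by the firewall), the OS closes the socket on the agent side and Vector has to open a new connection, which starts with a fresh SYN.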
Hi @jszwedko @tobz, is there any update for any fixes or workaround for this issue?
No updates yet unfortunately.
Vector Version
Vector Configuration File
Debug Output
I do not have relevant debug output. There are too many logs at the debug level and they are rotated, and in the recent logs there are no traces of the problem.
Expected Behavior
I receive strings over a TCP socket (Graphite metrics specifically, but it does not matter), alter them with a Lua script, and send those strings out to three TCP sockets. I expect Vector to send data to all three sinks (TCP sockets), or at least try to reconnect if there are problems.
Actual Behavior
After several days of running, Vector just stops sending to some of the sinks. I do not see an established connection in netstat and do not see any attempts to reestablish it in tcpdump. I have seen that behaviour at least three times already: the first time it stopped sending to out2 and out3, and the other two times to out2. I tried to send USR1, but it did not help.
I did not find anything useful in the logs; Vector just does not report any errors about it or anything else. With LOG=debug, last time it showed writes to port 2003 (out1) and port 2005 (out3) and completely nothing about 2004 (out2).
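For illustration only, a minimal configuration matching that description might look like the following. All component names, hostnames, the listening port, and the Lua transform body are assumptions, since the original configuration file was not included in the report:
---
data_dir: '/tmp'
sources:
  source_graphite:
    type: 'socket'                # receives Graphite plaintext lines over TCP
    address: '0.0.0.0:2103'       # listening port is an assumption
    mode: 'tcp'
transforms:
  transform_lua:
    type: 'lua'
    version: '2'
    inputs: ['source_graphite']
    hooks:
      process: |-
        function (event, emit)
          -- alter the metric line here (body is a placeholder)
          emit(event)
        end
sinks:
  out1:
    type: 'socket'
    inputs: ['transform_lua']
    address: 'host1.example.net:2003'   # hostname is an assumption; port from the report
    mode: 'tcp'
    encoding:
      codec: 'text'
  out2:
    type: 'socket'
    inputs: ['transform_lua']
    address: 'host2.example.net:2004'
    mode: 'tcp'
    encoding:
      codec: 'text'
  out3:
    type: 'socket'
    inputs: ['transform_lua']
    address: 'host3.example.net:2005'
    mode: 'tcp'
    encoding:
      codec: 'text'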