[Open] DmitryDodzin opened 2 weeks ago
I don't think that limiting the throughput is a solution, since there's no real reason we shouldn't support such throughput - the only limitation should be resources, and if those aren't the constraint, it should work.

It seems to me that the problem originates in `TcpConnectionSniffer` sending data to all clients synchronously. If one of the clients is not processing data quickly enough (or our agent-client connections can't handle the throughput in general), the sniffer blocks, the raw socket's recv queue grows, and we start dropping IP packets.

What is a bit sad is that we cannot solve this problem entirely. There is no way (or at least none I know of) to apply back pressure to the connection sources, as we only sniff the incoming packets.
Proposed mitigations:

1. Use a dedicated `tokio::task` to receive packets from the raw socket. This way we can still discard packets that are for sure not interesting (e.g. not TCP, unsubscribed port, unknown TCP session), regardless of laggy clients. This may help with the queue filling up with garbage.
2. In `TcpConnectionSniffer`, for each client, require that the `mpsc` channel has capacity (use `Sender::try_send` instead of `Sender::send`). If the data cannot be sent instantly, close the connection for this client and emit a warning log.

@aviramha wdyt?
There is sometimes a problem when running mirrord on high-traffic/high-throughput services, where the required throughput of `mirrord-agent` -> `mirrord-int-proxy` is higher than what is available (via port-forward or operator connection), and the agent buffers start overflowing.

A possible solution is to allow `http_filters` (or some TCP filters, like `source_addr` or something like that) to somehow work on mirroring, or to have some sort of sampling of connections.