metalbear-co / mirrord

Connect your local process and your cloud environment, and run local code in cloud conditions.
https://mirrord.dev
MIT License

Research sessions failing under high stress mirroring #2529

Open DmitryDodzin opened 2 weeks ago

DmitryDodzin commented 2 weeks ago

mirrord sometimes fails on high-traffic, high-throughput services: the throughput required from mirrord-agent to mirrord-int-proxy is higher than what is available (via port-forward or the operator connection), and the agent's buffers start overflowing.

A possible solution is to allow http_filters (or some TCP filters, such as something like source_addr) to apply to mirroring as well, or to add some form of connection sampling.
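As a rough illustration of the sampling idea, here is a minimal sketch that keeps about one in N connections by hashing the source address. The `should_mirror` function and the `keep_one_in` parameter are hypothetical names made up for this example, not part of mirrord's configuration or code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

/// Hypothetical helper (not a mirrord API): decide whether a connection
/// from `source` should be mirrored, keeping roughly one in `keep_one_in`.
fn should_mirror(source: SocketAddr, keep_one_in: u64) -> bool {
    // 0 or 1 means "no sampling": mirror everything.
    if keep_one_in <= 1 {
        return true;
    }
    // Hash the source address and port so that every packet of the same
    // connection gets the same decision.
    let mut hasher = DefaultHasher::new();
    source.hash(&mut hasher);
    hasher.finish() % keep_one_in == 0
}

fn main() {
    // Simulate 1000 distinct connection sources.
    let sources: Vec<SocketAddr> = (0..1000u16)
        .map(|i| SocketAddr::from(([10, 0, (i / 256) as u8, (i % 256) as u8], 40000 + i)))
        .collect();
    let kept = sources.iter().filter(|s| should_mirror(**s, 10)).count();
    println!("mirrored {} of {} connections", kept, sources.len());
}
```

Deciding per source address rather than per packet matters here: a sampled connection is either mirrored in full or not at all, so clients never see a partial stream because of sampling.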

aviramha commented 2 weeks ago

I don't think that limiting the throughput is a solution, since there's no real reason we wouldn't support such throughput. The only limitation should be resources, and if those aren't the constraint then it should work.

Razz4780 commented 40 minutes ago

It seems to me that the problem originates in TcpConnectionSniffer sending data to all clients synchronously. If one of the clients is not processing data quickly enough (or our agent-client connections don't handle the throughput in general), the sniffer blocks. The raw socket's recv queue grows and we start dropping IP packets.

What is a bit sad is that we cannot solve this problem entirely. There is no way (or at least no way I know of) we can apply back pressure to the connection sources, as we only sniff the incoming packets.

Proposed mitigations:

  1. Use a separate tokio::task to receive packets from the raw socket. This way we can still discard packets that are for sure not interesting (e.g. not TCP, unsubscribed port, unknown TCP session), regardless of laggy clients. May help with the queue filling up with garbage.
  2. When sending incoming data to a TcpConnectionSniffer client, require that the mpsc channel has capacity (use Sender::try_send instead of Sender::send). If the data cannot be sent instantly, close the connection for this client and log a warning.
  3. Inspect data offset in intercepted TCP packets. This way we can discover dropped packets (holes in data) and close the connection for all clients.
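
Mitigations 2 and 3 could look roughly like the sketch below. To stay self-contained it uses `std::sync::mpsc::sync_channel` as a stand-in for the tokio mpsc channel; `SyncSender::try_send` fails with `Full` when the channel has no capacity, much like tokio's `Sender::try_send`. The `Client` and `SeqTracker` types and the `broadcast` function are hypothetical, not the agent's actual code, and the sketch assumes that "data offset" means the TCP sequence number of a segment.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical per-client state; not mirrord's actual type.
struct Client {
    id: u32,
    tx: SyncSender<Vec<u8>>,
}

/// Mitigation 2: forward data without ever blocking the sniffer.
/// Clients whose channel is full (or closed) are dropped.
fn broadcast(clients: &mut Vec<Client>, data: &[u8]) {
    clients.retain(|client| match client.tx.try_send(data.to_vec()) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => {
            eprintln!("warning: client {} is lagging, closing its connection", client.id);
            false
        }
        // The client already went away; just forget it.
        Err(TrySendError::Disconnected(_)) => false,
    });
}

/// Mitigation 3: track the next expected sequence number of one TCP session.
struct SeqTracker {
    next_seq: u32,
}

impl SeqTracker {
    /// Returns `false` if the segment starts past the expected sequence
    /// number, which means some data in between was never seen.
    fn accept(&mut self, seq: u32, payload_len: u32) -> bool {
        // The signed wrapping difference handles sequence number wraparound.
        if (seq.wrapping_sub(self.next_seq) as i32) > 0 {
            return false;
        }
        // Advance only if the segment carries new data (ignore retransmits).
        let end = seq.wrapping_add(payload_len);
        if (end.wrapping_sub(self.next_seq) as i32) > 0 {
            self.next_seq = end;
        }
        true
    }
}

fn main() {
    let (fast_tx, fast_rx) = sync_channel(8);
    // Keep the slow receiver alive but never read from it.
    let (slow_tx, _slow_rx) = sync_channel(1);
    let mut clients = vec![Client { id: 1, tx: fast_tx }, Client { id: 2, tx: slow_tx }];

    broadcast(&mut clients, b"first");
    broadcast(&mut clients, b"second"); // client 2's channel is full here
    println!("clients left: {}", clients.len());
    println!("fast client received {} chunks", fast_rx.try_iter().count());

    let mut tracker = SeqTracker { next_seq: 1000 };
    println!("in order: {}", tracker.accept(1000, 100));
    println!("after a gap: {}", tracker.accept(1300, 100));
}
```

Note that this tracker treats any segment ahead of the expected position as a hole. Packets that are merely reordered would trigger it too, so a real implementation would probably need a small reordering window before giving up on the connection.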

@aviramha wdyt?