Small Agent Manager Performance Tweaks

jeregrine commented 3 years ago

Queues up to 10 messages into a BatchCommand to avoid waiting for a response to each message.
Uses Unix domain sockets, which lets Erlang do slightly less work than it would for a TCP socket. (experimental)
Skips decoding the response; it is only decoded when debugging.
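
For reference, a minimal sketch of how the two transports differ at the `:gen_tcp` level. The `SocketSketch` module, the address tuples, and the socket options are assumptions made for illustration, not the code in this PR.

```elixir
defmodule SocketSketch do
  # Hypothetical options; the real agent connection may use different ones.
  @opts [:binary, active: false, packet: :raw]

  # Unix domain socket: pass {:local, path} as the address and 0 as the port.
  def connect({:unix, path}) when is_binary(path) do
    :gen_tcp.connect({:local, path}, 0, @opts)
  end

  # Plain TCP: host as a charlist plus a port number.
  def connect({:tcp, host, port}) when is_binary(host) and is_integer(port) do
    :gen_tcp.connect(String.to_charlist(host), port, @opts)
  end
end
```

Both clauses return the usual `{:ok, socket}` or `{:error, reason}`, so the rest of the send/receive path would not need to know which transport is in use.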

A customer reported that memory grew steadily in the AgentManager process during periods of high load, and identified the AgentManager mailbox exploding in size as the culprit.

The Core Agent is single-threaded, so opening more agent connections will not help. We need a way to buffer messages or block the AgentManager for less time so it can chew through its mailbox faster.

Ideally these changes will take pressure off the AgentManager process by sending to the agent one tenth as often as before, and by doing less work per message in aggregate. There is a risk that we lose buffered messages if the process shuts down before the batch is sent.
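
A rough sketch of the batching idea, not the actual AgentManager code. `BatchingSketch`, `enqueue/2`, and the `send_fun` callback are hypothetical names; `send_fun` stands in for whatever writes a batch to the agent socket. The `terminate/2` flush goes beyond what this PR does and only illustrates one way to reduce the message-loss risk.

```elixir
defmodule BatchingSketch do
  use GenServer

  # Batch size taken from the description above.
  @batch_size 10

  # send_fun is a hypothetical 1-arity function that sends a list of messages.
  def start_link(send_fun) when is_function(send_fun, 1) do
    GenServer.start_link(__MODULE__, send_fun)
  end

  def enqueue(pid, message), do: GenServer.cast(pid, {:enqueue, message})

  @impl true
  def init(send_fun) do
    # Trap exits so terminate/2 also runs when a supervisor shuts us down.
    Process.flag(:trap_exit, true)
    {:ok, %{send_fun: send_fun, buffer: [], count: 0}}
  end

  @impl true
  def handle_cast({:enqueue, message}, state) do
    state = %{state | buffer: [message | state.buffer], count: state.count + 1}

    # Only talk to the agent once a full batch has accumulated.
    if state.count >= @batch_size do
      {:noreply, flush(state)}
    else
      {:noreply, state}
    end
  end

  @impl true
  def terminate(_reason, state) do
    # Send whatever is still buffered before the process exits.
    flush(state)
    :ok
  end

  defp flush(%{count: 0} = state), do: state

  defp flush(state) do
    # The buffer is built by prepending, so reverse it to restore arrival order.
    state.send_fun.(Enum.reverse(state.buffer))
    %{state | buffer: [], count: 0}
  end
end
```

Note that `terminate/2` is not called if the process is killed outright, so a flush like this narrows the loss window rather than closing it.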

Possible Future Changes

[ ] Flush the buffer before the GenServer shuts down.
[ ] Set recvbuf to 1 and skip reading the full response, unblocking as soon as we've received anything at all, which in this case would be the message length response.
[ ] Drop messages when load is high. We could check the mailbox length and, if it's past a certain depth, selectively drop messages based on type, possibly giving the user the ability to configure which messages to drop.
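
The last item could look roughly like the sketch below. The module name, the default depth threshold, and the `:low_priority` message type are all placeholders, since the thread does not settle on which message types would be droppable.

```elixir
defmodule LoadSheddingSketch do
  # Hypothetical defaults; real values would need tuning and user configuration.
  @default_max_depth 1_000
  @default_droppable [:low_priority]

  # Current mailbox length of a process (0 if the process is already dead).
  def mailbox_depth(pid \\ self()) do
    case Process.info(pid, :message_queue_len) do
      {:message_queue_len, len} -> len
      nil -> 0
    end
  end

  # Drop only when the mailbox is deeper than the threshold AND the
  # message type is on the configured droppable list.
  def drop?(type, depth, opts \\ []) do
    max_depth = Keyword.get(opts, :max_depth, @default_max_depth)
    droppable = Keyword.get(opts, :droppable, @default_droppable)
    depth > max_depth and type in droppable
  end
end
```

A caller would check `LoadSheddingSketch.drop?(type, LoadSheddingSketch.mailbox_depth(manager_pid))` before casting to the manager, so that dropped messages never enter its mailbox in the first place.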

dlanderson commented 3 years ago

@jeregrine Unless TCP is adding significant overhead compared to a local socket, let's keep TCP as the default. We ran into a lot of issues with the unix socket (permissions, mounting/path issues, etc.) that we don't have to deal with when using TCP. These days, the TCP stacks on modern OS distros are optimized enough that we shouldn't be seeing a dramatic difference in overhead. See also: https://github.com/scoutapp/scout_apm_elixir/issues/115 (should be closed/resolved but somehow it's still marked as open :)

dlanderson commented 2 years ago

@jeregrine Any update on this? We had another customer hit issues with a very large message queue.

jeregrine commented 2 years ago

To clarify: this PR has things that speed up the Elixir API reporter, but we're not bottlenecked there. We're bottlenecked waiting for the Agent to respond. We can't open multiple connections to it or send data faster (at least that was the case when this PR was opened).

So we could implement a periodic task that drops the queue based on some heuristics, at the cost of users losing data. I guess that is better than the situation now, where the agent crashes because it's overloaded.

Let me know how you'd like me to proceed, but at the moment we're kinda stuck between a rock and a hard place.

scoutapp / scout_apm_elixir

Small Agent Manager Performance Tweaks #121