So I found one possible workaround, using the following patch:
diff --git a/lib/gen_server/monitor.ex b/lib/gen_server/monitor.ex
index 5e8eab1..8048016 100644
--- a/lib/gen_server/monitor.ex
+++ b/lib/gen_server/monitor.ex
@@ -35,44 +35,7 @@ defmodule GenMetrics.GenServer.Monitor do
do_intercept_call_request(state, mod, pid, fun, ts)}
end
- # Intercept {:reply, reply, new_state}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:reply, _, _}, ts}, state) do
- {:noreply,
- do_intercept_call_response(state, mod, pid, fun, ts)}
- end
-
- # Intercept {:reply, reply, new_state, timeout | :hibernate}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:reply, _, _, _}, ts}, state) do
- {:noreply,
- do_intercept_call_response(state, mod, pid, fun, ts)}
- end
-
- # Intercept {:noreply, new_state}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:noreply, _}, ts}, state) do
- {:noreply,
- do_intercept_call_response(state, mod, pid, fun, ts)}
- end
-
- # Intercept {:noreply, new_state, timeout | :hibernate}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:noreply, _, _}, ts}, state) do
- {:noreply,
- do_intercept_call_response(state, mod, pid, fun, ts)}
- end
-
- # Intercept {:stop, reason, reply, new_state}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:stop, _, _, _}, ts}, state) do
- {:noreply,
- do_intercept_call_response(state, mod, pid, fun, ts)}
- end
-
- # Intercept {:stop, reason, new_state}
- def handle_info({:trace_ts, pid, :return_from, {mod, fun, _arity},
- {:stop, _, _}, ts}, state) do
+ def handle_info({:trace_ts, pid, :return_to, {mod, fun, _arity}, ts}, state) do
{:noreply,
do_intercept_call_response(state, mod, pid, fun, ts)}
end
@@ -117,15 +80,15 @@ defmodule GenMetrics.GenServer.Monitor do
# Activate tracing for servers within cluster.
defp activate_tracing(cluster) do
- :erlang.trace(:all, true, [:call, :monotonic_timestamp])
+ :erlang.trace(:processes, true, [:call, :return_to, :monotonic_timestamp])
for server <- cluster.servers do
:erlang.trace_pattern({server, :handle_call, 3},
- [{:_, [], [{:return_trace}]}])
+ [{:_, [], []}], [:local, :call_count, :call_time])
:erlang.trace_pattern({server, :handle_cast, 2},
- [{:_, [], [{:return_trace}]}])
+ [{:_, [], []}], [:local, :call_count, :call_time])
:erlang.trace_pattern({server, :handle_info, 2},
- [{:_, [], [{:return_trace}]}])
+ [{:_, [], []}], [:local, :call_count, :call_time])
end
{:ok, cluster}
Benchmark-wise, it looks like it doesn't change much, so the new flags should be more or less a drop-in replacement for the old ones in terms of the end result. Here are the results from my benchmark:
Elixir 1.4.1
Erlang 19.2
Benchmark suite executing with the following configuration:
warmup: 5.0s
time: 30.0s
parallel: 1
inputs: none specified
Estimated total run time: 140.0s
Benchmarking traced pipeline...
Benchmarking traced server (call)...
Benchmarking untraced pipeline...
Benchmarking untraced server (call)...
Name ips average deviation median
untraced server (call) 0.34 2.91 s ±13.52% 2.70 s
traced server (call) 0.161 6.21 s ±7.33% 6.04 s
untraced pipeline 0.0887 11.27 s ±3.64% 11.02 s
traced pipeline 0.0406 24.63 s ±0.48% 24.63 s
Comparison:
untraced server (call) 0.34
traced server (call) 0.161 - 2.13x slower
untraced pipeline 0.0887 - 3.87x slower
traced pipeline 0.0406 - 8.45x slower
This change is just for the GenServer metrics, but it was an experiment, so I didn't do a full PR. It still has an issue, which is that due to the tracing, the GC can barely keep up: memory usage would go from 2GB to 8GB, then down to 3GB, then up again, all over the map. I suspect that for gen_servers not dealing with large messages or really high volume this is probably a non-issue, unless a lot of processes are being traced. That said, it's still something which should have a great big warning sign attached so that people know what to look out for and where to apply this.
I still feel like the best approach is that of statistical sampling, i.e. turn on tracing for windows of up to 5 seconds (could be configurable) on some interval (maybe once every 30-60 seconds), do some statistical analysis (I can recommend the bear library to get histograms), and report the downsampled metrics to the backend. This should have minimal impact on runtime performance and avoid the memory issue raised here, while still providing value.
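To make that concrete, here is a rough sketch of what I mean. This is not GenMetrics code: the module name, the window and interval values, and the fact that it only traces `handle_call/3` of a single placeholder server module are all assumptions for illustration.

```elixir
defmodule SamplingTracer do
  use GenServer

  @window_ms 5_000      # tracing stays on this long per sample
  @interval_ms 30_000   # gap between samples

  def start_link(server_module), do: GenServer.start_link(__MODULE__, server_module)

  @impl true
  def init(server_module) do
    Process.send_after(self(), :open_window, @interval_ms)
    {:ok, %{server: server_module, samples: []}}
  end

  @impl true
  def handle_info(:open_window, state) do
    # Trace messages are delivered to this process (the caller of trace/3).
    :erlang.trace(:all, true, [:call, :monotonic_timestamp])
    :erlang.trace_pattern({state.server, :handle_call, 3},
                          [{:_, [], [{:return_trace}]}])
    Process.send_after(self(), :close_window, @window_ms)
    {:noreply, state}
  end

  def handle_info(:close_window, state) do
    # Remove the pattern and switch tracing back off until the next window,
    # so the return_trace penalty is only paid for a few seconds at a time.
    :erlang.trace_pattern({state.server, :handle_call, 3}, false)
    :erlang.trace(:all, false, [:call, :monotonic_timestamp])
    Process.send_after(self(), :open_window, @interval_ms)
    # A real implementation would summarise state.samples here (bear can
    # produce histograms/percentiles) and push the result to the backend.
    {:noreply, %{state | samples: []}}
  end

  # Collect raw trace messages while a window is open.
  def handle_info(trace, state) when elem(trace, 0) == :trace_ts do
    {:noreply, %{state | samples: [trace | state.samples]}}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```

A real version would also install patterns for `handle_cast/2` and `handle_info/2` on every monitored server, but the shape of the on/off cycle is the important part.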
@bitwalker you're full of bad news :) But it's really great that you've taken the time to dig into this. I haven't looked at this latest issue re. tail call optimization yet, as I've been busy digging into the timing differences I'm seeing from our benchmarks. I now understand the core of that issue and this is something I'll update you on as soon as I have code in place to back up my theories. After that I'll turn my attention to the issues you've raised on this thread. Thanks again, I'll be in touch.
I know I know, I'm sorry! I hope my last findings work out to solve this, because then it's not quite so bad :). Let me know what you find!
Just fyi @bitwalker, per your recommendation I've added a warning to the top of the README to let people know this library is not currently suitable for production environments, including a link to these open issues.
So it doesn't look like this workaround actually works as is. Your benchmark did indeed complete just fine, but I suspect no actual metrics were gathered while your benchmarks ran. And the reason for this, based on what I'm seeing here, is that by changing the trace/trace_pattern from `:return_trace` to `:return_to`, the `fun` reported in the `{mod, fun, _arity}` in the trace handler is always reported as `try_dispatch`.

The `try_dispatch` on the return side of the handlers does not match any of the opening `fun`, which in a GenServer will invariably be one of `handle_call`, `handle_cast`, or `handle_info`. So no metrics are generated.
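One possible direction, purely a sketch against a hypothetical `pending` map in the monitor state rather than the real state struct, would be to remember which callback each pid entered when the `:call` trace arrives and ignore the `fun` reported by `:return_to` entirely:

```elixir
# Hypothetical sketch: key in-flight callbacks by pid so the :return_to
# trace does not need to name the callback that just returned.

# Call side: record which callback this pid entered.
def handle_info({:trace_ts, pid, :call, {mod, fun, _args}, ts}, state)
    when fun in [:handle_call, :handle_cast, :handle_info] do
  state = %{state | pending: Map.put(state.pending, pid, {mod, fun})}
  {:noreply, do_intercept_call_request(state, mod, pid, fun, ts)}
end

# Return side: :return_to reports the gen_server internals (try_dispatch),
# so recover the callback name from the pending map instead of the trace MFA.
def handle_info({:trace_ts, pid, :return_to, _internal_mfa, ts}, state) do
  case Map.pop(state.pending, pid) do
    {{mod, fun}, pending} ->
      {:noreply,
       do_intercept_call_response(%{state | pending: pending}, mod, pid, fun, ts)}

    {nil, _pending} ->
      {:noreply, state}
  end
end
```

I haven't tried this against the actual codebase, so treat it as an idea to explore rather than a fix.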
While I work out what, if anything, I should do about this, I'd like to understand more about your test case that led to stack exhaustion. You described it as follows:
I tested this myself by creating an infinite Stream in my benchmark, and then iterating over that stream until I ran out of memory.
Did you iterate over that infinite stream within one of your GenServer callbacks that was being monitored by GenMetrics? I just want to understand as best I can what type of applications are susceptible as eventually this needs to be reflected in the docs if no workaround exists.
Sure, basically I changed my benchmark from enumerating a collection of 1M events to enumerating an infinite stream of events. Only one event is sent at a time, so the stream is never evaluated in full, but by having an infinite stream of events we can ensure execution continues until memory is exhausted. I wasn't paying attention to resource usage when running the original benchmarks, and because execution completed before running out of memory, I never noticed the problem. Once I suspected the issue might be present, I switched to the stream so I could easily reproduce it by just running the benchmark and watching my system monitor.
Hope that helps!
That's crystal clear @bitwalker. I never reached that point either for the same reason, I simply wasn't loading the server (or pipeline) over a long enough period of time. Thanks again.
@bitwalker can you show me the code you used to enumerate an infinite stream of events in your benchmark test? I ask because when I created a simple infinite-stream benchmark test (even without any gen_metrics dependencies), I saw that the `Elixir.Benchee.Benchmark` process would quickly run out of memory. Here is the simplest benchmark code I have used that quickly exhausts memory:
Benchee.run(%{time: 30, warmup: 5},
  %{"infinite-stream " => fn ->
      Stream.repeatedly(fn -> :ok end)
      |> Enum.take(99999999999999)
    end})
Obviously this differs somehow from the approach you took in your own infinite-stream benchmark. Any clues as to where I've gone wrong much appreciated.
It's the `Enum.take(huge number)` which is killing you :)
s = Stream.cycle(list_of_events)
for e <- s, do: GenServer.call(pid, {:emit, e})
I'm on my phone, so I don't have the code from my benchmark handy, but the gist of the streaming bit is there.
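Fleshed out from memory into the same `Benchee.run(config, jobs)` shape as your snippet, it would look roughly like the following. The `EmitServer` module and the `:emit` message are stand-ins rather than my actual benchmark code; the point is that the stream stays lazy, so the job only ever holds one event at a time and simply runs until you stop it or memory runs out.

```elixir
defmodule EmitServer do
  use GenServer
  def init(state), do: {:ok, state}
  def handle_call({:emit, _event}, _from, state), do: {:reply, :ok, state}
end

{:ok, pid} = GenServer.start_link(EmitServer, nil)
list_of_events = Enum.to_list(1..1_000)

Benchee.run(%{time: 30, warmup: 5},
  %{"infinite-stream (lazy)" => fn ->
      list_of_events
      |> Stream.cycle()
      # Enum.each discards results, so the driver itself accumulates nothing;
      # only the traced server's behaviour determines memory growth.
      |> Enum.each(fn event -> GenServer.call(pid, {:emit, event}) end)
    end})
```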
Hi @bitwalker, I think the issue you raised on this thread has been addressed as much as practically possible, and the documentation and benchmarks now on the project make clear both the potential and the limitations of the library. If you agree, do you think we can close out this issue and perhaps the other RFC?
I agree :) Great work on this library!
Thanks @bitwalker. This library has come along leaps and bounds in large part thanks to your feedback and suggestions. All of it, much appreciated.
So I discovered a huge issue. I was reading through the Erlang docs on tracing and saw mention that the `return_trace` match spec function disables tail call optimization. In other words, any long-lived process (i.e. `gen_server`, `GenStage`, etc.), which are all built on tail-recursive loops, cannot have this turned on permanently or their stack will accumulate until there is no more system memory left. I tested this myself by creating an infinite `Stream` in my benchmark, and then iterating over that stream until I ran out of memory. In the traced version, I run out of memory very fast as each stack frame holds the message sent, which is pretty large. In the untraced version, this never happens.

In other words, I would strongly recommend putting a notice in the README that as it stands this cannot be used in production, as it will cause applications to eventually fail - when it will fail is entirely up to how often a process receives messages (each message handled makes another loop).

It looks like the only possible solution to this would be to set the `silent` flag when calling `trace/3`, and periodically enable tracing and then disable it after a short window of time using the `{:silent, <bool>}` match spec function via `trace_pattern/3`. This should preserve the TCO as I understand it, but one would need to do some long-running tests to see if turning it on and then off just delays things by accumulating more stack frames each time it is turned on.

Sorry for the bad news, but as soon as I saw this I knew I needed to report it :(
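To be concrete about the silent-flag mechanism I'm describing, here is a sketch. `MyServer` is a placeholder for a monitored server module, and whether TCO really is preserved while a window is closed is exactly the open question above.

```elixir
defmodule SilentWindow do
  # Placeholder list of callbacks to trace; GenMetrics would derive this
  # from the configured cluster.
  @callbacks [{MyServer, :handle_call, 3},
              {MyServer, :handle_cast, 2},
              {MyServer, :handle_info, 2}]

  def install do
    # :silent suppresses call/return_from trace messages until a match spec
    # action clears it, so tracing itself can stay permanently enabled.
    :erlang.trace(:all, true, [:call, :silent, :monotonic_timestamp])
    close_window()
  end

  # The next matching call in each traced process clears its :silent flag,
  # so trace messages (including return_trace) flow while the window is open.
  def open_window do
    for mfa <- @callbacks do
      :erlang.trace_pattern(mfa, [{:_, [], [{:silent, false}, {:return_trace}]}])
    end
  end

  # The next matching call sets :silent again and drops return_trace,
  # closing the window.
  def close_window do
    for mfa <- @callbacks do
      :erlang.trace_pattern(mfa, [{:_, [], [{:silent, true}]}])
    end
  end
end
```

A scheduler process would then call `open_window/0` and `close_window/0` on whatever interval makes sense, similar to the sampling idea discussed earlier in the thread.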