phoenixframework / phoenix_live_dashboard

Realtime dashboard with metrics, request logging, plus storage, OS and VM insights
MIT License

TelemetryListener Fails to Update Metrics Under High Event Load in Phoenix Live Dashboard #449

Open techvoyagerX opened 3 days ago

techvoyagerX commented 3 days ago

Environment

Elixir version (elixir -v): 1.15.2
Phoenix version (mix deps): 1.7.0
Phoenix LiveView version (mix deps): 0.18.2
Phoenix Dashboard version (mix deps): 0.7.0
Operating system: macOS 13.4 Ventura
Browsers you attempted to reproduce this bug on: Chrome 115.0.5790.170, Firefox 116.0.2

Actual behavior

When using the Metrics page with custom Telemetry metrics, the Dashboard intermittently stops updating charts in real time under high load (e.g., more than 1000 events per second). Specifically, the memory and CPU utilization metrics become stale and the charts stop reflecting current data, even though the LiveView session remains active.

Here’s the stack trace when this occurs:

[error] #PID<0.477.0> running Phoenix.LiveView.Socket terminated
Server: localhost:4000 (http)
Request: GET /dashboard/metrics
** (exit) an exception was raised:
    ** (ArgumentError) argument error
        :erlang.apply/2
        (telemetry_poller) lib/telemetry_poller.ex:76: TelemetryPoller.execute/1
        (telemetry_poller) lib/telemetry_poller.ex:63: TelemetryPoller.loop/2
        (telemetry_poller) lib/telemetry_poller.ex:53: TelemetryPoller.start_link/1

This behavior is not observed at lower event rates (below 500 events per second), which suggests the issue may be related to an internal bottleneck or resource contention when handling a high volume of Telemetry events.

Expected behavior

The Dashboard should keep updating charts in real time at higher event rates without failing. Charts should accurately reflect live metrics regardless of system load, as expected from a production-level dashboard for high-traffic applications.

josevalim commented 3 days ago

Thank you. Unfortunately, the issue above does not give us enough information to determine the root cause, and it may not be related to a bottleneck at all. Is this in development or prod? The error makes me think this is somehow related to hot code loading.

The best next step would be to add some logging to TelemetryPoller and see exactly which function it fails to apply.
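
One possible shape for that logging, as a minimal sketch only. It assumes the measurement being applied is a `{module, function, args}` tuple. `MeasurementDebug` and `safe_apply/1` are hypothetical names used for illustration and are not part of telemetry_poller. The sketch logs the failing tuple and whether the function is still exported, which may help confirm or rule out the hot code loading theory.

```elixir
defmodule MeasurementDebug do
  # Hypothetical helper, not part of telemetry_poller.
  # Assumes a measurement is an {module, function, args} tuple.
  require Logger

  # Applies the measurement and returns {:ok, result}.
  # If an exception is raised, logs which MFA failed and returns {:error, exception}.
  # Exits and throws are intentionally not caught here.
  def safe_apply({mod, fun, args} = mfa)
      when is_atom(mod) and is_atom(fun) and is_list(args) do
    {:ok, apply(mod, fun, args)}
  rescue
    error ->
      # After a code reload the module or function may no longer be available,
      # so record both facts next to the failing MFA.
      loaded? = Code.ensure_loaded?(mod)
      exported? = function_exported?(mod, fun, length(args))

      Logger.error(
        "measurement #{inspect(mfa)} failed: #{Exception.message(error)} " <>
          "(module loaded: #{loaded?}, function exported: #{exported?})"
      )

      {:error, error}
  end
end

# Usage with standard-library functions standing in for real measurements.
{:ok, 3} = MeasurementDebug.safe_apply({Kernel, :+, [1, 2]})
{:error, _} = MeasurementDebug.safe_apply({String, :to_integer, ["not a number"]})
```

If the log shows `function exported: false` for a function that exists in the source, that would point toward code reloading rather than event volume.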