Hundreds of tracing spans are left un-ended by every query timeout
As a prism goalie
I want a stable Heroic
So that I can focus on features, not get woken up at night, and not have angry users
These un-ended spans represent a real runtime risk to heroic. If ~700-1000 of them are left hanging around after each timed-out query, it is conceivable that the JVM will:
potentially run out of memory altogether
experience much longer GC pauses / sweep times (because all the hanging spans need reaping)
hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems
Proposed Solution
Find the correct location to catch the Bigtable timeout exception (not trivial).
Catch it, end the span, and rethrow it (a sketch of this pattern follows below).
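A minimal sketch of that pattern, not the actual heroic code: the span name bigtable.fetchBatch is taken from the logs below, but the exception type, method names, and surrounding structure are assumptions for illustration. Using a finally block guarantees the span is ended whether the fetch succeeds, times out, or fails in some other way.

```java
import io.opencensus.trace.Span;
import io.opencensus.trace.Status;
import io.opencensus.trace.Tracer;
import io.opencensus.trace.Tracing;

import java.util.concurrent.TimeoutException;

public class FetchBatchSketch {
    private static final Tracer tracer = Tracing.getTracer();

    Object fetchBatch() throws TimeoutException {
        final Span span = tracer.spanBuilder("bigtable.fetchBatch").startSpan();
        try {
            return doBigtableFetch();                 // placeholder for the real Bigtable call
        } catch (TimeoutException e) {
            // Record the timeout on the span so it is reported, then rethrow for callers.
            span.setStatus(Status.DEADLINE_EXCEEDED.withDescription(e.getMessage()));
            throw e;
        } finally {
            span.end();                               // always end the span, success or failure
        }
    }

    private Object doBigtableFetch() throws TimeoutException {
        // Hypothetical stand-in that simulates the Bigtable deadline being exceeded.
        throw new TimeoutException("Bigtable deadline exceeded");
    }
}
```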
Repro Steps
Run heroic locally with the GUC config, on branch feature/add-bigtable-timeout-settings-refactored.
Capture a lengthy query from Grafana using the Chrome dev tools network tab.
Alter the query to hit localhost (a replay sketch follows these steps) and watch the logs; you'll see messages like the ones listed below.
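As a rough illustration of the last step, the captured query body can be replayed against the local instance with something like the sketch below. The port and the /query/metrics path are assumptions; adjust them to match the local config.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReplayQuery {
    public static void main(String[] args) throws Exception {
        // Query body copied from the Chrome dev tools network tab, saved to query.json.
        final String body = Files.readString(Path.of("query.json"));

        // Endpoint and port are assumptions; point them at the locally running heroic.
        final HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/query/metrics"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        final HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```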
List of affected methods (from the logs)
ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.