newrelic / go-agent

New Relic Go Agent
Apache License 2.0

Influence Sampled logic #968

Open aurelijusbanelis opened 4 days ago

aurelijusbanelis commented 4 days ago

Current Distributed Tracing sampling is based on "magic" (assuming the most common request is the most important one), while the business is actually running on:

(image)

Therefore, we want to "teach" the SDK what is important to sample.

Summary

The current sampling logic is not configurable.

Therefore, developers end up with hacks. It feels wrong to pay for an observability platform and still need code like:

txn := newrelic.FromContext(ctx)
log := logrus.WithContext(newrelic.NewContext(ctx, txn))

// debugThisPage and u are defined elsewhere in the surrounding handler.
if debugThisPage {
	log.WithField("url", u.String()).Debug("Outgoing call")
}

Desired Behaviour

txn := newrelic.FromContext(ctx)
txn.MarkKeyTransaction(true)

or

txn := newrelic.FromContext(ctx)
txn.PreferSampled(true)

Additional context

iamemilio commented 4 days ago


Hi, this is an interesting proposal. The way the agent weighs a transaction and the data within it is not tunable at the moment. We do think it would make sense to give you the ability to mark something as important in the SDK if you are able to detect it outside of New Relic. In general, the algorithm prioritizes two things: outliers (runtime, memory, etc.) and errors. It is not able to "learn", and I don't think you want something running inside your application that could. It sounds like there are two problems here: important transactions are not getting enough weight during sampling, and "junk" transactions seem to be getting too much.

  1. How to elevate the data we want: I can see an API call that allows you to elevate a transaction being a possible solution here. We could bump the weight of that transaction up by a certain number of points, so that it is far more likely to be consumed without flooding your samples.
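The weight-bump idea above can be sketched as a top-k sampler over prioritized events. Everything here is hypothetical illustration, not the agent's actual sampler: the `event` type, the fixed `elevationBoost`, and the names are all placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// event is a hypothetical harvested transaction with a sampling
// priority; elevated marks transactions the application flagged
// as important.
type event struct {
	name     string
	priority float64
	elevated bool
}

// elevationBoost is an arbitrary placeholder for "a certain number
// of points" added to elevated transactions.
const elevationBoost = 0.5

// sampleTopK keeps the k highest-weight events per harvest, where
// elevated events get a fixed boost on top of their base priority.
func sampleTopK(events []event, k int) []event {
	weight := func(e event) float64 {
		if e.elevated {
			return e.priority + elevationBoost
		}
		return e.priority
	}
	sort.Slice(events, func(i, j int) bool {
		return weight(events[i]) > weight(events[j])
	})
	if len(events) > k {
		events = events[:k]
	}
	return events
}

func main() {
	evs := []event{
		{"GET /health", 0.9, false},
		{"POST /checkout", 0.5, true}, // elevated: weight 1.0
		{"GET /static", 0.7, false},
	}
	for _, e := range sampleTopK(evs, 2) {
		fmt.Println(e.name)
	}
}
```

Because the boost is a fixed increment rather than a guarantee, a marked transaction becomes far more likely to survive the cut but cannot monopolize the harvest.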

  2. Issues with "Junk" data crowding out other transactions: We have some questions here:

    • It looks like one trace type is getting selected at a much higher rate than others. Is there an obvious reason why that is? We'd like to understand why the top trace group seems to be crowding the others out.
    • Do you never want to collect certain transactions? If you know with certainty that a transaction is junk, it would make sense not to waste valuable space in a harvest sampling it.
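If the answer to the last question is yes, the simplest mechanism is a deny list consulted before sampling. A sketch, with hypothetical names throughout (the agent exposes no such hook today):

```go
package main

import "fmt"

// dropList is a hypothetical set of transaction names the
// application has declared junk, which should never consume
// space in a harvest.
var dropList = map[string]bool{
	"GET /healthcheck": true,
	"GET /metrics":     true,
}

// shouldSample returns false for deny-listed transactions;
// everything else falls through to normal sampling.
func shouldSample(txnName string) bool {
	return !dropList[txnName]
}

func main() {
	fmt.Println(shouldSample("GET /healthcheck"))
	fmt.Println(shouldSample("POST /checkout"))
}
```

A filter like this is complementary to the elevation API: one removes known junk up front, the other raises the odds for known-important traces among whatever remains.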