petabridge / phobos-issues

Public issues and bug tracker for Phobos®
https://phobos.petabridge.com/
2 stars 1 forks source link

Limiting granularity of Phobos distributed traces #46

Closed object closed 2 years ago

object commented 2 years ago

I tried to use Phobos-instrumented system with Honeycomb APM and quickly got into a problem with rate limit. Honeycomb sets maximum rate of 2000 events per second for a dataset (perhaps higher rate for Enterprise account, I didn't check) and our application exceeded that. While this is understandable because each incoming request results in a complex processing workflow across the Akka cluster nodes, the traffic in fact is relatively low.

I wonder whether you think Phobos tracing can be extended with activity/actor filtering so an Akka application can be configured to to sample just subsets of actors/activities. I haven't thoroghly thought about filtering principles, just trying to figure out how to deal with challenges like described above, when Phobos traces hit certain rate limit set by the APM provider.

Aaronontheweb commented 2 years ago

I'll write up a longer response to this, but there are definitely ways to accomplish this in Phobos 1.x through:

  1. Trace filtering
  2. Toggle tracing on/off for specific actors

And in Phobos 2.x you can do both of those plus use OpenTelemetry processors to filter out traces before they're exported to a service like HoneyComb. I'll get to work writing up a more detailed tuning guide on this subject.

object commented 2 years ago

Thank you. I wasn't aware of either trace filtering or toggle for specific actors. Will appreciated some details on how to achieve this.

Aaronontheweb commented 2 years ago

Here's the official literature we have so far: https://phobos.petabridge.com/articles/performance.html#performance-optimization-best-practices

But I can probably go deeper on that

object commented 2 years ago

Great, that's the good start for me.

object commented 2 years ago

Trying to get filtering to work, but still get unwanted messages. This is what I did:

                builder.WithTracing (fun b ->
                    b.SetTracer(tracer) |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.Core.ShardMessages.MediaSetShardMessage>() |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.Ps.PsShardMessages.PsShardMessage>() |> ignore
                    b.AddIncludeMessageFilter<Nrk.Oddjob.WebApi.HealthCheck.Actors.UpdateSubcomponentState>() |> ignore)

This is supposed to filter out all message except for the three explicitly specified types above, right? But I still see other messages, both from out domain and Akka system messages (e.g. Akka.Cluster.ClusterHeartbeatSender+Heartbeat) logged.

object commented 2 years ago

Hmm, tried AddIncludeMessageFilter, AddExcludeMessageFilter and AddMessageFilter (with ExcludeTypeFilterRule and IncludeTypeFilterRule), but I am still getting various message types logged.

Is the following understanding correct?

What happens if I try the combination of two? Anyway, looks like I can't get any filtering at all.

Aaronontheweb commented 2 years ago

So the filtering APIs have been, IMHO, intuitive at best and I'd like to simplify what we're doing there. Are you working against Phobos 2.0? I have more leeway to make major changes there and I think I can follow some of the instrumentation patterns that other OTel libraries are doing to accomplish the same goal. I'd be happy to prototype something today and show you what it looks like here.

object commented 2 years ago

I am still on Phobos 1.x. I am working on an OTel proof of concept to demo something tomorrow so I guess I won't be able to switch to Phobos now. So I appreciate a hint about what can be wrong with my filtering. But I can switch to Phobos 2.0 next week, if you think it will work better in that respect.

object commented 2 years ago

Right now I don't seem to get any filtering to work.

Aaronontheweb commented 2 years ago

I'll take a look at the 1.x bits right now then.

Aaronontheweb commented 2 years ago

So the way the filters work:

  1. By default, all messages that are already in a trace are included - this behavior can be changed via .IncludeMessagesAlreadyInTrace(false) on the builder. This drops a lot of trace traffic when this happens.
  2. As long as the traced message in Phobos matches any of your filters, it will be included in the trace.

Can you try .IncludeMessagesAlreadyInTrace(false) and see if that helps?

object commented 2 years ago

Thanks, will do.

Aaronontheweb commented 2 years ago

In the meantime, I'm still going to look at improving this as none of this is intuitive.

object commented 2 years ago

@Aaronontheweb I see that the number of irrelevant messages has drastically reduced. With a few exceptions, I only see messages from Include filter. Example of a message that is still logged is Akka.Cluster.Tools.PublishSubscribe.Internal.Status that is published by /system/distributedPubSubMediator (Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubMediator).

Aaronontheweb commented 2 years ago

@object that's great to hear - so I just found a bug in Phobos based on your feedback here:

/// <summary>
    ///     INTERNAL API
    ///     Used to implement filter rules.
    /// </summary>
    internal static class ClusterCustomFilteringRules
    {
        public static readonly ImmutableHashSet<IFilterRule> DistributedPubSubFilter =
            ImmutableHashSet<IFilterRule>.Empty.Add(new IncludeFunctionFilterRule(o =>
            {
                switch (o)
                {
                    // covers the bulk of useful messages
                    case IDistributedPubSubMessage i when i is IWrappedMessage:
                    case Subscribe _:
                    case SubscribeAck _:
                    case Unsubscribe _:
                    case UnsubscribeAck _:
                    case Put _:
                    case Remove _:
                        return true;
                    default:
                        return false;
                }
            }));
    }

We implemented this but..... It is not currently being used inside any of the filters 🤦

We'll push an update which includes this

Aaronontheweb commented 2 years ago

In theory your filter should already cover this without this filter - so there's likely another issue with the DistributedPubSub filtering specifically that I need to look into.

Aaronontheweb commented 2 years ago

Ok, looks like @Arkatufus has patched this - we will push a new Phobos release in the next day or two.

Aaronontheweb commented 2 years ago

Update on this - @Arkatufus did indeed patch the DistributedPubSub issue but I decided that our filtering system, generally, isn't robust enough because it doesn't do the following things well:

  1. Distinguish between traces that are explicitly rejected versus "not handled" by a filter - which makes it easy for subsequent rules to accidentally override a previous rule that rejected the message. Only one rule needs to return true for the message to be traced today.
  2. Clearly explain the order in which rules will be applied (they are applied in the order in which they are declared, currently) - and this makes it confusing for teams to build a complete filtration system that is inherently understandable.
  3. Lastly, the behavior around how we handle messages when they are already part of a trace is distinct from whether we start a new trace altogether - this is not clear to end-users at all and is rather unwieldy.

I have a draft PR open right now that makes some breaking changes to the filtering system in order to rectify this.

This code is all for the 1.x branch of Phobos but it will look similar for 2.x. We will release this update for both versions concurrently.

var phobosConfig = new PhobosConfigBuilder()
  .WithTracing(m =>
  {
      m.SetTracer(tracer);

      // don't include messages in trace that don't satisfy other filters
      m.VerboseTracing(false); // this setting is new
      m.IncludeMessagesAlreadyInTrace(false);

      // only accept FilteredMessage types
      m.AddIncludeMessageFilter<FilteredMessage>();
  });

Using the current filtering API, this code means "don't trace any messages under any circumstances unless they implement the FilteredMessage base class."

var phobosConfig = new PhobosConfigBuilder()
          .WithTracing(m =>
          {
              m.SetTracer(tracer);

              // don't include messages in trace that don't satisfy other filters
              m.VerboseTracing(false);
              m.IncludeMessagesAlreadyInTrace(true); // set to true this time

              // only accept FilteredMessage types
              m.AddIncludeMessageFilter<FilteredMessage>();
          });

Using the current filtering API, this code means "don't trace any messages under any circumstances unless they implement the FilteredMessage base class OR they are already part of a trace."

Finally, if the builder pattern isn't easy enough to use we're going to offer the option to replace it altogether with a single ITraceFilter implementation:

/// <summary>
/// Each <see cref="ActorSystem"/> can implement exactly one of these to control
/// the decision-making around which traces to include and which ones to exclude.
///
/// When this type is configured, it will replace all of the <see cref="IFilterRule"/>s
/// that have been previously defined.
/// </summary>
/// <seealso cref="FilterDecision"/>
public interface ITraceFilter
{
    /// <summary>
    /// Evaluates a message processed by an actor and determines whether or not to include it in the current
    /// or a new trace.
    /// </summary>
    /// <param name="message">The message currently being processed by an actor.</param>
    /// <param name="alreadyInTrace">Whether or not this message is already part of an active trace.</param>
    /// <returns><c>true</c> if we are going to trace the message, <c>false</c> otherwise.</returns>
    /// <remarks>
    /// Should return <c>true</c> unless you are explicitly trying to filter a given message out of your trace.
    /// </remarks>
    bool ShouldTraceMessage(object message, bool alreadyInTrace);
}

All of the filtering context is passed in via the method and all of the user-defined filtering rules exist in a single location for this ActorSystem, structured in exactly the order implemented by the user. None of the settings from the builder pattern apply when a custom ITraceFilter is used.

I should have a small demo video of this done soon that shows it in action.

object commented 2 years ago

This looks good and reasonable @Aaronontheweb. I am already on Phobos 2, so once you have a beta that supports it, I can give it a try.

Aaronontheweb commented 2 years ago

Hi @object - we have pushed new Phobos 2.0 binaries that have an API change that should nip this problem in the bud.

Please see the release notes here and follow the configuration instructions - I think this should help quite a bit: https://phobos.petabridge.com/articles/releases/RELEASE_NOTES.html#200-beta4httpssdkbincompublisherpetabridgeproductphobospackagesphobosactorversions200-beta4-and-150-beta1httpssdkbincompublisherpetabridgeproductphobospackagesphobosactorversions150-beta1-march-17th-2022

Aaronontheweb commented 2 years ago

Hi @object , have you been able to give this a try yet?

object commented 2 years ago

@Aaronontheweb Sitting with a test project now, will give you an update today or tomorrow.

object commented 2 years ago

@Aaronontheweb What happens with the intermediate spans that don't satisfy filter conditions?

Scenario 1. "don't trace any messages under any circumstances unless they implement the FilteredMessage base class."

We have a transaciton that consists of spans [FilteredMessage1; NotFilteredMessage; FilteredMessage2] The transaction contains NotFilteredMessage that should be filtered out. Will NotFilteredMessage in the middle terminate the transaciton so only the first FilteredMessage will be seen? If not, then how the transaction will be shown in an APM dashboard? Or there will be two separate transactions [FilteredMessage1] and [FilteredMessage2]?

Scenario 2. "don't trace any messages under any circumstances unless they implement the FilteredMessage base class OR they are already part of a trace."

A: [FilteredMessage1; NotFilteredMessage:; FilteredMessage2] For 2.A all messages should be included. B: [NotFilteredMessage; FilteredMessage1; FilteredMessage2] What happens here? Will the transaction start from FilteredMessage1 and include [FilteredMessage1; FilteredMessage2]? C: [NotFilteredMessage1; FilteredMessage1; NotFilteredMessage2; FilteredMessage2] Will the transaction include [FilteredMessage1; NotFilteredMessage2; FilteredMessage2]?