microsoft / ApplicationInsights-Java

Application Insights for Java
http://aka.ms/application-insights

Processor equivalence with 3.x to drop dependency based on a criteria #3102

Open mercer opened 1 year ago

mercer commented 1 year ago

Is your feature request related to a problem? Please describe. I'd like to drop SQL dependency spans whose duration is under a certain threshold. In 2.x and .NET this is easy to do using a TelemetryProcessor or ITelemetryProcessor.

I have digested https://learn.microsoft.com/en-us/azure/azure-monitor/app/java-standalone-telemetry-processors and I don't see how this would work.

Describe the solution you would like An example would be great. The documentation could also include more real-world examples.

Describe alternatives you have considered I considered downgrading to 2.x, but we need 3.x. We only have this problem in the java stack, not in .net

Additional context Nothing else I can think of.

heyams commented 1 year ago

@mercer can you try sampling overrides?

mercer commented 1 year ago

@heyams can you provide an example where SQL dependencies get sampled only if duration > 50 ms? There are two parts to this problem:

  1. the duration attribute
  2. the logic to match for sampling with a threshold, for example, value < 50

I'd appreciate an example here. (Already tried to get inspired from "make noisy dependency call example").

In the meantime I will turn self-diagnostics to debug. However, I'd prefer not to reverse engineer this, and to work from documentation, if possible.

mercer commented 1 year ago

So, the equivalent in 2.x would be something like

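import com.microsoft.applicationinsights.extensibility.TelemetryProcessor;
import com.microsoft.applicationinsights.telemetry.RemoteDependencyTelemetry;

// 2.x telemetry processor that drops fast, successful SQL dependency calls.
// SqlDependencyFilterOptions is our own configuration properties class.
public class SqlDependencyFilterProcessor implements TelemetryProcessor {
    private final TelemetryProcessor next;
    private final SqlDependencyFilterOptions options;

    public SqlDependencyFilterProcessor(TelemetryProcessor next, SqlDependencyFilterOptions options) {
        this.next = next;
        this.options = options;
    }

    @Override
    public boolean process(com.microsoft.applicationinsights.telemetry.Telemetry telemetry) {
        // Returning false drops the item; failed or slow SQL calls are kept.
        if (options.isEnabled()
                && telemetry instanceof RemoteDependencyTelemetry
                && ((RemoteDependencyTelemetry) telemetry).getSuccess()
                && ((RemoteDependencyTelemetry) telemetry).getDuration().toMillis() <= options.getDurationThresholdMSecs()
                && "SQL".equalsIgnoreCase(((RemoteDependencyTelemetry) telemetry).getType())) {
            return false;
        }
        // Everything else goes to the next processor, if there is one.
        return next == null || next.process(telemetry);
    }
}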

wired with

@Configuration
@EnableConfigurationProperties(SqlDependencyFilterOptions.class)
public class ApplicationInsightsConfiguration {

    @Bean
    public SqlDependencyFilterProcessor createSqlDependencyFilterProcessor(TelemetryProcessor next, SqlDependencyFilterOptions options) {
        return new SqlDependencyFilterProcessor(next, options);
    }

    @Bean
    public TelemetryProcessor telemetryProcessorChain(SqlDependencyFilterProcessor processor) {
        TelemetryProcessor baseProcessor = TelemetryConfiguration.getActive().getTelemetryProcessorChainBuilder().getBaseTelemetryProcessor();
        TelemetryConfiguration.getActive().getTelemetryProcessorChainBuilder().addLast(processor);
        TelemetryConfiguration.getActive().getTelemetryProcessorChainBuilder().build();
        return baseProcessor;
    }
}
mercer commented 1 year ago

A bit more context:

  1. Sometimes we have batch jobs. What we noticed is that the extra dependency calls add about $150 in cost for each hour of batch. And that data is not particularly useful, unless these dependency calls have unexpected latency or they fail. Sometimes these batches may take 5-24 hours.
  2. Now, in dotnet, we already solved this problem with an equivalent approach (using an ITelemetryProcessor)
  3. And, as we have already upgraded to 3.x in java, we want to fix this in the java stack 3.x as well.
heyams commented 1 year ago

@mercer i can come up with an example, but it will be helpful if you can share a sample app so that i can create a fix based on your app? My sql example's attributes will be different from yours. Or even better, let's have a quick call and I can show you how to locate the attributes and then apply sampling override? please email me at helen.yang@microsoft.com to further discuss.

heyams commented 1 year ago

@mercer can you try DCR?

You can apply a filter rule on dependencies. It's done via Log Analytics, and the equivalent table is AppDependencies. Please try the following rule and let us know if that works for your scenario:

source
| where Type != "SQL" or DurationMs > 100

Currently, we do not have any filtering mechanism for dependencies based on duration. If a data collection rule doesn't work for you, please get back to me so that my team can find an alternative solution.

mercer commented 1 year ago

@heyams thanks for your swift answer, I will try your suggestion for data collection rules today. I hope this solution solves the cost problem -- batches introduce anomalies in cost patterns with low-value telemetry data, and this anomaly needs to be dealt with using different sampling rules than "normal" traffic.

In the meanwhile, I had a few other questions regarding potential options, all the questions are in the context of 3.x java client.

  1. Is there a way to add a field at runtime in 3.x for dependencies (or any other traces)? For my use case, I could add the fact that it is a bulk, and then in applicationinsights.json I would sample on the custom field.
  2. Is there a way to change the general sampling value dynamically at runtime? I would use this to react dynamically to the mode of the app, either automatically, or with a technical feature flag. I'm thinking here of any option other than re-generating applicationinsights.json and redeploying the app.
  3. Because applicationinsights-agent-3.4.13.jar includes the generic io.opentelemetry.javaagent code, is there a way to extend the code and override the behavior? I know you already answered that there is no programmatic filtering available, but I wondered if there is an option for us to build it ourselves, given the underlying library follows an open principle.
mercer commented 1 year ago

@heyams I did an evaluation for adding a rule, but I don't see how I can configure a rule to apply to data to be sent to an appinsights instance, as targeted by the connection string.

I'm prompted to provide a datasource, and I can't match any option to my expectation, that is, to have the rule apply to the appinsights instance.

For instance, I'd like to test the setup from a local instance of the app, connecting to a custom appinsights instance, and see the rule in action.

heyams commented 1 year ago

@mercer there are 3 ways to create a DCR. can you follow this tutorial?

Each App Insights Resource has a link to workspace, which is on the overview blade on the Azure Portal.

heyams commented 1 year ago

In the meanwhile, I had a few other questions regarding potential options, all the questions are in the context of 3.x java client.

  1. Is there a way to add a field at runtime in 3.x for dependencies (or any other traces)? For my use case, I could add the fact that it is a bulk, and then in applicationinsights.json I would sample on the custom field.

[heyams] you can try custom dimensions and then use sampling overrides to filter telemetry

  2. Is there a way to change the general sampling value dynamically at runtime? I would use this to react dynamically to the mode of the app, either automatically, or with a technical feature flag. I'm thinking here of any option other than re-generating applicationinsights.json and redeploying the app.

[heyams] can you try something like this:

  1. Create an attribute key for the different modes of the app:
    Span.current().setAttribute("mode", "mode1");
  2. Put the following in applicationinsights.json (more details in the documentation on inherited attributes):
{
  "inheritedAttributes": [
    {
      "key": "mode",
      "type": "string"
    }
  ]
}

Then telemetry from each mode of the app will be tagged with "mode=mode1", where "mode1" is the value that was set in step 1.

  3. Then you can use a sampling override to change the sampling rate based on that attribute key-value pair. Please give it a try.

  3. Because applicationinsights-agent-3.4.13.jar includes the generic io.opentelemetry.javaagent code, is there a way to extend the code and override the behavior? I know you already answered that there is no programmatic filtering available, but I wondered if there is an option for us to build it ourselves, given the underlying library follows an open principle.

[heyams] Please try out a data collection rule. If that doesn't work, we can discuss further to find a solution that meets your needs. If you use a custom version of our agent, you will need to update it whenever we have a new release.

heyams commented 1 year ago

Is there a way to change the general sampling value dynamically at runtime? I would use this to react dynamically to the mode of the app, either automatically, or with a technical feature flag. I'm thinking here of any option other than re-generating applicationinsights.json and redeploying the app.

@mercer regarding this question, I've suggested inherited attributes above. however, there is a better approach without requiring any code changes.

You can use custom dimensions

{
  "customDimensions": {
    "mytag": "appMode",
    "anothertag": "${ANOTHER_VALUE}"
  }
}

ANOTHER_VALUE is an environment variable you set for your app. For each mode of your app, you can set it to a different value. Then you can use a sampling override to change the sampling rate based on this configuration. Hope that helps.

microsoft-github-policy-service[bot] commented 1 year ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.

mercer commented 1 year ago

@heyams sorry for not responding earlier.

We felt like we couldn't make this work in a straightforward way, and downgrading to 2.x wasn't the right call, as we already had some things set up in the 3.x fashion.

The way we mitigated the cost was to apply a simple 50% sampling rate to SQL dependencies:

{
  "preview": {
    "sampling": {
      "overrides": [
        {
          "telemetryType": "dependency",
          "attributes": [
            {
              "key": "db.system",
              "value": "mssql",
              "matchType": "strict"
            }
          ],
          "percentage": 50
        }
      ]
    }
  }
}

I think the 3.x rewrite is missing functionality, especially around custom processors. Sampling overrides are inferior to the 2.x TelemetryProcessor, or to dotnet's ITelemetryProcessor. Before, you could apply any logic to sampling (or anything else, really), while now only a few predefined scenarios are supported. I hope that this system will not be ported as is to dotnet.

Also, I believe the documentation can be improved. For example, which fields (attributes) can sampling overrides be configured on?

In any case, thanks for all the time you put into answering my questions @heyams, I hope this ticket may help improve the 3.x appinsights client for java!

heyams commented 1 year ago

@mercer does DCR work? I will experiment with something on the upstream side to see if I can come up with an alternative solution. In the meantime, please give DCR a try if you haven't tried it yet. Thanks.

mercer commented 1 year ago

@heyams we did not invest more time into making DCR work either, because it seems too heavy for us.

We would need to provision these rules at subscription level, while this is just a service. So in order to have this in prod, we would need:

  1. decide ownership over the rules
  2. have a pipeline to provision the generic rules
  3. document the process
  4. test cross environments
  5. train DRIs
  6. and of course, make it work in the first place
mattmccleary commented 1 year ago

@mercer - Are you open to a 30-minute meeting to discuss why KQL Ingestion Transforms are too heavyweight? We want to understand your scenario a bit better so we can improve. If so, can you shoot me a quick email at mmcc@microsoft.com? I'll be back in the office 7/5 to respond and set up a call.

mercer commented 1 year ago

The scenario is the same as the initial description.

I'd like to drop SQL dependency spans whose duration is under a certain threshold. In 2.x and .NET this is easy to do using a TelemetryProcessor or ITelemetryProcessor.

Given that dropping SQL dependency spans whose duration is under a certain threshold is already possible in 2.x of java and in the current dotnet appinsights clients, the need to add more infrastructure to solve a problem with 3.x is too heavyweight, even if it works.

I should be able to decide which spans leave my process in code.

I'm happy to discuss this requirement, but if the answer is to add/configure infrastructure, then the process will remain heavyweight. Why shouldn't I be allowed to prevent 99% of telemetry traffic at source? I understand that there is an option to "fix" the problem further down the pipeline, in a generic way, for all data collected, and this may even be a way to prevent costs. However, this should be an option, not "the only way" to sample data.

I should be able to sample data at source based on any criteria -- again, this already works in 2.x java client and dotnet client, the capability is removed in 3.x java client due to rewrite to follow OpenTelemetry.

heyams commented 5 months ago

@mercer since 3.5 GA, we added support for the OpenTelemetry java extensions.

Now, you can use the extension to have your own span exporter and filter data based on any criteria. Here is my sample on filtering out spans based on duration. Please let me know if you can give it a try.

Sorry for taking this long to unblock you.

mercer commented 5 months ago

Had a look at https://github.com/Azure-Samples/ApplicationInsights-Java-Samples/tree/main/opentelemetry-api/java-agent/TelemetryFilteredBaseOnRequestDuration but I can't seem to find where I would configure that requests under 5s should not be ingested.

mercer commented 5 months ago

Is there a way to configure this for dependencies as well? My initial issue was to sample database dependencies that are under a threshold, say 10ms.

heyams commented 5 months ago

Had a look at https://github.com/Azure-Samples/ApplicationInsights-Java-Samples/tree/main/opentelemetry-api/java-agent/TelemetryFilteredBaseOnRequestDuration but I can't seem to find where I would configure that requests under 5s should not be ingested.

it's under extensions folder DurationSpanExporter

please read the readme.

-Dotel.javaagent.extensions=../extensions/FilterSpanBasedOnDuration/target/FilterSpanBasedOnDuration-1.0-SNAPSHOT.jar

main logic is in the ../extensions/FilterSpanBasedOnDuration.

heyams commented 5 months ago

Is there a way to configure this for dependencies as well? My initial issue was to sample database dependencies that are under a threshold, say 10ms.

yes, same idea. it's creating your own span exporter. you can filter any span based on any criteria.
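
The "own span exporter" idea can be sketched as a wrapper that forwards only the items passing a predicate. This is a minimal sketch, not the code from the linked sample: `BatchExporter` is a hypothetical stand-in interface so that the example runs with only the JDK. In an actual agent extension the wrapper would presumably implement OpenTelemetry's `SpanExporter` and be registered through an `AutoConfigurationCustomizerProvider` (for example via `addSpanExporterCustomizer`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for an exporter interface; NOT the OpenTelemetry type.
interface BatchExporter<T> {
    void export(Collection<T> batch);
}

// Wraps another exporter and forwards only the items that pass the predicate.
final class FilteringExporter<T> implements BatchExporter<T> {
    private final BatchExporter<T> delegate;
    private final Predicate<T> keep;

    FilteringExporter(BatchExporter<T> delegate, Predicate<T> keep) {
        this.delegate = delegate;
        this.keep = keep;
    }

    @Override
    public void export(Collection<T> batch) {
        List<T> kept = new ArrayList<>();
        for (T item : batch) {
            if (keep.test(item)) {
                kept.add(item);
            }
        }
        // Skip the downstream call entirely when nothing survives the filter.
        if (!kept.isEmpty()) {
            delegate.export(kept);
        }
    }
}

final class FilteringExporterDemo {
    // Sends durations (in ms) through a filter that keeps only the slow ones
    // and returns what reached the downstream exporter.
    static List<Long> exportSlowOnly(List<Long> durationsMillis, long thresholdMillis) {
        List<Long> sent = new ArrayList<>();
        BatchExporter<Long> sink = sent::addAll;
        BatchExporter<Long> filtered =
                new FilteringExporter<>(sink, d -> d > thresholdMillis);
        filtered.export(durationsMillis);
        return sent;
    }

    public static void main(String[] args) {
        System.out.println(exportSlowOnly(Arrays.asList(3L, 120L, 9L, 75L), 50L));
    }
}
```

Because the filter sits in front of the real exporter, dropped items never leave the process, which is the at-source behavior asked for earlier in this thread.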

mercer commented 5 months ago

Ok, do you have an example how I would differentiate a dependency from a trace?

In other words, using the example https://github.com/microsoft/ApplicationInsights-Java/issues/3102#issuecomment-1570285032, how would one port the code from 2.x to 3.x for this particular use case?
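
One way the 2.x condition might translate is shown below, as a sketch under stated assumptions. `SpanView` is a hypothetical, simplified stand-in for a finished span (in a real extension the equivalent data would come from OpenTelemetry's `SpanData`, e.g. `getKind()`, `getAttributes()`, `getStartEpochNanos()` and `getEndEpochNanos()`). It assumes a SQL dependency shows up as a CLIENT span carrying a `db.system` attribute, which is the key used in the sampling override earlier in this thread. Log traces are, as far as I understand, not spans at all, so they would not reach a span exporter in the first place.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified view of a finished span; NOT the OpenTelemetry SpanData type.
final class SpanView {
    final String kind;                    // e.g. "CLIENT", "SERVER", "INTERNAL"
    final Map<String, String> attributes; // e.g. db.system=mssql
    final long startEpochNanos;
    final long endEpochNanos;
    final boolean failed;

    SpanView(String kind, Map<String, String> attributes,
             long startEpochNanos, long endEpochNanos, boolean failed) {
        this.kind = kind;
        this.attributes = attributes;
        this.startEpochNanos = startEpochNanos;
        this.endEpochNanos = endEpochNanos;
        this.failed = failed;
    }

    long durationMillis() {
        return TimeUnit.NANOSECONDS.toMillis(endEpochNanos - startEpochNanos);
    }
}

// Mirrors the 2.x processor: drop successful database calls at or under the threshold.
final class SqlDependencyFilter {
    private final long thresholdMillis;

    SqlDependencyFilter(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    boolean shouldDrop(SpanView span) {
        // Assumption: an outgoing database call is a CLIENT span with db.system set.
        boolean isDatabaseCall = "CLIENT".equals(span.kind)
                && span.attributes.containsKey("db.system");
        return isDatabaseCall
                && !span.failed
                && span.durationMillis() <= thresholdMillis;
    }

    // Helper for the demo: builds a database span with the given duration.
    static SpanView sqlSpan(long durationMillis, boolean failed) {
        return new SpanView("CLIENT",
                Collections.singletonMap("db.system", "mssql"),
                0L, TimeUnit.MILLISECONDS.toNanos(durationMillis), failed);
    }

    public static void main(String[] args) {
        SqlDependencyFilter filter = new SqlDependencyFilter(10);
        System.out.println(filter.shouldDrop(sqlSpan(4, false)));  // fast and successful
        System.out.println(filter.shouldDrop(sqlSpan(40, false))); // slow
        System.out.println(filter.shouldDrop(sqlSpan(4, true)));   // fast but failed
    }
}
```

Keeping the decision in a small predicate like `shouldDrop` means the threshold and the matching rule can be unit tested without starting the agent.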

mercer commented 5 months ago

As a side note, I think you should popularize how the 3.x java agent works with blog posts, technical documentation and so on. For example, I can find no blog posts today about AutoConfigurationCustomizerProvider. From the outside, it gives me the impression that no one uses java version 3.x.