open-telemetry / opentelemetry-dotnet

The OpenTelemetry .NET Client
https://opentelemetry.io
Apache License 2.0

[sdk-logs] Add Transform API on batch #5733

Closed CodeBlanch closed 3 months ago

CodeBlanch commented 4 months ago

Changes


Ran into an issue with a customer doing this (more or less):

        private ExportResult Export(in Batch<LogRecord> batch)
        {
            LogRecord[] storage = ArrayPool<LogRecord>.Shared.Rent((int)batch.Count);

            try
            {
                int storageCount = 0;
                int redactedFieldCount = 0;

                foreach (var logRecord in batch) // Drains the batch
                {
                    redactedFieldCount += ExecuteRedactionEngine(logRecord);
                    storage[storageCount++] = logRecord;
                }

                if (redactedFieldCount > 0)
                {
                    LogRedactedFieldSummary(redactedFieldCount);
                }          

                // Call inner exporter to actually perform the export
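                // (Note: by this point the LogRecords captured in storage may
                // already have been returned to the pool by the enumeration
                // above and reused; see the explanation below.)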
                return innerExporter.Export(
                    new Batch<LogRecord>(storage, storageCount)); // Create a new batch
            }
            finally
            {
                ArrayPool<LogRecord>.Shared.Return(storage);
            }
        }

ExecuteRedactionEngine is a bunch of logic for inspecting the log record and removing sensitive information. The customer specifically does this over the batch at export time so as not to block the application while it is creating the telemetry.

The issue here is that while foreach (var logRecord in batch) is enumerating, the completed LogRecords may be returned to the pool (if they were rented from it), so the records captured in storage can be recycled and reused before the inner exporter exports them.

To solve this, this PR proposes adding a new API on Batch<T> that can be used to do this processing safely by mutating the Batch<T> instance:

    private ExportResult Export(in Batch<LogRecord> batch)
    {
        var transformationState = new TransformationState
        {
             RedactionEngine = this.RedactionEngine,
        };

        batch.Transform(OnTransformLogRecord, ref transformationState);

        if (transformationState.RedactedFieldCount > 0)
        {
            LogRedactedFieldSummary(transformationState.RedactedFieldCount);
        }   

        return innerExporter.Export(in batch);
    }

    private static bool OnTransformLogRecord(LogRecord logRecord, ref TransformationState state)
    {
        state.RedactedFieldCount += state.RedactionEngine.ExecuteRedactionEngine(logRecord);
        return true;
    }

    private struct TransformationState
    {
        public RedactionEngine RedactionEngine;
        public int RedactedFieldCount;
    }
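
For reference, a plausible shape for the new Batch<T> API, inferred from the usage above (the exact delegate and method names are not spelled out in this description, so treat this as a sketch rather than the final surface):

    // Sketch only: signature inferred from the example above; the names, the
    // generic shape, and the exact filtering semantics are assumptions.
    public delegate bool BatchTransformationPredicate<T, TState>(T item, ref TState state)
        where T : class;

    // On Batch<T>:
    //   public bool Transform<TState>(
    //       BatchTransformationPredicate<T, TState> transformation,
    //       ref TState state);
    //
    // The callback runs once per item without draining the batch; returning
    // true keeps the item, while returning false presumably drops it, which is
    // the "filtering" capability discussed further down in this thread.

Passing the state by ref (instead of capturing it in a closure) keeps the callback allocation-free, which is presumably why the example pairs a static method with a mutable struct.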


codecov[bot] commented 4 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.34%. Comparing base (6250307) to head (5dbdced). Report is 285 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #5733      +/-   ##
==========================================
+ Coverage   83.38%   86.34%   +2.96%     
==========================================
  Files         297      254      -43     
  Lines       12531    11120    -1411     
==========================================
- Hits        10449     9602     -847     
+ Misses       2082     1518     -564     
```

| Flag | Coverage Δ |
|---|---|
| unittests | `?` |
| unittests-Project-Experimental | `86.32% <100.00%> (?)` |
| unittests-Project-Stable | `86.01% <100.00%> (?)` |

Flags with carried forward coverage won't be shown. See the [Codecov carryforward flags documentation](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) for details.

| Files | Coverage Δ |
|---|---|
| src/OpenTelemetry/Batch.cs | `99.42% <100.00%> (+1.88%)` :arrow_up: |
| src/OpenTelemetry/Internal/OpenTelemetrySdkEventSource.cs | `80.16% <100.00%> (-1.74%)` :arrow_down: |
| src/OpenTelemetry/Logs/Pool/LogRecordPoolHelper.cs | `100.00% <ø> (ø)` |

... and 196 files with indirect coverage changes.
CodeBlanch commented 4 months ago

@reyang I made some changes to the design after our initial offline discussion. Please take a look when you have a moment and let me know how you feel about it now.

github-actions[bot] commented 4 months ago

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

Kielek commented 3 months ago

@CodeBlanch, the feature looks good, but after a quick check against https://github.com/open-telemetry/opentelemetry-specification/tree/v1.35.0 there is no such transform API in the specification. I think it should be defined at the specification level first; then we could implement it. For now, it could be hidden behind the unstable features.

@pellared, FYI as you are working on a similar API for the Go SDK.

pellared commented 3 months ago

> FYI as you are working on a similar API for the Go SDK.

This looks like a violation of the SDK specification, which does not cover filtering. CC @open-telemetry/technical-committee

CodeBlanch commented 3 months ago

@pellared I hear you, but I don't think it is that cut and dried. This issue is caused by implementation details in .NET. Do other languages have a batch implementation based on a circular buffer backed by records from a pool? 😄 I looked at Java a bit; the batch there seems to be a simple ArrayList or Collection. If users want to massage that batch and then delegate the processing work to some other exporter, they may filter, transform, or augment the items, no? That will not work as expected in .NET due to our implementation.

The spec doesn't really say much at all about what shape the batch should have. Do we need it to be more opinionated? IMO the situation is that we made some perf optimizations which led to surprising behavior in code that would work with other SDKs. This API is to give users a way to do it safely. IMO it is as much within our purview to do that as it is to specialize our implementation(s) in the first place.

PS: Just to be clear, this isn't a hypothetical exercise. There are users doing this today, and they are currently experiencing data loss 😢

pellared commented 3 months ago

The Go batching log processor also uses a circular buffer. I will do my best to take a look later and compare the implementations. If you want, you can look as well: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/batch.go and https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/exporter.go

> then delegate the processing work to some other exporter, they may filter, transform, or augment the items, no?

These are scenarios not supported by the specification (sic!) and probably not by many SDKs either. A little reference: https://github.com/open-telemetry/opentelemetry-specification/pull/4067#issuecomment-2221593582

pellared commented 3 months ago

> This API is to give users a way to do it safely. IMO it is as much within our purview to do that as it is to specialize our implementation(s) in the first place.

Might there be other ways to do it?

The biggest concern is that you are not only adding "transform" capabilities but also "filtering", which is not covered by the specification. Why wouldn't you want to filter the records before they reach the batching processor?

I assume that adding a "transform" (without filtering capabilities) to the batching exporter will be totally acceptable given the implementation specifics (as you are using a pool).

CodeBlanch commented 3 months ago

@pellared

> Why wouldn't you want to filter the records before they reach the batching processor?

I kind of touched on this in the description:

> The customer specifically does this over the batch at export time so as not to block the application while it is creating the telemetry.

But let me expand on it.

Users could perform the transformation via a processor before records hit the batch; that is actually much easier to do than filtering 🤣 But it runs synchronously, before the "Log" call returns. In this case the transformation is complex: lots of scanning and regex execution and who-knows-what-else. The user decided to do this over the batch instead because, with the batch export processor, that work happens lazily on the dedicated export thread. The idea is to defer the expensive processing so the impact is not felt every time a log is emitted. Perhaps the spec is a bit naive in its prescription of where this type of thing must be done?
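
For reference, a minimal sketch of that synchronous alternative (BaseProcessor<LogRecord> and OnEnd are real SDK extension points; RedactionEngine is the hypothetical helper from the examples above):

    using OpenTelemetry;
    using OpenTelemetry.Logs;

    // Runs the redaction synchronously, on the thread that emitted the log,
    // before the record ever reaches the batch export processor.
    internal sealed class RedactionProcessor : BaseProcessor<LogRecord>
    {
        private readonly RedactionEngine redactionEngine;

        public RedactionProcessor(RedactionEngine redactionEngine)
        {
            this.redactionEngine = redactionEngine;
        }

        public override void OnEnd(LogRecord data)
        {
            // This is exactly the expensive scanning/regex work the customer
            // wants to keep off the logging hot path.
            this.redactionEngine.ExecuteRedactionEngine(data);
        }
    }

Registered ahead of the batch export processor (for example via OpenTelemetryLoggerOptions.AddProcessor), this works, but every Log call pays the redaction cost up front, which is exactly what the customer is trying to avoid.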

> I assume that adding a "transform" (without filtering capabilities) to the batching exporter will be totally acceptable given the implementation specifics (as you are using a pool).

This particular user actually does not need filtering; they just need transformation. I added the filtering bit because, why not? 😄 In my mind expensive filtering can also benefit from being performed lazily against the batch, so if we are adding this API it felt natural to also let it filter. If we come back later and need a separate filter API, that makes it (potentially) two calls, plus the awkwardness of having both a transform API and a filter API when the filter API can always transform anyway, given that LogRecord is mutable. I would be willing to drop filtering, but I would prefer to be pragmatic (help users accomplish their scenarios) vs dogmatic (not have it just because the spec doesn't say we must/may/should).

> Might there be other ways to do it?

Thinking outside the box: we could expose LogRecord.AddReference. If called correctly, that would prevent the LogRecords from going back to the pool when they are taken from one batch and put into another. But I do feel that is a worse design: it isn't very friendly and it leaks implementation details. Currently none of the pooling is exposed.
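
Purely for illustration, the customer's original loop would end up looking something like this under that hypothetical design (AddReference/Release do not exist in the SDK today; the names and semantics here are assumptions, sketched only to show the ergonomics):

    // Hypothetical API, not part of the SDK today.
    foreach (var logRecord in batch) // Still drains the batch
    {
        redactedFieldCount += ExecuteRedactionEngine(logRecord);

        // Caller must remember to pin every record so it cannot be returned
        // to the pool while it sits in storage...
        logRecord.AddReference();

        storage[storageCount++] = logRecord;
    }

    // ...and must remember to release every record once the inner exporter is
    // done with it, or the pool slowly drains. Forgetting either step silently
    // reintroduces the original bug, which is why this feels like the worse design.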

github-actions[bot] commented 3 months ago

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

github-actions[bot] commented 3 months ago

Closed as inactive. Feel free to reopen if this PR is still being worked on.