open-telemetry / opentelemetry-python

OpenTelemetry Python API and SDK
https://opentelemetry.io
Apache License 2.0

Incorrect StartTimeUnixNano for ExponentialHistogram with Delta Temporality after gRPC connection drop #3974

Status: Open. alexchowle opened this issue 2 weeks ago.

alexchowle commented 2 weeks ago

Describe your environment

OS: CentOS 7
Python version: 3.11.1
SDK version: 1.25.0
API version: 1.25.0

What happened?

As part of investigating https://github.com/open-telemetry/opentelemetry-python/issues/3971, I found that the loss and subsequent re-establishment of a gRPC connection to a Collector caused the next metric export to have a very old StartTimeUnixNano value.

Steps to Reproduce

import time

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import Histogram, MeterProvider
from opentelemetry.sdk.metrics.export import AggregationTemporality, PeriodicExportingMetricReader
from opentelemetry.sdk.metrics.view import ExponentialBucketHistogramAggregation, View

# grpc_collector_address and _create_resources() are defined elsewhere in my setup.
metrics_temporality = {
    Histogram: AggregationTemporality.DELTA
}
otlp_metric_exporter = OTLPMetricExporter(endpoint=grpc_collector_address, insecure=True,
                                          preferred_temporality=metrics_temporality, timeout=30_000)
metric_reader = PeriodicExportingMetricReader(otlp_metric_exporter,
                                              export_interval_millis=60_000)
meter_provider = MeterProvider(metric_readers=[metric_reader], resource=_create_resources(),
                               views=[View(instrument_type=Histogram,
                                           aggregation=ExponentialBucketHistogramAggregation())])
metrics.set_meter_provider(meter_provider)

meter = metrics.get_meter(__name__)

h = meter.create_histogram(
    name='my_histogram', unit='s', description='just my histogram'
)

h.record(1.0)
time.sleep(360)
h.record(2.0)

Have the OTel Collector drop the client's gRPC connection by setting the max_connection_age to 5m (as documented at https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configgrpc/README.md#server-configuration)

So the first recording happens; during the 6-minute sleep the Collector drops the client's gRPC connection (within 5 minutes); then the client records a second observation once the 6 minutes have elapsed.

Expected Result

StartTimeUnixNano has a value that is within 60 seconds of the 2nd recording. This is how the Golang SDK works.

Actual Result

StartTimeUnixNano has the TimeUnixNano of observation #1, i.e. about 6 minutes in the past.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:40.301137322 +0000 UTC
Timestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Count: 1
Sum: 0.021235
Min: 0.021235
Max: 0.021235
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.021235, 0.021235], Count: 1

Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-14 06:25:53.980223433 +0000 UTC
Timestamp: 2024-06-14 06:31:53.98886037 +0000 UTC
Count: 1
Sum: 0.018851
Min: 0.018851
Max: 0.018851
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.018851, 0.018851], Count: 1

Additional context

No response

Would you like to implement a fix?

None

alexchowle commented 1 week ago

I've performed more testing and it seems the gRPC disconnection is irrelevant; the aged StartTimeUnixNano is used regardless.
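
For reference, something along these lines should show the same behavior without a Collector in the loop, using the in-memory reader to stand in for the periodic 60-second collections (a rough sketch along the lines of what I tested, not the exact code):

import time

from opentelemetry.sdk.metrics import Histogram, MeterProvider
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality,
    InMemoryMetricReader,
)
from opentelemetry.sdk.metrics.view import ExponentialBucketHistogramAggregation

# Same delta temporality + exponential buckets as the original repro,
# but collected in-process instead of exported over gRPC.
reader = InMemoryMetricReader(
    preferred_temporality={Histogram: AggregationTemporality.DELTA},
    preferred_aggregation={Histogram: ExponentialBucketHistogramAggregation()},
)
provider = MeterProvider(metric_readers=[reader])
h = provider.get_meter(__name__).create_histogram(name="my_histogram", unit="s")

h.record(1.0)
reader.get_metrics_data()         # collection 1: exports the first point

for _ in range(5):
    time.sleep(60)
    reader.get_metrics_data()     # empty collections, mimicking the 60s reader cycles

h.record(2.0)
data = reader.get_metrics_data()  # final collection: exports the second point
point = data.resource_metrics[0].scope_metrics[0].metrics[0].data.data_points[0]
# With the buggy behavior, start_time_unix_nano is still collection 1's
# timestamp (~6 minutes old) instead of the most recent collection's.
print(point.start_time_unix_nano, point.time_unix_nano)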

xrmx commented 3 days ago

@alexchowle care to give #3978 a run?

alexchowle commented 3 days ago

> @alexchowle care to give #3978 a run?

I've run the reproduction scenario and I'm still seeing the same issue.

Observation # 1
-----------------
Metric #0                                                                                                
Descriptor:                                                                                              
     -> Name: http.server.request.duration                                                               
     -> Description: Duration of HTTP server requests                                                    
     -> Unit: s                                                                                          
     -> DataType: ExponentialHistogram                                                                   
     -> AggregationTemporality: Delta                                                                    
ExponentialHistogramDataPoints #0                                                                        
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1

Observation #2
-----------------
Metric #0
Descriptor:
     -> Name: http.server.request.duration
     -> Description: Duration of HTTP server requests
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
     -> http.request.method: Str(POST)
     -> http.route: Str(/sources/raw)
     -> http.response.status_code: Int(202)
     -> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1

I'm pretty confident I'm using the correct whl files (screenshot omitted).

I'm picking up the new code (screenshot omitted).

I'm thinking I'm doing something wrong with my builds/imports of the wheels.

ocelotl commented 2 days ago

From the data above you are making 2 observations, with the following timestamps:

> Observation # 1
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
> Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
> ...
> Observation # 2
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
> Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
> ...

Observation #1's Timestamp matches Observation #2's StartTimestamp:
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC == StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC.

This is expected: the start time of a DELTA-temporality exponential histogram data point is the collection time of the previously collected DELTA-temporality data point.

For this behavior to change to match the Golang behavior, the entire _ExponentialBucketHistogramAggregation would need to be reinstantiated after the first point is collected and before the second one is collected. I wonder if that is what Golang is doing; I imagine something like that happens when the gRPC connection drops, but I am not sure whether that behavior is defined somewhere in the spec :shrug:

I'll ask @jmacd for guidance here.

alexchowle commented 2 days ago

I only noticed this issue because my vendor backend uses StartTimeUnixNano as the observation timestamp. My service has a very low throughput.

ocelotl commented 2 days ago

Sorry, please disregard my previous answer. I need to do more thinking about this issue and consider the export interval of the periodic exporter; I will comment here when I have a better answer :v:

ocelotl commented 2 days ago

@alexchowle I have created a branch that I think should fix your issue: #4002

I am just adding a commit so that the previous collection time gets updated with every collection, regardless of whether aggregate is called between collection cycles.
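
Roughly, the idea is along these lines (the class and attribute names in this sketch are illustrative only, not the actual code in #4002):

import time

class DeltaAggregationSketch:
    """Illustrative sketch only; not the real SDK internals touched by #4002."""

    def __init__(self):
        self._previous_collection_nanos = time.time_ns()
        self._accumulated = []  # measurements recorded since the last collection

    def aggregate(self, value: float) -> None:
        self._accumulated.append(value)

    def collect(self):
        collection_nanos = time.time_ns()
        start_nanos = self._previous_collection_nanos
        # The change: advance the previous collection time on *every* collection,
        # even when nothing was aggregated this cycle. Otherwise the next
        # non-empty point inherits a start time from many cycles ago.
        self._previous_collection_nanos = collection_nanos
        if not self._accumulated:
            return None  # nothing to export this cycle
        values, self._accumulated = self._accumulated, []
        return {
            "start_time_unix_nano": start_nanos,
            "time_unix_nano": collection_nanos,
            "count": len(values),
            "sum": sum(values),
        }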

Please give it a try :v:

jack-berg commented 1 day ago

> This is expected: the start time of a DELTA-temporality exponential histogram data point is the collection time of the previously collected DELTA-temporality data point.

I disagree. The supplementary guidelines have a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If your interpretation were correct, we would see the 3rd export have a start / end time of (T1, T3].
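
To spell out the comparison with illustrative collection times T1 < T2 < T3 (not the exact rows of the guidelines table), suppose measurements arrive only in (T0, T1] and (T2, T3], and (T1, T2] is empty:

# Interval carried by the export at each collection (None = no data point that cycle).
per_guidelines = {"T1": ("T0", "T1"), "T2": None, "T3": ("T2", "T3")}  # 3rd export: (T2, T3]
as_observed    = {"T1": ("T0", "T1"), "T2": None, "T3": ("T1", "T3")}  # 3rd export: (T1, T3]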

ocelotl commented 1 day ago

> > This is expected: the start time of a DELTA-temporality exponential histogram data point is the collection time of the previously collected DELTA-temporality data point.
>
> I disagree. The supplementary guidelines have a delta example which I believe covers this situation. Notice how the start / end time of the 3rd export is (T2, T3] despite there being no measurements in the previous interval. If your interpretation were correct, we would see the 3rd export have a start / end time of (T1, T3].

Please take a look here

alexchowle commented 19 hours ago

> @alexchowle I have created a branch that I think should fix your issue: #4002
>
> I am just adding a commit so that the previous collection time gets updated with every collection, regardless of whether aggregate is called between collection cycles.
>
> Please give it a try ✌️

I've just tried it. This looks very promising:

Observation # 1

2024-06-29T10:12:33.955Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:11:33.902642238 +0000 UTC
Timestamp: 2024-06-29 10:12:33.897396523 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Observation # 2

2024-06-29T10:16:34.014Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-06-29T10:16:34.014Z        info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> telemetry.sdk.language: Str(python)
     -> telemetry.sdk.name: Str(opentelemetry)
     -> telemetry.sdk.version: Str(1.26.0.dev0)
     -> service.name: Str(otelworkbench)
     -> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope __main__ 
Metric #0
Descriptor:
     -> Name: my_histogram
     -> Description: just a histogram
     -> Unit: s
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:16:33.904149745 +0000 UTC
Timestamp: 2024-06-29 10:16:34.00260972 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}