alexchowle opened 2 weeks ago
I've performed more testing and it seems the gRPC disconnection is irrelevant; the aged StartTimeUnixNano is used regardless.
@alexchowle care to give #3978 a run?
I've run the reproduction scenario and am still seeing the same issue.
Observation #1
-----------------
Metric #0
Descriptor:
-> Name: http.server.request.duration
-> Description: Duration of HTTP server requests
-> Unit: s
-> DataType: ExponentialHistogram
-> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
-> http.request.method: Str(POST)
-> http.route: Str(/sources/raw)
-> http.response.status_code: Int(202)
-> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1
Observation #2
-----------------
Metric #0
Descriptor:
-> Name: http.server.request.duration
-> Description: Duration of HTTP server requests
-> Unit: s
-> DataType: ExponentialHistogram
-> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
Data point attributes:
-> http.request.method: Str(POST)
-> http.route: Str(/sources/raw)
-> http.response.status_code: Int(202)
-> url.scheme: Str(http)
StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1
I'm pretty confident I'm using the correct whl files and that I'm picking up the new code, so I'm thinking I'm doing something wrong with my builds/imports of the wheels.
From the data above, you are making two observations with the following timestamps:
> Observation #1
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:50:51.262250586 +0000 UTC
> Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC
> ...
> Observation #2
> -----------------
> ...
> StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC
> Timestamp: 2024-06-26 18:56:51.263023031 +0000 UTC
> ...
Observation #1's `Timestamp` matches Observation #2's `StartTimestamp`: `Timestamp: 2024-06-26 18:51:01.262261958 +0000 UTC` == `StartTimestamp: 2024-06-26 18:51:01.226261958 +0000 UTC`.
This is expected: the start time of a collected DELTA-temporality exponential histogram data point is the collection time of the previous DELTA-temporality data point.
For this behavior to change to match the Golang behavior, the entire `_ExponentialBucketHistogramAggregation` would need to be reinstantiated after the first point is collected and before the second point is collected. I wonder if that is what Golang is doing; I imagine something like that happens when the gRPC connection drops, but I am not sure if that behavior is defined somewhere in the spec :shrug:
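For illustration, here is a minimal sketch of that handoff as I understand it (assumed behavior and made-up names, not the actual SDK code): an exported point starts where the previous exported point ended, and an empty collection cycle exports nothing.

```python
# A minimal sketch (assumed behavior, illustrative names only; not the
# actual SDK code) of the delta start-time handoff described above.
class DeltaHistogramSketch:
    def __init__(self, start_time_unix_nano: int) -> None:
        self._previous_collect_time = start_time_unix_nano
        self._count = 0

    def aggregate(self, value: float) -> None:
        self._count += 1

    def collect(self, collect_time_unix_nano: int):
        if self._count == 0:
            # Empty interval: nothing is exported and the handoff time is
            # not advanced, so the next point still starts at the previous
            # *exported* point's collection time.
            return None
        point = {
            "start_time_unix_nano": self._previous_collect_time,
            "time_unix_nano": collect_time_unix_nano,
            "count": self._count,
        }
        # Hand the collection time forward: the next point starts here.
        self._previous_collect_time = collect_time_unix_nano
        self._count = 0
        return point
```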
I'll ask @jmacd for guidance here.
I only noticed this issue because my vendor backend uses `StartTimeUnixNano` as the observation timestamp. My service has a very low throughput.
Sorry, please disregard my previous answer. I need to do more thinking about this issue and consider the export interval of the periodic exporter; I will comment here when I have a better answer :v:
@alexchowle I have created a branch that I think should fix your issue: #4002
I am just adding a commit so that the previous collection time gets updated with every collection, regardless of whether `aggregate` is called between collection cycles or not.
Please give it a try :v:
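Roughly, the change amounts to advancing that handoff time unconditionally. An illustrative sketch, replacing `collect()` in the sketch from my earlier comment (see #4002 for the actual diff):

```python
# Illustrative sketch of the fix (see #4002 for the actual change): advance
# the handoff time on every collection, even an empty one, so a later
# non-empty point cannot inherit a stale start time.
def collect(self, collect_time_unix_nano: int):
    start = self._previous_collect_time
    self._previous_collect_time = collect_time_unix_nano  # always advance
    if self._count == 0:
        return None  # empty interval: still nothing exported
    point = {
        "start_time_unix_nano": start,
        "time_unix_nano": collect_time_unix_nano,
        "count": self._count,
    }
    self._count = 0
    return point
```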
> This is expected: the start time of a collected DELTA-temporality exponential histogram data point is the collection time of the previous DELTA-temporality data point.

I disagree. The supplementary guidelines have a delta example which I believe covers this situation. Notice how the start/end time of the 3rd export is `(T2, T3]` despite there being no measurements in the previous interval. If your interpretation were correct, we would see the 3rd export have a start/end time of `(T1, T3]`.
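To make the expected timeline concrete (my paraphrase of that spec example, not a quote):

- export #1 covers `(T0, T1]` (measurements recorded)
- export #2 covers `(T1, T2]` (no measurements in this interval)
- export #3 covers `(T2, T3]` (its start is T2, not T1)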
Please take a look here
> @alexchowle I have created a branch that I think should fix your issue: #4002
> I am just adding a commit so that the previous collection time gets updated with every collection, regardless of whether `aggregate` is called between collection cycles or not. Please give it a try ✌️
I've just tried it. This looks very promising:
2024-06-29T10:12:33.955Z info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
-> telemetry.sdk.language: Str(python)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.26.0.dev0)
-> service.name: Str(otelworkbench)
-> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope __main__
Metric #0
Descriptor:
-> Name: my_histogram
-> Description: just a histogram
-> Unit: s
-> DataType: ExponentialHistogram
-> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:11:33.902642238 +0000 UTC
Timestamp: 2024-06-29 10:12:33.897396523 +0000 UTC
Count: 1
Sum: 1.000000
Min: 1.000000
Max: 1.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (0.999999, 1.000000], Count: 1
{"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-06-29T10:16:34.014Z info MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-06-29T10:16:34.014Z info ResourceMetrics #0
Resource SchemaURL:
Resource attributes:
-> telemetry.sdk.language: Str(python)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.26.0.dev0)
-> service.name: Str(otelworkbench)
-> service.version: Str(0.1.0)
ScopeMetrics #0
ScopeMetrics SchemaURL:
InstrumentationScope __main__
Metric #0
Descriptor:
-> Name: my_histogram
-> Description: just a histogram
-> Unit: s
-> DataType: ExponentialHistogram
-> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2024-06-29 10:16:33.904149745 +0000 UTC
Timestamp: 2024-06-29 10:16:34.00260972 +0000 UTC
Count: 1
Sum: 2.000000
Min: 2.000000
Max: 2.000000
Bucket [-1.000001, -1.000000), Count: 0
Bucket (1.999999, 2.000000], Count: 1
{"kind": "exporter", "data_type": "metrics", "name": "debug"}
Describe your environment
OS: CentOS 7
Python version: 3.11.1
SDK version: 1.25.0
API version: 1.25.0
What happened?
As part of investigating https://github.com/open-telemetry/opentelemetry-python/issues/3971, I found that the loss and subsequent re-establishment of a gRPC connection to a Collector caused the next metric export to have a very old StartTimeUnixNano value.
Steps to Reproduce
Have the OTel Collector drop the client's gRPC connection by setting `max_connection_age` to `5m` (as documented at https://github.com/open-telemetry/opentelemetry-collector/blob/main/config/configgrpc/README.md#server-configuration). So the first recording happens, then during the 6-minute sleep the Collector drops the client's gRPC connection within 5 minutes, and the client sends a second observation after the 6 minutes have elapsed.
Expected Result
StartTimeUnixNano has a value that is within 60 seconds of the 2nd recording. This is how the Golang SDK works.
Actual Result
StartTimeUnixNano has the TimeUnixNano of observation #1, i.e. about 6 minutes in the past.
Additional context
No response
Would you like to implement a fix?
None