open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Unique request id and other packet metadata in OTLP #2862

Open alexander-shepel opened 2 years ago

alexander-shepel commented 2 years ago

We are developing our own system to accept metrics data in OTLP format and would like to have some means to troubleshoot customer issues related to missing/duplicate/slowly delivered metrics packets. There are several problems we would like to address here:

Therefore we suggest adding metadata to the OTLP request that would, at a minimum, allow clients to send the following information with each packet:

1. Unique request / packet id. The format and contents of this id should be defined by the client; it should be unique per the blob of data inside. This will allow the server to detect (and optionally reject) duplicates.
2. Retry delay. This is the difference between the oldest metric timestamp in a packet and the time when the request was sent to the server.
3. Retry attempt. This is useful to understand how bad the network connection is between client and server (or gateway).

This metadata should be cumulative, i.e. each application load balancer / gateway aware of OTLP traffic should add its own entry with the metadata above. If OTLP packets are merged or split by the gateway, this will still allow the server to track where final ingestion delay comes from for each metric in a packet.
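
To make the shape of this concrete, here is a minimal sketch of what one such per-hop metadata entry and the cumulative list could look like; the type and field names are hypothetical and not part of OTLP:

```go
package main

import "time"

// HopMetadata is a hypothetical per-hop entry that a client SDK, agent, or
// gateway aware of OTLP traffic would append to the request. All field names
// here are illustrative, not part of any existing OTLP message.
type HopMetadata struct {
	SenderID     string        // identifies the SDK/agent/gateway that added this entry
	PacketID     string        // unique per serialized blob, chosen by the sender
	RetryAttempt int           // 1 for the first send, incremented on each retry
	RetryDelay   time.Duration // oldest metric timestamp in the packet vs. send time
	SentAt       time.Time     // when this hop sent the request onward
}

// RequestMetadata is the cumulative list carried alongside the OTLP payload,
// one entry per hop (SDK, agent, load balancer, gateway, ...).
type RequestMetadata struct {
	Hops []HopMetadata
}
```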

tigrannajaryan commented 2 years ago

Unique request / packet id. The format and contents of this id should be defined by the client; it should be unique per the blob of data inside. This will allow the server to detect (and optionally reject) duplicates.

This was discussed and rejected in the past due to problems with intermediaries.

I am open to discussing this again, but at the very least we need to address the questions that were raised in the past.

Retry delay. This is the difference between the oldest metric timestamp in a packet and the time when the request was sent to the server.

Why does this need to be part of the protocol? It seems that the server can easily compute this from the request (assuming roughly that the "time sent"="time received").
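
For illustration, that server-side computation could look roughly like the following sketch, where the datapoint timestamps are assumed to have already been extracted from the decoded request:

```go
package main

import "time"

// ingestionDelay derives the delay from the request contents alone, assuming
// roughly that "time sent" = "time received": it is the age of the oldest
// datapoint at the moment the request arrived.
func ingestionDelay(datapointTimes []time.Time, receivedAt time.Time) time.Duration {
	if len(datapointTimes) == 0 {
		return 0
	}
	oldest := datapointTimes[0]
	for _, t := range datapointTimes[1:] {
		if t.Before(oldest) {
			oldest = t
		}
	}
	return receivedAt.Sub(oldest)
}
```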

If OTLP packets are merged or split by the gateway, this will still allow the server to track where final ingestion delay comes from for each metric in a packet.

Unclear how to maintain this cumulatively in the face of batching, splitting and other transformations that can happen in intermediaries.

tigrannajaryan commented 2 years ago

Retry attempt. This is useful to understand how bad the network connection is between client and server (or gateway).

Alternatively, instead of making this a protocol feature, make it an observable metric of the client.

tigrannajaryan commented 2 years ago

cc @jmacd @bogdandrutu

jmacd commented 2 years ago

At Lightstep, we use a (hash of the) combination of all identifying fields, including for example metric datapoint timestamps, to compute an "idempotency key" for OTLP data. We use this information to avoid repeated application of metric data, for example. Although I also have (in the past) proposed adding a unique request ID, I no longer believe it's needed.
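
A rough sketch of such an idempotency key is shown below; the choice of fields is illustrative, not Lightstep's actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// idempotencyKey hashes the identifying fields of a single metric datapoint
// (resource attributes, metric name, timestamps) into a stable key that a
// backend can use to avoid applying the same data twice.
func idempotencyKey(resourceAttrs map[string]string, metricName string,
	startTimeUnixNano, timeUnixNano uint64) string {

	h := sha256.New()
	// Hash resource attributes in a deterministic order.
	keys := make([]string, 0, len(resourceAttrs))
	for k := range resourceAttrs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, resourceAttrs[k])
	}
	fmt.Fprintf(h, "%s;%d;%d", metricName, startTimeUnixNano, timeUnixNano)
	return hex.EncodeToString(h.Sum(nil))
}
```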

alexander-shepel commented 2 years ago

@tigrannajaryan, thank you for the quick review of this issue. Before I address your comments, I want to mention that my proposal comes from experience with an existing large metrics system, where similar concepts are already implemented in our own flow between client and server. As we look to expand into the OTLP world, we would like to have similar capabilities / SLA guarantees for the clients that will use OTLP instead of our own protocol and/or agent. At a high level, these capabilities help us solve two business problems:

This was discussed and rejected in the past due to problems with intermediaries.

I am open to discussing this again, but at the very least we need to address the questions that were raised in the past.

I've reviewed that proposal and our use case is a bit different:

So the main use case for us here is to help ourselves help our customers: to be able to research whether data was never delivered or was somehow received twice. We do have our own custom ingestion agent running on a client, which helps here (we can see when it reports e.g. a connection timeout), but in an agent-less scenario this header becomes even more important.

Also, this header is "cumulative": we have our own Internet-facing gateway (intermediary), and this gateway adds its own unique id, retry attempt and delay while preserving the same info from the client agent or SDK. This approach somewhat resembles distributed tracing.

Why does this need to be part of the protocol? It seems that the server can easily compute this from the request (assuming roughly that the "time sent"="time received").

There are two reasons for this:

Unclear how to maintain this cumulatively in the face of batching, splitting and other transformations that can happen in intermediaries.

Apologies, I think I wasn't clear here: the idea is that the intermediary doesn't repackage OTLP (or any other) packets; it understands and writes only a "batch" format in which each packet is an opaque blob (black box) plus some metadata on top of it, including the unique id and retry information. Opening the packets and repackaging them on the intermediary would be too expensive.
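
For illustration, such a batch envelope could look roughly like this; the names are hypothetical and this is not an existing OTLP or Collector format:

```go
package main

// BatchEnvelope sketches the "blob + metadata" format described above: the
// intermediary forwards the serialized OTLP request untouched and only reads
// and writes the surrounding metadata. All names here are hypothetical.
type BatchEnvelope struct {
	PacketID     string // unique id assigned by the sender of this envelope
	RetryAttempt int    // incremented by the sender on each retry
	Payload      []byte // opaque, already-serialized OTLP request (never re-encoded here)
}
```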

Alternatively, instead of making this a protocol feature, make it an observable metric of the client.

In this case, if the client has already serialized a packet and the 1st attempt failed, it would need to serialize it again with the new "retry attempt" = 2 value. Sending the retry attempt as e.g. a header is cheaper in this case (our agents send lots of metrics, and using less CPU and memory on the client is important to us). Also, since we could have an intermediary that doesn't open or even understand the format of the packets, the intermediary can only provide the same information outside the packet, so it is simpler to keep it outside the packet in both cases.
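
As a sketch of the "retry attempt as a header" idea, assuming an OTLP/HTTP exporter and a hypothetical X-Retry-Attempt header (not defined by OTLP), the payload is serialized once and only the header changes between attempts:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"strconv"
)

// sendWithRetries resends the same serialized OTLP payload on failure and
// conveys the attempt number in a header, so the body is never re-encoded.
// The "X-Retry-Attempt" header name is hypothetical.
func sendWithRetries(url string, payload []byte, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		req.Header.Set("Content-Type", "application/x-protobuf")
		req.Header.Set("X-Retry-Attempt", strconv.Itoa(attempt))

		resp, err := http.DefaultClient.Do(req)
		if err == nil && resp.StatusCode < 300 {
			resp.Body.Close()
			return nil
		}
		if resp != nil {
			resp.Body.Close()
		}
	}
	return fmt.Errorf("giving up after %d attempts", maxAttempts)
}
```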

I hope this long read clarifies where I am coming from with this proposal. If there are other ways OTLP can help solve the original business problems above, I am willing to consider those as well.

tigrannajaryan commented 2 years ago

the idea is that intermediary doesn't repackage OTLP (or any other) packets, it understands and writes only the "batch" format where each packet is a blob (black box) + some metadata on top of it including unique id and retry information. Opening the packets and repackaging them on intermediary would be too expensive.

@alexander-shepel The most important OpenTelemetry intermediary is the OpenTelemetry Collector. The Collector deserializes the payloads, applies all sorts of transformations to the payload, then serializes it again before sending. There is no good way to preserve the original request metadata in the Collector. It is unclear how your proposal will work with the Collector.

If you want to make a more elaborate proposal about the features you want to add, we will be happy to review it, but please be aware that the Collector is an important use case for us, and features that don't work well with the Collector will be difficult or impossible to get accepted.

Either way we will need a proper design document to review, authored by you or anyone else who wants these capabilities. I believe this is a non-trivial proposal that requires an OTEP (see https://github.com/open-telemetry/oteps).

Please make sure to address the following in the OTEP:

Also the 3 capabilities that you want don't seem to be interdependent, so they can be proposed as 3 different OTEPs that can be reviewed and accepted/rejected individually (unless there is some unifying underlying mechanism that requires them to be added at the same time).

jmacd commented 2 years ago

@alexander-shepel I support the basic claim that, without a request identifier in OTLP, implementing correct at-most-once aggregation logic on the write path is fundamentally expensive. Lightstep would prefer not to have to compute such an "idempotency key" as we are currently doing.

@tigrannajaryan points out why this is a challenge for OTel. I want to point out that there are solutions here; it's just that they are also expensive. If an OTel collector could be configured with an "idempotency processor" that would ensure at-most-once delivery of OTLP data into its output pipeline, then the output request could use an entirely new request ID.
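
A rough sketch of what such a hypothetical processor could do is shown below: drop requests whose incoming id was already seen, and forward the rest under a fresh, collector-assigned id. None of these names exist in the Collector today:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// idempotencyProcessor is a hypothetical component that ensures at-most-once
// delivery into the output pipeline: duplicate incoming request ids are
// dropped, and surviving data gets an entirely new outgoing request id.
type idempotencyProcessor struct {
	mu   sync.Mutex
	seen map[string]struct{} // incoming request ids already forwarded
}

func newIdempotencyProcessor() *idempotencyProcessor {
	return &idempotencyProcessor{seen: make(map[string]struct{})}
}

// process reports whether the request should be forwarded and, if so, the
// fresh request id to use on the outgoing request.
func (p *idempotencyProcessor) process(incomingID string) (outgoingID string, forward bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, dup := p.seen[incomingID]; dup {
		return "", false
	}
	p.seen[incomingID] = struct{}{}
	return newRequestID(), true
}

// newRequestID generates a random 128-bit id; error handling is omitted in
// this sketch.
func newRequestID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}
```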

However, if the collector is configured with certain transformations and does not ensure at-most-once semantics for its input, then it is easy for the mechanism that Lightstep is using for idempotency to break. I think the path forward for adding a request ID in OTel is to first clarify how to achieve at-most-once delivery with OTLP data.

alexander-shepel commented 2 years ago

@tigrannajaryan, @jmacd, thank you for your responses!

I understand that the proposal should focus on the Collector, so I was thinking about how it could be generalized to support the Collector (or a chain of Collectors in a complex merge/split scenario of metric streams). Here is a high-level idea on which I would like to get some quick feedback before I go the OTEP way.

The core of the idea is to make the Collector report a latency metric about itself (I know this happens already), and also make it add the spans of this "standard" latency metric to each exported packet. Additionally, if the "standard" latency metric spans are present in the incoming packet, the Collector could aggregate multiple incoming streams (where necessary), add its own spans and export them further. I looked at the Collector documentation and found this:

End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline).

I think this is close to what I suggest; however, I couldn't find this metric in the Collector implementation, so let me assume how this latency could be recorded and which attributes it might have:

Let each Collector add two spans to the "standard" latency metric of type Sum (assume delta temporality); a rough code sketch follows the list below:

  1. Span about "in-flight" time of this packet (added during parsing of the incoming packet):

    1. StartTimeUnixNano = TimeUnixNano of the last span of the same metric in the incoming OTLP packet (or the last span of any metric in this packet if the latency metric is not present - this should be the case for the first Collector in the processing chain).
    2. TimeUnixNano = current time.
    3. Value = 1.
    4. Attributes: prior_collector_id and prior_packet_id (where available as collector_id and packet_id attributes from previous span in the incoming packet), collector_id, type = "in-flight".
    5. In case the Collector merges several streams (packets), it might either merge the spans of this metric from all packets together and also add its own span, where the values in the "prior_..." attributes are merged from all incoming packets as well (though I am not sure whether this results in a valid set of metric spans), or it may aggregate them somehow (in the very simple case, just create a single "aggregate" span representing all prior steps). This is where more elaboration is required.
  2. Span about "processing" time of this packet (added in exporter):

    1. StartTimeUnixNano = TimeUnixNano of the span above.
    2. TimeUnixNano = current time.
    3. Value = 1.
    4. Attributes: collector_id, packet_id (whatever exporter implementation wants to use as id - hash, GUID, etc), type = "processing".
    5. In case there are several exporters, they will all send a copy of the spans of the latency metric to different destinations.
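
To make the two data points above more concrete, here is a rough sketch of how a Collector might construct them; the types are simplified stand-ins for OTLP Sum datapoints, not the actual Collector pdata API:

```go
package main

import "time"

// latencyPoint is a simplified stand-in for a Sum datapoint of the proposed
// "standard" latency metric (delta temporality), carrying the attributes
// described in the list above.
type latencyPoint struct {
	StartTimeUnixNano uint64
	TimeUnixNano      uint64
	Value             int64
	Attributes        map[string]string
}

// makeLatencyPoints builds the "in-flight" and "processing" points for one
// packet passing through one Collector. lastIncomingTime is the TimeUnixNano
// of the last latency span found in the incoming packet (or of any metric,
// for the first Collector in the chain).
func makeLatencyPoints(lastIncomingTime uint64, priorCollectorID, priorPacketID,
	collectorID, packetID string, parsedAt, exportedAt time.Time) []latencyPoint {

	inFlight := latencyPoint{
		StartTimeUnixNano: lastIncomingTime,
		TimeUnixNano:      uint64(parsedAt.UnixNano()),
		Value:             1,
		Attributes: map[string]string{
			"prior_collector_id": priorCollectorID,
			"prior_packet_id":    priorPacketID,
			"collector_id":       collectorID,
			"type":               "in-flight",
		},
	}
	processing := latencyPoint{
		StartTimeUnixNano: inFlight.TimeUnixNano,
		TimeUnixNano:      uint64(exportedAt.UnixNano()),
		Value:             1,
		Attributes: map[string]string{
			"collector_id": collectorID,
			"packet_id":    packetID,
			"type":         "processing",
		},
	}
	return []latencyPoint{inFlight, processing}
}
```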

With this approach, the "standard" latency metric becomes a first-class citizen in the target system (i.e. yet another metric), but it will also allow for some automation around it if all Collectors 1) implement such a metric, 2) implement it in the same way and 3) export it by default.

As for our specific asks, this solves the SLA problem, as this metric will be the source of information about each Collector's latency, and depending on where the boundary between the customer's and our own set of Collectors lies, we could distinguish customer vs. our own processing time. This approach may also be used to track duplicate packets (for example, a Collector can track all "packet_id"s encountered during a certain sliding time window in all latency metric spans, and report an error when it sees the same id twice), but this is not specifically a goal for us - we will only record the "packet_id" that crosses the boundary between the customer and our system.
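
A small sketch of such a sliding-window duplicate check follows; the lazy eviction strategy and window handling are illustrative choices only:

```go
package main

import (
	"sync"
	"time"
)

// duplicateTracker remembers every packet_id seen within a sliding time
// window and flags repeats, as described above. The window length and
// eviction strategy are illustrative, not part of any proposal.
type duplicateTracker struct {
	mu     sync.Mutex
	window time.Duration
	seen   map[string]time.Time // packet_id -> when it was first seen
}

func newDuplicateTracker(window time.Duration) *duplicateTracker {
	return &duplicateTracker{window: window, seen: make(map[string]time.Time)}
}

// observe records a packet_id and reports whether it was already seen within
// the window. Expired entries are evicted lazily on each call.
func (d *duplicateTracker) observe(packetID string, now time.Time) (duplicate bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for id, t := range d.seen {
		if now.Sub(t) > d.window {
			delete(d.seen, id)
		}
	}
	if _, ok := d.seen[packetID]; ok {
		return true
	}
	d.seen[packetID] = now
	return false
}
```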