Feature request: semantic conventions for (non-rejection) ingestion errors leading to truncation/mutation

michaelsafyan commented 3 months ago

Area(s)

area:telemetry

Is your change request related to a problem? Please describe.

If a backend telemetry system enforces limits on the size, number, etc. of attributes, it can fail ingestion of the entire batch or of individual signals in the batch, but there is no standard way to accept a signal with some kind of truncation and surface that fact to users.

Describe the solution you'd like

I would suggest standardizing certain attributes related to instrumentation ingestion that provide a way to indicate that the signal was accepted but had to be mutated/modified/truncated in order to be accepted by the system.

As a strawman proposal, something along the lines of:

Standardize telemetry.backend.ingestion_log as the name of a special span event to be created by backends.
Standardize the following attributes of telemetry.backend.ingestion_log:
- severity: ERROR or WARNING
- subject: names the property it is about (e.g. resource.attributes, span.attributes, events[0].attributes)
- message: free-form message (e.g. "Limit exceeded", "Nearing limit", etc.)
- limit: the numeric value of the limit, if the entry concerns a limit
- unit: if not a count, specifies the unit (e.g. bytes)
- actual: the observed value, if available
- consequences: a list of records that describe various consequences like:
```
      {
          "type": "CONTAINER_DROPPED",  # e.g. dropped entire event
      }
```
```
      {
          "type": "ITEMS_DROPPED",  # e.g. dropped entire attributes
          "count": 50,
      }
```
```
      {
          "type": "ITEMS_TRUNCATED",  # e.g. attributes kept but modified
          "count": 50,
      }
```
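To make the strawman concrete, here is a minimal sketch of how a backend might assemble such an event. The helper names (`Consequence`, `make_ingestion_log`) are hypothetical, the event is modeled as a plain dict rather than a real SDK or OTLP type, and how the nested `consequences` records would map onto OTel attribute value types is left open.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Event name taken from the strawman proposal above.
EVENT_NAME = "telemetry.backend.ingestion_log"


@dataclass
class Consequence:
    """Hypothetical record for one entry of the proposed `consequences` list."""
    type: str                    # e.g. "CONTAINER_DROPPED", "ITEMS_DROPPED", "ITEMS_TRUNCATED"
    count: Optional[int] = None  # left out for CONTAINER_DROPPED

    def to_dict(self) -> dict:
        record: dict = {"type": self.type}
        if self.count is not None:
            record["count"] = self.count
        return record


def make_ingestion_log(severity: str, subject: str, message: str,
                       limit: Optional[int] = None, unit: Optional[str] = None,
                       actual: Optional[int] = None,
                       consequences: tuple = ()) -> dict:
    """Build a plain-dict stand-in for the proposed span event."""
    attributes: dict[str, Any] = {
        "severity": severity,
        "subject": subject,
        "message": message,
    }
    # Optional attributes are only emitted when a value is known.
    for key, value in (("limit", limit), ("unit", unit), ("actual", actual)):
        if value is not None:
            attributes[key] = value
    attributes["consequences"] = [c.to_dict() for c in consequences]
    return {"name": EVENT_NAME, "attributes": attributes}


event = make_ingestion_log(
    severity="WARNING",
    subject="span.attributes",
    message="Limit exceeded",
    limit=128,
    actual=178,
    consequences=(Consequence("ITEMS_DROPPED", 50),),
)
```

The limit and observed values used here (128 and 178) are made up for illustration only.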

Describe alternatives you've considered

1. More strict validation by backends (either accept or reject entire spans in whole).
2. Surface partial acceptance/mutation/truncation some other, vendor-specific, non-standard way.

With respect to 1, it would be good to give users more control over the behavior. A possible option would be to be lenient by default, but to support certain additional headers or options in OTLP for enabling stricter validation. If both modes exist, then there needs to be some way to surface ingestion errors.

With respect to 2, it is useful to have standardization here, especially as it relates to export; if truncation happens before export, it is useful for the errors to be represented in the OTel data format rather than some format outside of OTel.

Additional context

No response

dashpole commented 3 months ago

Does anyone know if the dropped_attributes_count, etc. fields are intended to be modified outside of the SDK? For example, if a lower attribute count limit is imposed by a backend, can/should it increment the dropped_attributes_count?
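To illustrate what is being asked, here is a rough sketch of a backend applying a lower attribute count limit and adding the result to the existing counter. Whether a backend is allowed to do this is exactly the open question; the span is a simplified dict stand-in for the OTLP message, and `enforce_attribute_limit` is a hypothetical helper.

```python
def enforce_attribute_limit(span: dict, max_attributes: int) -> dict:
    """Return a copy of `span` with at most `max_attributes` attributes.

    Attributes beyond the limit are dropped, and the number dropped here is
    added to whatever the SDK already recorded in dropped_attributes_count.
    """
    attributes = span.get("attributes", {})
    kept = dict(list(attributes.items())[:max_attributes])
    newly_dropped = len(attributes) - len(kept)
    return {
        **span,
        "attributes": kept,
        "dropped_attributes_count": span.get("dropped_attributes_count", 0) + newly_dropped,
    }


# The SDK already dropped 2 attributes; the backend limit of 3 drops 2 more.
incoming = {
    "name": "GET /items",
    "attributes": {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5},
    "dropped_attributes_count": 2,
}
stored = enforce_attribute_limit(incoming, max_attributes=3)
```

With this approach a reader of the stored span cannot tell which drops came from the SDK and which from the backend, which is part of what the question is getting at.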

joaopgrassi commented 2 months ago

For 2:

Surface partial acceptance/mutation/truncation some other, vendor-specific, non-standard way.

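This is already possible with the OTLP Partial success spec (https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/specification.md#partial-success). It reports how many items were accepted vs. rejected, and back-ends can use the error_message field to describe what was limited and why.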

We had discussions about introducing a more fine-grained, typed response, but that can get complicated very quickly. For example, the receiver would need to keep an index or some sort of order to tell the exporter which log/metric/span was rejected and why. See https://github.com/open-telemetry/opentelemetry-proto/issues/470 and https://github.com/open-telemetry/opentelemetry-proto/issues/404 for some prior discussions.

michaelsafyan commented 2 months ago

Thanks for pointing to the partial success spec. This request is intended to cover gaps that exist with the current specification...

Firstly, partial success addresses batch-level failure: it is possible to accept part of a batch. However, items within the batch are either accepted or rejected. There isn't a way to partially accept an individual span, such as by accepting some of its attributes but not others.

Secondly, partial success/failure reports the failure to the client, which may or may not be logging these failures in a way that is visible or obvious to downstream viewers/consumers of the information. When a span has been modified in order to be accepted, it is desirable for the warnings or errors related to its ingestion (and the fact that the data may not be 100% faithful to what was originally written) to be surfaced and easily available in whatever context the span is available/displayed.

joaopgrassi commented 2 months ago

There isn't a way to partially accept an individual span such as by accepting some of its attributes but not others.

Hm, why do you think there isn't a way? This is solely the responsibility of the receiver. For example, in Dynatrace we validate attributes and their format. We accept telemetry that has invalid attributes by either dropping them or massaging them to fit our requirements. In both cases our OTLP APIs return a partial success, in which these changes to the original telemetry are reported to the client. The partial success spec partially covers such cases; the error_message field can be used to surface such things:

Servers MAY also use the partial_success field to convey warnings/suggestions to clients even when the server fully accepts the request. In such cases, the rejected_<signal> field MUST have a value of 0, and the error_message field MUST be non-empty.
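As a rough illustration of the quoted rule, the sketch below models a "warning only" partial success for traces. The dataclass is a plain Python stand-in that mirrors the `rejected_spans` and `error_message` fields of the OTLP `ExportTracePartialSuccess` message, not the generated protobuf class; `warning_response` and `is_warning_only` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass
class ExportTracePartialSuccess:
    """Stand-in for the OTLP message of the same name (not the generated class)."""
    rejected_spans: int = 0
    error_message: str = ""


def warning_response(notes: list) -> ExportTracePartialSuccess:
    """Everything was accepted; any notes describe what the receiver changed."""
    # rejected_spans stays 0 because no span was rejected outright.
    return ExportTracePartialSuccess(rejected_spans=0, error_message="; ".join(notes))


def is_warning_only(partial: ExportTracePartialSuccess) -> bool:
    """True when the response carries a warning but rejects nothing."""
    return partial.rejected_spans == 0 and bool(partial.error_message)


response = warning_response([
    "50 attributes dropped: attribute count limit exceeded",
    "3 attribute values truncated",
])
```

Note that the message is free-form text for the whole request, so it does not identify which span was modified.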

Secondly, partial success/failure reports the failure to the client which may or may not be logging these failures in a way that is visible or obvious to downstream viewers/consumers of the information

This should be implemented in each SDK and to my knowledge it is. There's an entry in the compatibility matrix, so that can be used to keep track (not sure now if it's up-to-date). We also created issues in each repo asking the SDKs to log partial success messages.

My feeling is that while it would be possible to come up with consistent conventions to surface such errors/warnings, I'm not sure we should, or that it even makes sense to do so.

As I said, to be able to pinpoint exactly which span/metric/log had issues, OTLP receivers need to keep state, and all of this puts pressure on them. In high-load scenarios this is definitely not ideal. I feel what we have now with partial success is a good middle ground that offers enough info to troubleshoot and identify problems in the telemetry.

joaopgrassi commented 2 months ago

@michaelsafyan ping on this. I'm inclined to close this as nothing to do, but please let me know if you'd like to continue this discussion or have other arguments.
