open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0
274 stars 175 forks source link

Seeking input on proposal for "reference"-type attribute values. #1428

Open michaelsafyan opened 1 month ago

michaelsafyan commented 1 month ago

Area(s)

area:new

Is your change request related to a problem? Please describe.

Some attributes may may be very large such as:

In order to provide a very tight SLO and to provide consistent performance, an operations backend might provide very tight limits on the size of payloads/content/attributes. However, it may still be desirable when this is the case to be able to index most of the properties/attributes as well as to cross-link to an alternative storage solution (with a looser SLO) for full, large data.

In the OTel Collector Contrib repository, there is an open issue for a new connector component ("New component: Blob Attribute Uploader Connector") which attempts to resolve the above issue by proposing a connector that will, in the instrumentation/collector, do the following:

  1. Upload the full attribute data to a blob storage system (such as Google Cloud Storage, Amazon S3, Azure Blob storage, or -- in the future -- any other blob storage backend supported by the CDK or that chooses to contribute to the connector).

  2. Remove the original attribute that would otherwise be too large.

  3. Inject attribute(s) that contain references to where the data was uploaded.

It is with respect to this last one, that we need an established standard / data model for how to represent the uploaded data. The "SetInMap" function in the "foreign attribute" internal library of the draft connector shows the solution currently being implemented/pursued. However, I'd like to get more community feedback/input in order to ensure that there is agreement on the approach.

Describe the solution you'd like

Summary:

Before uploading After uploading
somekey somekey.ref.uri
somekey.ref.content_type [OPTIONAL]

Formally:

A backend system that sees an attribute matching the pattern "${prefix}.ref.uri" should assume that "${prefix}.ref.uri" contains a URI that reports that location of the original, full value of "${prefix}". If a key named "${prefix}" also is present, it is likely that "${prefix}" contains a truncated and/or redacted copy of the original value, with "${prefix}.ref.uri" pointing to the location where the original, unadulterated, full version of the value has been recorded. If "${prefix}.ref.uri" exists, then "${prefix}.ref.content_type" may also optionally exist, containing a MIME type describing the data stored at the URI in "${prefix}.ref.uri".

Prerequisites / Interactions:

If this course of action were to be taken, we would need two changes to the "Attribute Naming" specification in OTel:

  1. Reserve ".ref." so that it can only be used for this purpose (".ref." to be used only for information about reference-typed values and their metadata).

  2. Exempt this usage from the rule that "Names SHOULD NOT coincide with namespaces"

With respect to the latter exemption, our understanding is that this rule exists to allow coaelescing the flat attributes into a structured object. We believe that ".ref." fits the spirit of this even while violating the literal rule as it exists now, because the ".ref." does not introduce a conceptually new attribute; rather, it replaces the existing attribute with a different representation. Thus if a backend were to implement a coalescing system to make attributes non-flat, they could combine all of the ".ref." attributes into a single "ReferenceValue" type as the value for the top-level attribute that does not contain any ".ref." value in it.

Describe alternatives you've considered

Use a compound data type

For example, we could use a "KeyValueList kvlist_value" to represent the reference not as a bunch of separate attributes but as a single complex attribute value. However, this would go contrary to the attribute specification as well as existing OTel libraries which do not provide direct access to the underlying data model in OTLP following this convention restriction.

Introduce a new data type

We could attempt to introduce into OTLP (and into other parts of the specification) a new "ReferenceValue" data type. However, this would require significant effort across multiple languages, libraries, backend systems, etc. to support and thus seems like a non-starter.

Just move the large attributes to be events/logs

This assumes that the event/logs backend can, itself, accept arbitrarily large data. This assumption may not hold true for every event/log backend system. There is an inherent tradeoff between reliabilty/latency and the amount of data that a backend can allow; to provide a very tight latency SLO that caps 99%-ile latency, it's important to guarantee low variance in the size of requests which, in turn, may limit the amount of data that an event/log backend can accept (while being able to provide such a tight time bound). When using such a system for indexing the metadata about the events and making them quickly available for searching, it would still be useful to have a client-side mechanism to route larger content that is not critical for indexing to a blob storage system more suitable for large object storage and to be able to make it easy to interlink and navigate between backend systems by referencing the storage location.

Replace/upload just the entire body of events/logs

While this might serve the purpose of "gen_ai.prompt" / "gen_ai.completion" if/when it moves from a span event to a more general event, it would still be necessary to provide a data model for representing that the event body had been uploaded/replaced with a reference. In addition, this would not cover many other cases (e.g. "http.request.body.content"), and -- beyond this -- it is desirable to upload a much more targeted/limited portion of the data in order to allow indexing / searching the other content that does fit within backend subsystem limits.

Additional context

Examples

Example 1: http.response.body.content [Span Attribute]

Before

# TracesData [Before]
resource_spans: {
  resource: { … }
  scope_spans: {
     scope: { … }
     spans: {
         trace_id: …
         span_id: …
         …
         attributes: {
            key: "http.response.body.content"
            value: { string_value: "{ \"results\": [ … ] " }   # very long API JSON repsonse
          }
         }
         …
     }
  }
}

After

# TracesData [After]
resource_spans: {
  resource: { … }
  scope_spans: {
     scope: { … }
     spans: {
         trace_id: …
         span_id: …
         …
         attributes: {
            key: "http.response.body.content.ref.uri"
            value: { string_value: "gs://some-bucket/traceAttachments/12345/7890/response.json" }
          }
          attributes: {
             key: "http.response.body.content.ref.content_type"
             value: { string_value: "application/json" }
            }
         }
         …
     }
  }
}

Example 2: gen_ai.prompt / gen_ai.completion [Span Event Attribute]

Before

# TracesData [Before]
resource_spans: {
  resource: { … }
  scope_spans: {
     scope: { … }
     spans: {
         trace_id: …
         span_id: …
         …
         events: {
           # Note that this precise event name is mandatory per:
           # https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
           name: "gen_ai.content.prompt"
           # …
           attributes: {
              key: "gen_ai.prompt"
              value: { string_value: "Imagine that there is a very long text prompt here…." }
            }
           }
         }
        events: {
             # Note that this precise event name is mandatory per:
             # https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
             name: "gen_ai.content.completion"
             attributes: {
                 key: "gen_ai.completion"
                 value: { string_value: "{ \"completions\": [...]}"  }  # very long completion JSON
              }
           }
        }
         …
     }
  }
}

After

# TracesData [After]
resource_spans: {
  resource: { … }
  scope_spans: {
     scope: { … }
     spans: {
         trace_id: …
         span_id: …
         …
         events: {
           # Note that this precise event name is mandatory per:
           # https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
           name: "gen_ai.content.prompt"
           # …
           attributes: {
              key: "gen_ai.prompt.ref.uri"
              value: { string_value: "s3://bucket/some/path/to/the/prompt.txt" }
            }
            attributes: {
               key: "gen_ai.prompt.ref.content_type"
               value: { string_value: "text/plain" }
             }
           }
         }
        events: {
             # Note that this precise event name is mandatory per:
             # https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
             name: "gen_ai.content.completion"
             attributes: {
                 key: "gen_ai.completion.ref.uri"
                 value: { string_value: "azblob://account/container/path/to/response.json" }
              }
              attributes: {
                  key: "gen_ai.completion.ref.content_type"
                  value: { string_value: "application/json" }
              }
           }
        }
         …
     }
  }
}
lmolkova commented 1 month ago

Thanks for the proposal!

I think we really need some way to capture large data that, somewhere in the pipeline, can be uploaded to a storage that's better designed for such data.

A few feedback points on the details:

  1. We avoid having an unbound data in attribute values in semantic conventions. We don't have http.request|response.body.content defined and if someone proposed such attributes, I'd recommend defining request/response contents as events. One exception is log.record.original which might in theory be huge.

    I'd also recommend users put any large content into the event bodies - events volume could be controlled using common logging verbosity if someone does not want them, but for attributes someone would need to come up with custom feature-flags or drop attributes in a custom processor wasting performance on collecting them.

    So I think we need a solution for uploading an event body (or a specific jpath in the body) but it could be the same solution as for uploading an attribute value if that's also important.

  2. How:

    • I'd entertain the idea of putting ref. first to avoid collision. I.e. gen_ai.prompt would become ref.gen_ai.prompt. We can also document root ref (or similar) namespace for it. Or adding _ref suffix (e.g. gen_ai.prompt_ref) which would also not result in a collision.
    • Since we need this mechanism for events too, we might need to rename the event as a result since we're going to change the event structure. E.g. gen_ai.choice event would become ref.gen_ai.choice. The alternative would be to keep the same event name and just replace the removed value with null and add a reference as an attribute.
    • We'll need to figure out what to do with the original value. Perhaps it could be configurable and done by the processor.

In any case, I'm really supportive of this proposal with a caveat that large data should not be populated on span attributes and we should not encourage people to do it.

michaelsafyan commented 1 month ago

Thanks for the support of the general direction lmolkova!

Event body

I think we need a solution for uploading an event body (or a specific jpath in the body)

I emphatically agree. I would like it to be possible to perform this kind of write-aside solution with any relevant signal type (not just attributes, not just in spans). That would include uploading a subfield of the event body (targeted using a JSON Path, JMESPath, or similar). Ideally, this would be accomplished with the data model and implementation.

I'd also recommend users put any large content into the event bodies … a caveat that large data should not be populated on span attributes and we should not encourage people to do it

Our team is fully supportive of following OTel conventions and encouraging others to follow OTel conventions, so I don't think there is any conflict here in terms of what we recommend to our customers. Our team will aim to optimize the cases that follow OTel conventions and will encourage our customers to adopt said conventions, because they align with the "well lit path" in our systems which we prioritize. If the convention says that this data should be placed in an event body rather than in span attributes, then that is what we will encourage and do.

Original value

We'll need to figure out what to do with the original value

I agree with you that it ought to be configurable. Potential options might be to drop it entirely before sending to the telemetry destination or to truncate it (possibly with redaction) before sending along to the telemetry endpoint. The latter would make it easier for a telemetry backend to provide users with a preview of the content before linking to the full uploaded document.

Attribute naming

I think that there are a lot of different valid naming schemes. Some things which I consider important to the naming are:

  1. Ability to add metadata - the URI is all that should be mandatory, but I think it would be ideal if the naming scheme were flexible to being able to provide additional metadata about the object that was uploaded like content type, size, a hashing algorithm, hash.

  2. Implementation efficiency / flexibility - we should choose a naming structure/scheme that least constrains implementation and enables highly efficient implementation

  3. Not colliding with existing practices - we should ensure that whatever naming we choose does not overlap with any existing usage/names.

Between the two options you suggested (ref. as a prefix vs _ref as a suffix), I am more favorable to _ref as a suffix, because I think it might be useful for the reference attribute to sort along with the attribute it replaces. This way, an implementation that uses binary search to find attributes could minimize the amount of searching to locate the nearby _ref attributes.

However, I am somewhat concerned that an attribute ending in _ref could already exist (in vendor-specific or customer-specific attributes not captured in OTel Semantic Conventions documentation). I'm also wondering about the ability to add other metadata about the reference; would we, for example, need to ban the use of _ref_ anywhere in attribute names in order to support a _ref_uri, _ref_content_type, and _ref_metadata (for any new metadata) suffixes?

Event Naming

I was assuming that event names would remain unchanged and that event handlers should be able to gracefully handle missing expected attributes (and optionally add support for handling reference attributes in place of the original attributes). But perhaps I am missing something. Can you share more regarding why the event name would need to be changed? Is there a particular breakage to existing backends that you would expect this to cause? (Shouldn't backends already gracefully handle the case where expected attributes are absent?)

tedsuo commented 1 month ago

fwiw, while we do have some legacy span attributes that may contain large values, we have since clarified that the intention of attributes is for indexing, not for storing large blobs of data. It should be possible to store attributes in a column store without concern that this will create problems with values measured in megabytes.

Any solution that can avoid putting large values in attributes should be avoided, and we should focus on making these values event payloads.

michaelsafyan commented 1 month ago

Any solution that can avoid putting large values in attributes should be avoided

Agreed. This will help us deal with "legacy span attributes that may contain large values", by making them into smaller values (just a URI and possibly other metadata like a content-type) so that backends don't need to deal with the possibility of very large attribute values. [But, indeed, we should encourage a different approach for new data].

we should focus on making these values event payloads

Agreed. However, it is worth noting that this doesn't magically solve all of the problems for all backends. Some backends may also have limits on the size of such payloads, too. [Especially where GenAI use cases begin introducing things like entire PDFs, images, etc. in prompt/response logging].

Since event payloads and attributes use a common data model (AnyValue), I'm thinking that it should ideally be possible to use a common data model to describe a write-aside solution that uploads targeted pieces of the attributes and/or event bodies so that the particular signal type or field path in the signal isn't pertinent to the solution.

michaelsafyan commented 3 weeks ago

Just checking back in ... what are the next steps here?

michaelsafyan commented 1 week ago

It looks like the next step is to actually propose the update.

There is a pending PR in https://github.com/open-telemetry/semantic-conventions/pull/1521

It was brought up that VCS already uses .ref., so we will need a different segment.

Let's discuss the specific name alternatives in the pull request in https://github.com/open-telemetry/semantic-conventions/pull/1521