Clarify the valid content in primitive type `string`

XSAM commented 3 months ago

What are you trying to achieve?

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not, for primitive type string of attribute on the common page.

A primitive type: string, boolean, double precision floating point (IEEE 754-1985) or signed 64 bit integer.

Additional context.

Baggage value only allows UTF-8 string. https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/baggage/api.md?plain=1#L48

Attribute type mapping defines a way to handle non-UTF-8 in a string, but its doc status is Experimental. https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/common/attribute-type-mapping.md?plain=1#L109-L113

Log data model defines the string represents as a Unicode string (UTF-8). https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/logs/data-model-appendix.md?plain=1#L492

There is no clear guideline to determine whether a string type allows arbitrary bytes or not.

XSAM commented 3 months ago

If we allow arbitrary bytes in the string, we need to support the bytes type. Otherwise, the arbitrary bytes value won't be accessible when SDK reads the date from OTLP. (protobuf won't allow arbitrary bytes in string)

pyohannes commented 3 months ago

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not

I'm not sure if that's desirable, the vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs.

For example, if the specification would require to only allow valid UTF-8 characters, than this might cause issues with Go, where the string type can hold arbitrary bytes. If the specification would require to support arbitrary bytes, then this would cause problems in Rust, where the native string type is always UTF-8 encoded.

The current specification gives some freedom to languages, but gives clear rules of how to map language-specific string encodings to OTLP.

pyohannes commented 3 months ago

There is also this: https://github.com/open-telemetry/opentelemetry-specification/issues/780

pellared commented 3 months ago

The vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs

I think it would be good to explicitly describe it in the specification.

XSAM commented 3 months ago

From today's SIG meeting:

We won’t expose the bytes type in API before having discussions with vendors, as they might not ready to process bytes attributes from OTLP

Since we won't accept bytes outside of the log body, I think we could clarify we only allow valid UTF-8 in string. So our users can have a clear understanding of the behavior after passing bytes into a string value, instead of not knowing the result until configuring it out it would be dropped by the processing chain (like protobuf lib).

This change wouldn't cause issues in languages like Go, where the string type can hold arbitrary bytes, as the protobuf lib will drop the arbitrary bytes anyway.

I also think we could consider dropping this. This brings inconsistent features to languages, where some languages that allow the string type can hold arbitrary bytes can add bytes to OTLP (because it can map string to bytes), but others cannot do this because the native string type is always UTF-8 encoded.

https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/common/attribute-type-mapping.md?plain=1#L109-L113

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

jack-berg commented 3 months ago

Discussed in the 3/27/24 TC meeting. There are two points to consider:

Whether additional specification is needed around the encoding of attribute string value types such that attribute APIs wouldn't accept non-UTF-8 strings. We think the existing language is sufficient, and reflects the reality that languages are different in how they represent strings and encodings. Limiting to UTF-8 seems overly restrictive.
Whether the spec for encoding non-UTF-8 strings is what we want. On one hand, it is does describe what to do quite clearly. On the other hand, consumers don't have information to determine the encoding, so its not very useful. We discussed options to drop non-UTF-8 encoded strings instead of sending as bytes, or to update the language such that the string encoded is sent as well.

Personally, I'm in favor of dropping non-UTF-8 bytes for the reason @XSAM mentions:

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

XSAM commented 2 months ago

Created a PR for dropping non-UTF-8 string. https://github.com/open-telemetry/opentelemetry-specification/pull/3970

pellared commented 2 months ago

It is language-specific how the string is encoded. For instance .NET encodes string as UTF-16. It is not a an issue as OTLP exporters are encoding them to UTF-8.

jack-berg commented 2 months ago

This has already been discussed in several spec SIGs and more work is needed to decide what direction we should go in. Options have included:

Drop non-UTF-8 strings
Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding
Adjust encoding, so that there is additional metadata describing the encoding of non-UTF-8 strings

Need more info about the issue and options before accepting.

XSAM commented 2 months ago

Based on the comments from #3970, which was trying to drop non-UTF-8 converting and mandate Unicode in the string, dropping non-UTF-8 converting and mandating Unicode in the string are considered as breaking changes.

If dropping non-UTF-8 converting is not feasible, then it means we will have bytes type in attributes even if users originally used the string byte.

https://github.com/open-telemetry/opentelemetry-specification/blob/700f5138f4b10c6e9d11b41153f6e164c61f581d/specification/common/attribute-type-mapping.md?plain=1#L109-L113

For that, I suggest we could go for option 2, Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding, but we should also expose the bytes type in API (extending the list of supported primitives to include bytes type).

Reasons:

It provides uniformity: Every language can push bytes into OTLP instead of certain languages that allow arbitrary data in the language's native string type.
Extending the list of supported primitives to include the bytes type may not be a breaking change.
The reason for rejecting exposing bytes type in the API in a previous SIG meeting is our vendor might not ready to process bytes attributes from OTLP. But this is not valid if dropping non-UTF-8 converting is not feasible, it means we will have bytes in attributes in OTLP anyway regardless whether to expose the bytes type in the API.

jack-berg commented 2 months ago

Discussed in the 4/16/24 spec SIG.

In languages like go, strings are nothing but a byte array so the idea of being able to include encoding information doesn't really work because its unavailable. The benefit of the current spec'd approach of sending non-UTF-8 strings as byte values is that something is better than nothing: even if the encoding of the bytes isn't present, the information is still there and you can probably make sense of it with some a priori knowledge about the instrumentation which recorded it. Yet there are potential security considerations with letting raw unvalidated byte arrays escape. I won't try to articulate what that might look like but folks on the call expressed potential concerns.

A good compromise seems to be:

Keep the current behavior of encoding non-UTF-8 strings as byte value
Encourage SDKs to offer an opt-out, where non-UTF-8 strings are dropped

jmacd commented 2 months ago

Keep the current behavior of encoding non-UTF-8 strings as byte value

Can we also specify that the OpenTelemetry Collector have an opt-in to support passing through invalid UTF-8? (This is a technical challenge, because protobuf libraries.) That's the problem with the current behavior -- I don't care if invalid UTF-8 arrives because it's better than nothing. However, I need a Collector that will support passing me that data.

XSAM commented 2 months ago

I just want to provide more info here. Sorry if I didn't make this clear in the meeting, I think sending non-UTF-8 strings as byte is bonded with exposing bytes API.

If we provide sending non-UTF-8 strings as byte, like as a stable behavior that is not removable, it makes sense to me to provide the bytes type because they are basically the same feature (allow users to push bytes data). But, just providing sending non-UTF-8 strings as byte without providing bytes API just gives languages like Go a privilege to push bytes.

I think the point of having sending non-UTF-8 strings as byte is not because of a certain use case in which users want to push bytes; it is because we don't want to drop users' data. And, the bytes API is not tight to certain use cases; it allows us to have a design to provide uniform features (push bytes).

BTW, the potential security considerations are also challenging the sending non-UTF-8 strings as byte since we have a way to push bytes. However, I am not sure about the details of the potential security case, like what exactly the case is.

pauldraper commented 2 months ago

Are we talking about a conceptual model, a language API, or a network protocol?

I find this project constantly confuses these.

The answer is that it is a conceptual model (which is then translated into APIs in various languages, and is the basis for a network protocol.)

And the conceptual model is Unicode text (or Unicode character string). Each language has it own prescription or convention of representing Unicode. Java has a native Unicode string type (java.lang.String), as does JavaScript and C#. Go uses a UTF-8 encoded byte array. C++ uses a UTF-8 encoded byte sequence (std::string). C uses UTF-8 encoded NUL-terminated byte arrays (if U+0000 is a prohibited character).

A prescribed "encoding" is irrelevant to the spec here and should be omitted. (But should be included in the OTel Go API definition, the OTLP specification, places where converting Unicode characters into octets is relevant.) If we want to specify requirements on the Unicode text (e.g. invalid characters) or specify that language SDKs should validate inputs, that would be relevant.

Let's stop leaking implementation details everywhere.

jmacd commented 1 month ago

@pyohannes Would like to respond to https://github.com/open-telemetry/opentelemetry-specification/issues/3950#issuecomment-2020398102 which specifically uses Rust as an example. I have just been oncall this week and responded to an issue for telemetry data dropping because a Rust SDK submitted invalid utf8 strings to a collector, which was the first in the pipeline to validate utf8.

Please see my proposal here: https://github.com/open-telemetry/opentelemetry-specification/issues/3421#issuecomment-2101117579

jmacd commented 1 month ago

@XSAM I would like to hear your thoughts on https://github.com/open-telemetry/oteps/pull/257, thanks.

open-telemetry / opentelemetry-specification

Clarify the valid content in primitive type `string` #3950