Open XSAM opened 3 months ago
If we allow arbitrary bytes in the string, we need to support the bytes
type. Otherwise, the arbitrary bytes value won't be accessible when SDK reads the date from OTLP. (protobuf won't allow arbitrary bytes in string)
A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not
I'm not sure if that's desirable, the vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs.
For example, if the specification would require to only allow valid UTF-8 characters, than this might cause issues with Go, where the string type can hold arbitrary bytes. If the specification would require to support arbitrary bytes, then this would cause problems in Rust, where the native string type is always UTF-8 encoded.
The current specification gives some freedom to languages, but gives clear rules of how to map language-specific string encodings to OTLP.
There is also this: https://github.com/open-telemetry/opentelemetry-specification/issues/780
The vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs
I think it would be good to explicitly describe it in the specification.
From today's SIG meeting:
bytes
type in API before having discussions with vendors, as they might not ready to process bytes
attributes from OTLPSince we won't accept bytes
outside of the log body, I think we could clarify we only allow valid UTF-8 in string. So our users can have a clear understanding of the behavior after passing bytes into a string value, instead of not knowing the result until configuring it out it would be dropped by the processing chain (like protobuf lib).
This change wouldn't cause issues in languages like Go, where the string type can hold arbitrary bytes, as the protobuf lib will drop the arbitrary bytes anyway.
I also think we could consider dropping this. This brings inconsistent features to languages, where some languages that allow the string type can hold arbitrary bytes can add bytes to OTLP (because it can map string to bytes), but others cannot do this because the native string type is always UTF-8 encoded.
It also provides a less straightforward user experience, as an attribute's type can implicitly change from string
to bytes
in OTLP.
Discussed in the 3/27/24 TC meeting. There are two points to consider:
bytes
, or to update the language such that the string encoded is sent as well.Personally, I'm in favor of dropping non-UTF-8 bytes for the reason @XSAM mentions:
It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.
Created a PR for dropping non-UTF-8 string. https://github.com/open-telemetry/opentelemetry-specification/pull/3970
It is language-specific how the string is encoded. For instance .NET encodes string as UTF-16. It is not a an issue as OTLP exporters are encoding them to UTF-8.
This has already been discussed in several spec SIGs and more work is needed to decide what direction we should go in. Options have included:
Need more info about the issue and options before accepting.
Based on the comments from #3970, which was trying to drop non-UTF-8 converting and mandate Unicode in the string, dropping non-UTF-8 converting and mandating Unicode in the string are considered as breaking changes.
If dropping non-UTF-8 converting is not feasible, then it means we will have bytes
type in attributes even if users originally used the string
byte.
For that, I suggest we could go for option 2, Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding
, but we should also expose the bytes
type in API (extending the list of supported primitives to include bytes
type).
Reasons:
bytes
into OTLP instead of certain languages that allow arbitrary data in the language's native string type.bytes
type may not be a breaking change.
bytes
type in the API in a previous SIG meeting is our vendor might not ready to process bytes attributes from OTLP. But this is not valid if dropping non-UTF-8 converting is not feasible, it means we will have bytes
in attributes in OTLP anyway regardless whether to expose the bytes
type in the API.Discussed in the 4/16/24 spec SIG.
In languages like go, strings are nothing but a byte array so the idea of being able to include encoding information doesn't really work because its unavailable. The benefit of the current spec'd approach of sending non-UTF-8 strings as byte values is that something is better than nothing: even if the encoding of the bytes isn't present, the information is still there and you can probably make sense of it with some a priori knowledge about the instrumentation which recorded it. Yet there are potential security considerations with letting raw unvalidated byte arrays escape. I won't try to articulate what that might look like but folks on the call expressed potential concerns.
A good compromise seems to be:
Keep the current behavior of encoding non-UTF-8 strings as byte value
Can we also specify that the OpenTelemetry Collector have an opt-in to support passing through invalid UTF-8? (This is a technical challenge, because protobuf libraries.) That's the problem with the current behavior -- I don't care if invalid UTF-8 arrives because it's better than nothing. However, I need a Collector that will support passing me that data.
I just want to provide more info here. Sorry if I didn't make this clear in the meeting, I think sending non-UTF-8 strings as byte
is bonded with exposing bytes API
.
If we provide sending non-UTF-8 strings as byte
, like as a stable behavior that is not removable, it makes sense to me to provide the bytes
type because they are basically the same feature (allow users to push bytes
data). But, just providing sending non-UTF-8 strings as byte
without providing bytes
API just gives languages like Go a privilege to push bytes
.
I think the point of having sending non-UTF-8 strings as byte
is not because of a certain use case in which users want to push bytes
; it is because we don't want to drop users' data. And, the bytes
API is not tight to certain use cases; it allows us to have a design to provide uniform features (push bytes).
BTW, the potential security considerations are also challenging the sending non-UTF-8 strings as byte
since we have a way to push bytes
. However, I am not sure about the details of the potential security case, like what exactly the case is.
Are we talking about a conceptual model, a language API, or a network protocol?
I find this project constantly confuses these.
The answer is that it is a conceptual model (which is then translated into APIs in various languages, and is the basis for a network protocol.)
And the conceptual model is Unicode text (or Unicode character string). Each language has it own prescription or convention of representing Unicode. Java has a native Unicode string type (java.lang.String), as does JavaScript and C#. Go uses a UTF-8 encoded byte array. C++ uses a UTF-8 encoded byte sequence (std::string). C uses UTF-8 encoded NUL-terminated byte arrays (if U+0000 is a prohibited character).
A prescribed "encoding" is irrelevant to the spec here and should be omitted. (But should be included in the OTel Go API definition, the OTLP specification, places where converting Unicode characters into octets is relevant.) If we want to specify requirements on the Unicode text (e.g. invalid characters) or specify that language SDKs should validate inputs, that would be relevant.
Let's stop leaking implementation details everywhere.
@pyohannes Would like to respond to https://github.com/open-telemetry/opentelemetry-specification/issues/3950#issuecomment-2020398102 which specifically uses Rust as an example. I have just been oncall this week and responded to an issue for telemetry data dropping because a Rust SDK submitted invalid utf8 strings to a collector, which was the first in the pipeline to validate utf8.
Please see my proposal here: https://github.com/open-telemetry/opentelemetry-specification/issues/3421#issuecomment-2101117579
@XSAM I would like to hear your thoughts on https://github.com/open-telemetry/oteps/pull/257, thanks.
What are you trying to achieve?
A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not, for primitive type
string
of attribute on the common page.Additional context.
Baggage value only allows UTF-8 string. https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/baggage/api.md?plain=1#L48
Attribute type mapping defines a way to handle non-UTF-8 in a string, but its doc status is Experimental. https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/common/attribute-type-mapping.md?plain=1#L109-L113
Log data model defines the string represents as a Unicode string (UTF-8). https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/logs/data-model-appendix.md?plain=1#L492
There is no clear guideline to determine whether a string type allows arbitrary bytes or not.
Related: https://github.com/open-telemetry/opentelemetry-go/pull/5089#pullrequestreview-1950213103