open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Clarify that API users SHOULD use valid UTF-8 for exportable data #3421

Open jmacd opened 2 years ago

jmacd commented 2 years ago

In https://github.com/open-telemetry/opentelemetry-go/issues/3021, a user reports having spans fail to export because of invalid UTF-8.

I believe the SDK should pass this data through to the receiver, regardless of UTF-8 correctness. I am not sure whether gRPC and the Google protobuf library used by each SDK offer an option to bypass UTF-8 checking, but if so, that would be one path forward if we agree.

Another option is to maintain (e.g., via sed) copies of the protocol that replace `string` with `bytes`; SDKs could use these definitions instead, also bypassing UTF-8 checking.
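For context, this swap is wire-compatible: protobuf encodes `string` and `bytes` fields identically (wire type 2, length-delimited), so only the declared field type triggers UTF-8 checking. A hand-rolled sketch of that shared encoding (the field number and helper here are illustrative, not from any OTLP definition):

```go
package main

import "fmt"

// encodeLenDelimited builds the wire encoding of field number 1 with
// wire type 2 (length-delimited), which string and bytes fields share.
func encodeLenDelimited(payload []byte) []byte {
	// tag = (field_number << 3) | wire_type = (1 << 3) | 2 = 0x0A
	out := []byte{0x0A, byte(len(payload))} // payload < 128 bytes: single-byte varint length
	return append(out, payload...)
}

func main() {
	valid := encodeLenDelimited([]byte("hi"))
	invalid := encodeLenDelimited([]byte{0xff, 0xfe}) // not valid UTF-8
	// Both are well-formed protobuf messages on the wire; only a UTF-8
	// check applied to a string-typed field distinguishes them.
	fmt.Printf("% x\n% x\n", valid, invalid)
}
```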

tigrannajaryan commented 2 years ago

Strings are UTF-8 in Protobuf:

string := valid UTF-8 string, or sequence of 7-bit ASCII bytes; max 2GB
As described, a string must use UTF-8 character encoding. A string cannot exceed 2GB.
tigrannajaryan commented 2 years ago

So, bypassing SDK checks is not enough. We will be non-compliant with the Protobuf spec, and it may result in recipients failing to decode the data. Given that this is part of the Stable spec, I don't think we can change `string` to `bytes` anymore.

jmacd commented 2 years ago

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?

tigrannajaryan commented 2 years ago

Could we agree on a transformation to turn invalid UTF-8 into valid UTF-8? Should we reject those strings?

Yes, I think these are valid options to consider for the SDK functionality.
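As a sketch of the "reject" option in Go (the `checkAttr` hook is hypothetical; only `utf8.ValidString` is real stdlib):

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

var errInvalidUTF8 = errors.New("attribute contains invalid UTF-8")

// checkAttr is a hypothetical pre-export validation hook: reject the
// string rather than emit protobuf that strict decoders may refuse.
func checkAttr(s string) error {
	if !utf8.ValidString(s) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	fmt.Println(checkAttr("café"))               // <nil>
	fmt.Println(checkAttr(string([]byte{0xff}))) // attribute contains invalid UTF-8
}
```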

tsloughter commented 2 years ago

I believe there is already an issue for this elsewhere in the spec.

For this case, in the Go issue I argued we should limit strings not by size but by grapheme, so that they don't get truncated into invalid UTF-8 (or have the meaning of the UTF-8 string changed).

tsloughter commented 2 years ago

I forgot: the Erlang SDK actually doesn't truncate into invalid UTF-8, because we treat the length limit as the number of graphemes: https://github.com/open-telemetry/opentelemetry-erlang/blob/main/apps/opentelemetry/src/otel_attributes.erl#L105

The issue is that the spec says "character" and not "grapheme": https://github.com/open-telemetry/opentelemetry-specification/blob/ca593258953811eb33197976b8e565299b874853/specification/common/README.md#attribute-limits

jmacd commented 2 years ago

I believe there's a question of what we actually want to have, here.

The term "character" is reasonably well defined in relation to UTF-8, but it's not the same as "grapheme", I think?

Since protobuf (with proto3 syntax) is very strict about UTF-8 and we intend for OTLP/proto to be the standard, we are quite constrained. The Golang protobuf implementation rejects such data in both Marshal() and Unmarshal(), but some implementations do not. Without correcting this problem at the protocol level, we are constrained to one or the other of:

1. validate UTF-8 in the SDK and force-correct any invalid strings before export, or
2. skip validation and expect errors from strict protobuf implementations downstream.

I am opposed to both of these options. The validate-and-force-correction path is simply expensive, especially when the SDK actually repairs invalid data. The expect-errors path is much worse, because generally the entire batch of data is rejected with little clue as to what the actual error was (note the error message in https://github.com/open-telemetry/opentelemetry-go/issues/3021 does not include detail).

Therefore, what I actually want is for SDKs to allow invalid UTF-8. While users are instructed they SHOULD NOT use invalid UTF-8, SDKs will be specified such that they SHOULD NOT themselves consider invalid UTF-8 to be an error. As mentioned in my earlier comment, I believe this can be accomplished by maintaining two variant .proto definitions, one where every string field is a wire-compatible bytes field instead.

How this solves the problem:

- Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8.

Consumers then can accept data with invalid UTF-8, which I think is ultimately what we all want. I find myself wanting to extend the library guidelines on performance to address this situation. Where we currently write:

Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible.

I want to add a statement like so:

Although users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API, the use of invalid UTF-8 SHOULD NOT degrade the observability of telemetry data. SDKs SHOULD permit invalid UTF-8 to be exported without validating UTF-8, since that would degrade performance unnecessarily.

In addition, I would then add that SDKs SHOULD treat attribute "character limits" as though the character set were US-ASCII, so that (as a performance concern) truncation logic need not parse UTF-8 code points or identify grapheme clusters.

tigrannajaryan commented 2 years ago

Processing pipelines will be advised to use the bytes-form of OTLP so that they can process data with invalid UTF-8

Isn't every recipient a processing pipeline from this perspective? This would mean a breaking change for every backend that supports receiving OTLP today.

I want to suggest an alternate statement:

Users SHOULD use valid UTF-8 strings everywhere in the OpenTelemetry API. The use of invalid UTF-8 results in undefined behavior.

So, dear user, you have been warned. Don't do it, if you do and your backend crashes, that's between you and your vendor.

This statement allows you to make choices in the SDKs. If you want to pay the cost and validate it you can do it. If you don't care and want to pass the problem to the recipients you can also do it.

tsloughter commented 2 years ago

"SHOULD" is a good compromise (I'd be on the side of requiring valid UTF-8), but I'd argue we should also state, for users, that valid UTF-8 will not be broken.

jsuereth commented 2 years ago

We already specify that any non-valid UTF-8 string MUST be encoded as a bytes_value.

I'm against the chaos that would ensue with forcing every string to be raw bytes with no notion of encoding. I think the current specification around bytes_value (with no encoding flag) is already a bit rough.

tigrannajaryan commented 2 years ago

We now have open issues in the spec to handle this in the API/SDK.

Can we close this issue?

tigrannajaryan commented 1 year ago

Closing. I assume nothing is needed in this repo.

jmacd commented 1 year ago

We now have open issues in the spec to handle this in the API/SDK.

Could you link those spec issues, for readers following this issue?

tigrannajaryan commented 1 year ago

Re-opening, can't find the issue I was referring to. :-(

tigrannajaryan commented 1 year ago

Moving this to spec repo since there is nothing to be done on the proto.

tigrannajaryan commented 1 year ago

(image: screenshot of a bot comment)

Hey, bot, you can't be serious. This is the gratitude I get for cleaning up the house?

jmacd commented 5 months ago

I will elevate this topic again in the next Specification SIG meeting. Recently I observed the following interaction.

A Rust SDK emitted a string-valued attribute containing invalid UTF-8. This passed through to an OTel Collector, whose OTLP receiver parsed it as valid OTLP data. Note that gRPC receivers do not validate UTF-8; they assume the client does this.

The OTel Collector passed this data to a vendor-internal exporter, which is based on gRPC. That exporter tried to send the same data further into the backend, but it failed because of the invalid UTF-8. It could be worse: if the collector has a batching step and the invalid UTF-8 data is combined with other valid data, the whole request fails. It has become very easy to lose telemetry because some participants require valid UTF-8 while others do not.

What is the ideal outcome?

I think it would be unpopular to add a requirement that SDKs MUST validate UTF-8, because it is an unnecessary expense. At the same time, valid UTF-8 is stated as a requirement and is enforced by Google protobuf libraries. The compromise I want us to reach is this: where there is known UTF-8 enforcement, there MUST be an effort to prevent telemetry data from being dropped.

  1. OTel SDKs are obligated to repair the problem where there is known enforcement. As an example, if you are using the Go gRPC library, which requires valid utf8, then you must fix validation errors before passing data to the exporter. As a counter-example, if you are using the Rust tonic library (which does not enforce valid utf8), then you are not obligated to validate utf8.
  2. API users are obligated to use non-string attribute formats where the encoding is not utf8. They may expect errors from backends that expect valid utf8 if they fail to do this correctly.
  3. OpenTelemetry collector will be obligated to provide efficient means of enforcing valid utf8. This would likely be done with the help of code generation in the pdata framework, to automatically scan all strings before constructing invalid protobuf data, and through the component capabilities framework. Components either are or are not willing to accept invalid utf8 data, and if they aren't the collector is expected to automatically repair the problem. This could be accomplished through automatic insertion of a sanitizer component in every pipeline as early as possible.
  4. Generally, OTLP receivers should return PartialSuccess responses to warn senders when they have sent invalid utf8. If the data was not automatically corrected, then the rejected count associated with the partial success should reflect the number of items that were not accepted due to invalid utf8.
  5. Generally, an automatic correction step SHOULD replace any contiguous sub-string of invalid UTF-8 bytes with a configurable string. The default shall be "�" (U+FFFD).
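For what it's worth, point 5 matches the behavior of Go's `strings.ToValidUTF8`, which replaces each contiguous run of invalid bytes with a single copy of the replacement string. A minimal sanitizer sketch (the `sanitize` wrapper and its fast path are illustrative):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize replaces each contiguous run of invalid UTF-8 bytes with the
// replacement string (default U+FFFD), as proposed in point 5 above.
func sanitize(s, replacement string) string {
	if utf8.ValidString(s) {
		return s // fast path: no allocation for already-valid input
	}
	return strings.ToValidUTF8(s, replacement)
}

func main() {
	bad := "abc" + string([]byte{0xff, 0xfe}) + "def"
	fmt.Println(sanitize(bad, "\uFFFD")) // abc�def (one replacement per run)
}
```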
pellared commented 5 months ago

Possibly related issue: https://github.com/open-telemetry/opentelemetry-specification/issues/3950

CC @XSAM

jmacd commented 5 months ago

I do see the connection with #3950, which appears equally unresolved. I am surprised to hear about vendors that accept OTLP data and are not prepared to accept byte slice data, which is already a part of the OTLP data model.

I will move forward my proposal above, which would resolve #3950 by requiring strings to be valid UTF-8, AND requiring SDKs to support byte slice data, AND requiring SDKs and collectors to enforce validity where there is known enforcement.

jmacd commented 5 months ago

To help focus this conversation: the specification does have language to discourage API users from entering raw byte-valued attributes.

https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/common#attribute

I believe that (a) we should require strings to be valid utf8, (b) we should add "byte slice" as a primitive value, with no expectation for utf8 validity. We may use either this issue or #3950 to discuss.

pellared commented 5 months ago

requiring strings to be valid utf8

Just a note: I think this is only essential for OTLP and not for the API. For instance, in Java and .NET, string is UTF-16 encoded and I do not see any problem with it. Maybe we should say that it must be possible to convert the string to UTF-8 encoding?

https://github.com/open-telemetry/opentelemetry-specification/blob/7145a5d2ae302070033c2088a2416812edb7f878/specification/common/attribute-type-mapping.md?plain=1#L104-L113

I think this should be rephrased to something like

If possible, string values MUST be converted to AnyValue's string_value field.

Otherwise, string values MUST be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.

jmacd commented 5 months ago

@pellared The document you link is restricted to attribute values, and my issue here is about all the other strings too. Sure, other language runtimes have an expectation of UTF-16; and what should a .NET or Java library do if, by some unsafe mechanism, it produces an invalid UTF-16 string? OpenTelemetry's error-handling guidelines say they can't produce an error.

I don't know how easy it is to produce invalid UTF-16 at runtime in Java or .NET, but it's fairly easy in Go and Rust (using unsafe) to produce invalid UTF-8. I have put together a position statement in the draft OTEP here: https://github.com/open-telemetry/oteps/pull/257