open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.69k stars 884 forks source link

OTLP HTTP Server Response Code #3203

Open dineshg13 opened 1 year ago

dineshg13 commented 1 year ago

What are you trying to achieve? We are trying to setup a OTLP HTTP Server, it is not clear if responding with any code other than 200 is considered error. Every SDK is handling this differently.

  1. OTLP exporter n collector treats all codes in [200, 300) as success:

  2. Java otlp http exporter treats all codes in [200, 300) as success: https://github.com/open-telemetry/opentelemetry-java/blob/3d73283a71a2a6e1d6dd5e725098f697b1edae03/exporters/common/src/main/java/io/opentelemetry/exporter/internal/okhttp/OkHttpExporter.java#L120 .

    public boolean isSuccessful()
    Returns true if the code is in [200..300), which means the request was successfully received, understood, and accepted.
  3. Python http exporter treats 200 and 202 as success:

  4. Go HTTP Exporter treats only 200 status code as success.

What did you expect to see? I would expect the specification to say explicitly that any 2xx shouldn't be considered an error.

Additional context. See discussion on PR https://github.com/open-telemetry/opentelemetry-go/pull/3707

arminru commented 1 year ago

Hey @dineshg13!

The specification itself is quite clear on this:

https://github.com/open-telemetry/opentelemetry-specification/blob/84a5fa3d1ef42caae8f99f1b431504d7cdb9fffe/specification/protocol/otlp.md#L490 https://github.com/open-telemetry/opentelemetry-specification/blob/84a5fa3d1ef42caae8f99f1b431504d7cdb9fffe/specification/protocol/otlp.md#L502-L504

If your backend follows this spec and only returns 200 OK, it should be accepted by all compliant exporters.

jack-berg commented 1 year ago

The requirement for only accepting 200 as success was part of the original PR defining OTLP. I didn't see any comments that suggest why this is the case.

MSNev commented 1 year ago

All 2xx codes should be treated as successful, there is a scenario with browsers where the server response MUST return a 204 to avoid browsers from creating excessive connections (which can cause requests to block navigation and never get sent -- resulting in lost telemetry)

For clarification, during page unload when the browser is sending requests via the API's used for sending requests (sendBeacon() and fetch() with keep-alive (which are currently the only reliable API's to send requests during page unload)) IF the response returns a 200 this causes browser (chromium specifically) to have to establish a new connection as it ignores the body (because it is not expecting one) and it can't expect that the connection stream is valid, while when a 204 response is returned it can and does reuse the existing connection.

To fully "support" this scenario the sender (exporter) should pass a flag to inform the server that if everything is successful and it would have normally returned a 200 (with body) that it should return a 204 and no body.

tigrannajaryan commented 1 year ago

We have this in the spec:

All other HTTP responses that are not explicitly listed in this document should be treated according to HTTP specification.

So the spec says that the Client must be ready to receive other status codes. The spec does not say when the Server should use these codes.

We can try to refine the spec in a way that clarifies what happens when 2xx or 3xx is used. We must make this change in a way that is backward compatible (I am not sure how exactly it can be done).

dineshg13 commented 1 year ago

I agree with @tigrannajaryan . The spec needs to clarify how the client handles 2xx or 3xx. Considering only 200 as the non-error response seems overly restrictive.

Aneurysm9 commented 1 year ago

All other HTTP responses that are not explicitly listed in this document should be treated according to HTTP specification.

So the spec says that the Client must be ready to receive other status codes. The spec does not say when the Server should use these codes.

We can try to refine the spec in a way that clarifies what happens when 2xx or 3xx is used. We must make this change in a way that is backward compatible (I am not sure how exactly it can be done).

I think for the 2xx series the spec is fairly conclusive that 200 OK is the only status code a server should use. It is the only valid response for full success and for partial success. There are no other success scenarios that are not either full or partial success; everything else is some form of error or a redirect.

I think there's a case to be made for allowing 204 No Content in the situations @MSNev has described and where full success is to be reported. Not having a response body makes it impossible to use for partial success. That said, I'd offer my usual concern about over-indexing on the known-bad behavior of older browser implementations that may be a non-issue for much of the useful life of this specification.

For 202 Accepted I think we need to be very careful about introducing asynchronous behavior at a point at which synchronous behavior is expected. With a request that is accepted but not processed there is then no means for reporting partial success or other error states without also specifying callback mechanisms and increasing the surface area and complexity of the protocol and its implementations. Every client implementation will need to deal with a new set of scenarios and flows. Every server implementation will now have multiple different sets of expectations from clients with no clear way of signaling which is in use.

MSNev commented 1 year ago

@Aneurysm9 Just to clarify

I'd offer my usual concern about over-indexing on the known-bad behavior of older browser implementations that may be a non-issue for much of the useful life of this specification.

The requirement to return 204 is actually a newer browser implementation, not older known-bad behavior of older browser implementations, older browser do not have this issue because they don't enforce CORB requirements. And this issue occurs because client (browser) kills the connection when the CORB violation is caused by the server sending back an unwanted body.

And with (really) old browsers there was no sendBeacon() and until recently you could still use synchronous (blocking) XMLHttpRequest during page unload to ensure that the request was sent (it's been marked as deprecated for years), and now with newer browsers synchronous (blocking) XMLHttpRequest during page unload is blocked and you MUST use sendBeacon() to send the request.

Aneurysm9 commented 1 year ago

Thanks for the clarification @MSNev. I indeed did have my understanding of the landscape backwards. I think the quoted part of my comment can be removed and the rest stands.