open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
1.97k stars 812 forks source link

Returning the http response status code and error message from HttpExporter.export( ) #6306

Closed galahad098 closed 3 days ago

galahad098 commented 6 months ago

Is your feature request related to a problem? Please describe. No

Describe the solution you'd like In my implementation of the OtlpHttpSpanExporter, I want to access the http response status code and error message in case the export fails. Internally, the HttpExporter.export( ) method returns a CompletableResultCode response indicating successful or failed export and logs the status code and error message. I would like to know if it's possible to return this information from the exporter.

Describe alternatives you've considered An alternative would be to write a custom exporter which returns all the required info.

jack-berg commented 6 months ago

The main problem is that the export API is stable, so we would need to evolve the generic CompletableResultCode class to be able to hold supplementary contextual information about the result (success or failure). Its possible, and is something I've thought about before, but it doesn't seem worthwhile when you consider that exporters are almost always paired with batch processor or periodic metric reader which wouldn't do anything with this additional info.

Consuming this info would require a custom processor or metric reader. So I'm curious to learn more about your use case. How are you using the OtlpHttpSpanExporter? What kind of info would you expect if there was no http status code (i.e. it was a connect or timeout error), or if it failed after retrying several times? What would you do with this info if it were available?

galahad098 commented 6 months ago

So we basically have an otel extension wherein custom span processors write to a queue which is periodically flushed via the exporter. The spans are exported to a collector service which routes the data to another consuming service and returns status codes for 3 scenarios : complete success (all spans routed successfully), complete failure and partial success. In case of partial success, we only want to retry for the failed messages in the exported batch. Thus, it would be helpful to consume this additional status code info on top of the success/failure of the CompleteableResultCode.

galahad098 commented 6 months ago

Hi @jack-berg , Do you think it would be possible to undertake the enhancement of CompletableResultCode class? Being able to read the response would be really useful for us.

jack-berg commented 5 months ago

Check out #6348, which lays the groundwork for this by evolving the CompletableResultCode API so that exception details can be included.

galahad098 commented 5 months ago

Thanks for taking this up.. So the exception thrown in case of failure will now contain the error details and the HTTP response status code? Is that correct? is it possible to return the response body as well?

jack-berg commented 5 months ago

Thanks for taking this up.. So the exception thrown in case of failure will now contain the error details and the HTTP response status code? Is that correct? is it possible to return the response body as well?

Additional PR(s) are needed for exporters to include exception in CompletableResultCode, and to decide how to represent what types of information is included in the exception. We don't do anything at all with the response body today, so including it will present an additional challenge.

galahad098 commented 5 months ago

Okay. I'll go through the guidelines for contribution and please let me know if I could pitch in anywhere. Thanks

trask commented 5 months ago

In case of partial success, we only want to retry for the failed messages in the exported batch. Thus, it would be helpful to consume this additional status code info on top of the success/failure of the CompleteableResultCode.

@jack-berg is this something that could make sense to implement directly in the OTLP exporter?

Or if we can't change the default retry behavior, maybe a hook that allows people to override the default retry behavior?

jack-berg commented 5 months ago

This is quite complicated to implement at the moment, regardless of where:

I think the reality is that as of today, partial success responses are not designed for this type of partial retry scenario. You can see in the protobuf comments that its intended to help a developer make config changes to address the problem, not for an exporter to interpret and automatically perform some a retry routine:

A developer-facing human-readable message in English. It should be used // either to explain why the server rejected parts of the data during a partial // success or to convey warnings/suggestions during a full success. The message // should offer guidance on how users can address such issues.

maybe a hook that allows people to override the default retry behavior?

I could be open to this, but doing so is complicated so would want to see evidence that several people are running into issues. And still, it might be more appropriate to solve this by: 1. Passing failure details back via CompletableResultCode. 2. Disabling retry. 3. Pairing the exporter with a custom processor which interprets the error from CompletableResultCode and implements retry at the processor level.

message ExportTraceServiceResponse {
  // The details of a partially successful export request.
  //
  // If the request is only partially accepted
  // (i.e. when the server accepts only parts of the data and rejects the rest)
  // the server MUST initialize the `partial_success` field and MUST
  // set the `rejected_<signal>` with the number of items it rejected.
  //
  // Servers MAY also make use of the `partial_success` field to convey
  // warnings/suggestions to senders even when the request was fully accepted.
  // In such cases, the `rejected_<signal>` MUST have a value of `0` and
  // the `error_message` MUST be non-empty.
  //
  // A `partial_success` message with an empty value (rejected_<signal> = 0 and
  // `error_message` = "") is equivalent to it not being set/present. Senders
  // SHOULD interpret it the same way as in the full success case.
  ExportTracePartialSuccess partial_success = 1;
}

message ExportTracePartialSuccess {
  // The number of rejected spans.
  //
  // A `rejected_<signal>` field holding a `0` value indicates that the
  // request was fully accepted.
  int64 rejected_spans = 1;

  // A developer-facing human-readable message in English. It should be used
  // either to explain why the server rejected parts of the data during a partial
  // success or to convey warnings/suggestions during a full success. The message
  // should offer guidance on how users can address such issues.
  //
  // error_message is an optional field. An error_message with an empty value
  // is equivalent to it not being set.
  string error_message = 2;
}
jarrodrobins commented 4 months ago

Thanks for taking a look into this @jack-berg - we also would like to be able to get the failure details out of CompletableResultCode. My use case sounds a little less complex in that we're not looking into partial failures but do want to be able to respond to GRPC status codes, which appear to already be available looking at GrpcExporter.

For us, we want to handle some of the codes listed here (some of which are referenced in GrpcExporterUtil (though not INVALID_ARGUMENT and ALREADY_EXISTS, which we want). If we could somehow bubble up the GRPC status code out of GrpcExporter in CompletableResultCode that would solve at least my use case.

Right now we can only determine if an export failed, but not why. We want to be able to read this in our custom exporter and discard spans (and send a custom metric) if they've already been uploaded in the event that ALREADY_EXISTS is returned so we don't keep retrying to upload them.

Your PR gets us a decent way there - I'd be happy to assist in updating GrpcExporter and HttpExporter if needed.

SmithaMJ commented 3 weeks ago

Thank you @jack-berg and @jarrodrobins for the status code changes. You have saved a lot of our effort :)

I would like to know approximately by when can we expect a release to test?

Regards Smitha

jack-berg commented 3 weeks ago

You can test against snapshots of the main branch now. We publish releases the friday after the first monday of the month (i.e. 9/6/24).