open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.3k stars 1.42k forks source link

Ensure reliable data delivery in erroneous situations #7460

Open tigrannajaryan opened 1 year ago

tigrannajaryan commented 1 year ago

I have been looking at what the Collector does in erroneous situations, such as when the exporter's destination is temporarily unavailable and found a number of places in our codebase where we are probably not handling the error cases correctly, which may result in data losses when such data losses are preventable.

This is a summary task where I want to capture a list of things I think can be improved.

Receivers

Some (most?) receivers currently do not follow the Error Handling contract which says:

// If the error is non-Permanent then the nextConsumer.Consume*() call should be retried
// with the same data.

Here are open issues for some receivers:

If you discover more such receivers please add to the list above.

I would also like to add test helpers that allows us to easily verify the compliance of receivers to the contract. Here is a task to add such helper:

Another part of receiver contracts acknowledgement and checkpointing handling. This issue needs to be resolved to ensure that part of the contract is fulfilled:

Exporters

QueuedRetry exporter helper does not have requeuingEnabled=true when in-memory queue is used. This results in data being dropped after a few retries. We don't want this. We want to keep retrying, just like we do when persistent queue is enabled. Perhaps we make requeuingEnabled as user-visible option if we don't want to hard-code the requeuing behavior.

Here is the list of tasks to resolve:

Batching

batchprocessor currently drops the accumulated batch if the next consumer returns non-permanent error. This is not a correct behavior.

We have 2 possible solutions:

End-to-end Testing

There are many places in our codebase where incorrect implementations and bugs may cause loss of data, especially in stressed and erroneous situations. To ensure the Collector works correctly we want to add end-to-end integration tests that verify the operation of the Collector as a whole:

Without having such test it is hard to be confident that we do not drop the data somewhere due to a bug or due to a component being incorrectly implemented. This should include testing when memory limit is hit.

Here is a task to implement this:

Clarify the Design

Here is the list of issues to resolve:

tigrannajaryan commented 1 year ago

FYI @open-telemetry/collector-approvers

bgrouxupgrade commented 6 months ago

Re: Batch Exporter. Which path was decided on? If it's the quick fix is there a ticket I can follow?

mx-psi commented 2 weeks ago

I believe the only issue remaining that would be required for 1.0 is #4335. I added that one to the Collector v1 project and I am going to remove this meta-issue from the Collector v1 project.