open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

loadbalancing exporter blocking the process of consume traces when loadbalancer backends change #8843

Open ralphgj opened 2 years ago

ralphgj commented 2 years ago

Describe the bug
Hi, we found that the loadbalancing exporter sometimes blocks all requests that upload trace data. We added some logs and found that shutting down the endpoint exporters that are being removed takes a long time.

In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L121,

    if !newRing.equal(lb.ring) {
        lb.updateLock.Lock()
        defer lb.updateLock.Unlock()

        lb.ring = newRing

        // TODO: set a timeout?
        ctx := context.Background()

        // add the missing exporters first
        lb.addMissingExporters(ctx, resolved)
        lb.removeExtraExporters(ctx, resolved)
    }

When the load balancer backends changed, lb.removeExtraExporters sometimes took more than 10s, which blocked trace consumption in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/trace_exporter.go#L97. The cause seems to be that onBackendChanges holds the updateLock for the whole time.
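For illustration, here is a minimal standalone sketch of the contention pattern (simplified, not the actual collector code): the consume path takes the read lock for every batch, so while the backend-change handler holds the write lock during a slow exporter shutdown, every consume call stalls.

package main

import (
    "fmt"
    "sync"
    "time"
)

type loadBalancer struct {
    updateLock sync.RWMutex
    endpoints  []string
}

// Endpoint is called on the consume path for every batch and needs the read lock.
func (lb *loadBalancer) Endpoint(traceID []byte) string {
    lb.updateLock.RLock()
    defer lb.updateLock.RUnlock()
    return lb.endpoints[int(traceID[0])%len(lb.endpoints)]
}

// onBackendChanges holds the write lock while removed exporters shut down.
// If a shutdown drains a large sending queue (10s+), every Endpoint call blocks meanwhile.
func (lb *loadBalancer) onBackendChanges(resolved []string) {
    lb.updateLock.Lock()
    defer lb.updateLock.Unlock()

    lb.endpoints = resolved
    time.Sleep(10 * time.Second) // stands in for a slow exporter.Shutdown(ctx)
}

func main() {
    lb := &loadBalancer{endpoints: []string{"a:4317", "b:4317"}}
    go lb.onBackendChanges([]string{"a:4317"})

    time.Sleep(100 * time.Millisecond)
    start := time.Now()
    fmt.Println(lb.Endpoint([]byte{1, 2, 3, 4})) // blocks ~10s behind the write lock
    fmt.Println("consume path was stalled for", time.Since(start))
}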

Can we make the shutdown process async like this? In https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/loadbalancingexporter/loadbalancer.go#L152,

func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
    for existing := range lb.exporters {
        if !endpointFound(existing, endpoints) {
            go lb.exporters[existing].Shutdown(ctx)
            delete(lb.exporters, existing)
        }
    }
}
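A variation on that idea (a sketch only, assuming a hypothetical shutdownWg sync.WaitGroup field on loadBalancerImp, which does not exist today) would still track the asynchronous shutdowns, so the load balancer's own Shutdown can wait for them instead of leaking goroutines:

// Sketch only: shutdownWg is a hypothetical sync.WaitGroup field on loadBalancerImp.
func (lb *loadBalancerImp) removeExtraExporters(ctx context.Context, endpoints []string) {
    for existing := range lb.exporters {
        if !endpointFound(existing, endpoints) {
            exp := lb.exporters[existing]
            delete(lb.exporters, existing)

            lb.shutdownWg.Add(1)
            go func() {
                defer lb.shutdownWg.Done()
                // Shutdown may drain a long sending queue; running it here keeps
                // the updateLock from blocking the consume path for that long.
                _ = exp.Shutdown(ctx)
            }()
        }
    }
}

// The load balancer's own Shutdown would then wait for in-flight exporter shutdowns,
// preserving "callers of Shutdown must wait" at the pipeline level.
func (lb *loadBalancerImp) Shutdown(ctx context.Context) error {
    lb.shutdownWg.Wait()
    return nil
}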

What version did you use?
Version: 0.42.0

What config did you use?
Config:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4318"

exporters:
  logging:
    logLevel: debug
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        retry_on_failure:
          enabled: true
          initial_interval: 5s
          max_interval: 20s
          max_elapsed_time: 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 1000000
    resolver:
      dns:
        hostname: opentelemetry-collector-sampling.otel.svc.cluster.local
        port: 4317

extensions:
  health_check:
    endpoint: "localhost:13134"

service:
  extensions: [ health_check ]
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ loadbalancing ]

Environment
docker image: otel/opentelemetry-collector-contrib:0.42.0


jpkrohling commented 2 years ago

Good catch!

@bogdandrutu, @codeboten, @tigrannajaryan, do you think calling an exporter's shutdown on a separate Go routine would be in line with the function semantics?

tigrannajaryan commented 2 years ago

Side comment: it is quite confusing that an exporter manages other exporters. We need to improve our pipelines so that this can be a processor.

As for the specifics of the shutdown calls/implementation: I believe Shutdown() itself is implemented incorrectly. It sets a flag and exits, whereas the Shutdown contract says:

If there are any background operations running by the component they must be aborted before
this function returns.
...
If there are any buffers in the
component, they should be cleared and the data sent immediately to the next component.

So, I think the Shutdown itself needs to be fixed first of all.
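For illustration only, here is a minimal generic sketch of a component whose Shutdown follows that contract (not the actual OTLP exporter code): cancel background work and block until the workers confirm they have stopped, instead of just setting a flag.

package sketch

import (
    "context"
    "sync"
    "time"
)

// worker is a stand-in for a component with a background goroutine.
type worker struct {
    cancel context.CancelFunc
    wg     sync.WaitGroup
}

func (w *worker) Start(ctx context.Context) error {
    ctx, w.cancel = context.WithCancel(ctx)
    w.wg.Add(1)
    go func() {
        defer w.wg.Done()
        for {
            select {
            case <-ctx.Done():
                return // abort the background loop on shutdown
            case <-time.After(time.Second):
                // ... periodic work: flush buffers downstream, etc.
            }
        }
    }()
    return nil
}

func (w *worker) Shutdown(ctx context.Context) error {
    w.cancel()  // abort background operations
    w.wg.Wait() // do not return until they have actually stopped
    return nil
}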

do you think calling an exporter's shutdown on a separate Go routine would be in line with the function semantics?

If the above contract about Shutdown's behavior is fulfilled, I think it does not matter whether Shutdown is called on a different goroutine. Shutdown is always called from a different goroutine by the Service, so it is not a unique situation. That the caller of Shutdown does not wait until shutdown is complete is a problem, though: it breaks pipeline shutdown conceptually.

So, in a nutshell:

jpkrohling commented 2 years ago

Agreed on all your points, especially about the load balancing being an exporter that has direct access to other components. Once the collector has the pieces in place that allow for a migration, I'll migrate this component.

Callers of Shutdown must wait until it completes before returning

Thanks, I think this clarifies it.

@ralphgj, you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster. I suspect the backend is offline, which is causing the retry to kick in, eventually timing out.
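For example (a sketch only; the right values depend on how long the sampling collector is typically unreachable during a redeploy), shortening the retry window and the sending queue on the inner OTLP exporter should let a removed endpoint's exporter shut down much faster:

exporters:
  loadbalancing:
    protocol:
      otlp:
        tls:
          insecure: true
        timeout: 5s
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 5s
          max_elapsed_time: 10s   # give up quickly instead of retrying for up to 60s
        sending_queue:
          enabled: true
          num_consumers: 20
          queue_size: 10000       # a smaller queue drains faster during shutdown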

ralphgj commented 2 years ago

Thanks all! @jpkrohling, your suspicion is correct. The problem described above happened when we redeployed the sampling collector (trace data -> load balancing collector -> sampling collector).

you can probably tweak the retry logic for the inner OTLP exporter, so that it fails faster.

The retry logic seems intended for the situation where the sampling collector can't be reached, so tweaking the retry settings may not suit this problem.

According to the comments in the OTLP exporter, the shutdown process drains the queue and calls the retry sender (which is already stopped). When the queue is long, shutdown takes correspondingly longer.

func (qrs *queuedRetrySender) shutdown() {
    // Cleanup queue metrics reporting
    if qrs.cfg.Enabled {
        _ = globalInstruments.queueSize.UpsertEntry(func() int64 {
            return int64(0)
        }, metricdata.NewLabelValue(qrs.fullName))
    }

    // First Stop the retry goroutines, so that unblocks the queue numWorkers.
    close(qrs.retryStopCh)

    // Stop the queued sender, this will drain the queue and will call the retry (which is stopped) that will only
    // try once every request.
    if qrs.queue != nil {
        qrs.queue.Stop()
    }
}

Could you give us some other suggestions?

ralphgj commented 2 years ago

Another question: right now the queue seems to live inside the per-endpoint exporters, and consuming trace data first requires selecting an endpoint, so the consume process can be blocked by endpoint selection. Why not put the trace data into a queue during consumption, and have the load balancer drain the queue and send it to the remote backends?

jpkrohling commented 2 years ago

The queued retry component does not know about the load balancer, nor does the load balancer know about the queued retry. Once the load balancer sends the data to the exporter, it only gets control back upon a failure or success from the underlying exporter, which will only return once the retries have been exhausted.

One option would be to have the queued retry used by the load balancer itself in some form, so that people would not need it in the final exporters, but I need to think about the implications of that.
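For what it's worth, here is a rough standalone sketch of that idea (purely illustrative; routeBatch and these types are hypothetical, not existing collector APIs): a bounded queue in front of endpoint selection lets the consume call return immediately, while worker goroutines pick the endpoint and export.

package sketch

import (
    "errors"
    "sync"
)

type batch struct {
    traceID []byte
    payload []byte
}

// queuedBalancer buffers batches ahead of endpoint selection.
type queuedBalancer struct {
    batches chan batch
    wg      sync.WaitGroup
}

func newQueuedBalancer(workers, capacity int, routeBatch func(batch)) *queuedBalancer {
    qb := &queuedBalancer{batches: make(chan batch, capacity)}
    for i := 0; i < workers; i++ {
        qb.wg.Add(1)
        go func() {
            defer qb.wg.Done()
            for b := range qb.batches {
                routeBatch(b) // endpoint selection + export happen here, off the consume path
            }
        }()
    }
    return qb
}

// consume is the ConsumeTraces-style entry point: enqueue and return without
// waiting on endpoint selection or the downstream exporter.
func (qb *queuedBalancer) consume(b batch) error {
    select {
    case qb.batches <- b:
        return nil
    default:
        return errors.New("queue full") // backpressure instead of blocking
    }
}

func (qb *queuedBalancer) shutdown() {
    close(qb.batches) // let workers drain the remaining batches, then exit
    qb.wg.Wait()
}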

jpkrohling commented 2 years ago

I won't be able to look at this one in detail for the next two months. I'm therefore unassigning this. This is now up for grabs. If nothing happens in two months, I might be able to take a look again.

dgoscn commented 2 years ago

@jpkrohling can I take this one?

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 1 year ago

@dgoscn, are you still available to work on this one?

dgoscn commented 1 year ago

@jpkrohling I will have to leave this one for someone else. If no one else picks it up, I want to come back and work on this as soon as possible.