open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
2.84k stars 2.23k forks source link

[exporter/elasticsearchexporter] Push failures not reported in metrics #32302

Open jleloup opened 4 months ago

jleloup commented 4 months ago

Component(s)

exporter/elasticsearch

What happened?

Description

I think the Elasticsearch exporter does not report failures properly in the otelcol_exporter_send_failed_log_records metrics.

I am testing the behaviour of the elasticsearch exporter in the context of various failures. Eg.:

During these tests I can see error being reported in logs, eg for a mapping conflict: see log output for an example.

Though I can't see such failures on the otelcol_exporter_send_failed_log_records:

Screenshot 2024-04-10 at 20 24 55

I saw the very same behavior with another type of error: missing Elasticsearch permissions to create indices, which prevent the exporter from creating a new index when pushing if said index does not exists.

Steps to Reproduce

Trigger an Elasticsearch push failure:

Expected Result

otelcol_exporter_send_failed_log_records counter is increased according to the failed log records.

Actual Result

otelcol_exporter_send_failed_log_records does not increase.

Collector version

v0.97.0

Environment information

Environment

Kubernetes 1.28

OpenTelemetry Collector configuration

exporters:
  elasticsearch/logs:
    endpoints: [--redacted--]
    user: --redacted--
    password: --redacted--

    logs_index: --redacted--
    logs_dynamic_index:
      enabled: true

    mapping:
      mode: none
      dedup: true
      dedot: true

    sending_queue:
      enabled: true
      num_consumers: 20
      queue_size: 1000

pipelines:
  logs:
    receivers: [--redacted--]
    processors:
      - memory_limiter
      - resourcedetection
      - resource
      - k8sattributes
      - batch
      - groupbyattrs/compaction
    exporters:
      - elasticsearch/logs

Log output

{"level":"error","ts":1712773343.8512416,"caller":"elasticsearchexporter@v0.97.0/elasticsearch_bulk.go:219","msg":"Drop docs: failed to index: struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\"; Cause struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\" } \"json:\\\"caused_by\\\"\" }{Type:\"mapper_parsing_exception\", Reason:\"object mapping for [Body] tried to parse field [Body] as object, but found a concrete value\", Cause:struct { Type string \"json:\\\"type\\\"\"; Reason string \"json:\\\"reason\\\"\" }{Type:\"\", Reason:\"\"}}","kind":"exporter","data_type":"logs","name":"elasticsearch/logs","attempt":1,"status":400,"stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter.pushDocuments.func1\n\tgithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/elasticsearchexporter@v0.97.0/elasticsearch_bulk.go:219\ngithub.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush\n\tgithub.com/elastic/go-elasticsearch/v7@v7.17.10/esutil/bulk_indexer.go:599\ngithub.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).init.func1\n\tgithub.com/elastic/go-elasticsearch/v7@v7.17.10/esutil/bulk_indexer.go:321"}

Additional context

No response

github-actions[bot] commented 4 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 4 months ago

Pinging code owners for exporter/elasticsearch: @JaredTan95 @ycombinator. See Adding Labels via Comments if you do not have permissions to add labels yourself.

JaredTan95 commented 4 months ago

nice catch,it's indeed a bug

jleloup commented 4 months ago

As of now I am quite a beginner in Go and OTEL code alike, though I'm trying to get to the bottom of it to fix it.

I assume those cases on elasticsearch_bulk.go should return the error to bubble it up. Then I assume it would be caught by a higher level object handling exporters & increase this metric, is that correct ?

ycombinator commented 4 months ago

I assume those cases on elasticsearch_bulk.go should return the error to bubble it up. Then I assume it would be caught by a higher level object handling exporters & increase this metric, is that correct ?

@jleloup You're on the right path here. The trick is going to be figuring out how to make pushDocuments return the error from those cases, since the item.OnFailure callback function is called asynchronously from the call to pushDocuments.

jleloup commented 4 months ago

Sadly I won't have much time to work on this issue.

jleloup commented 3 months ago

So we tried an approach to fix this using a WaitGroup though we basically get context deadline exceeded at every occurrences.

The thing is it seem we have a conflicting situation here were on one hand OTEL expects exceptions to be returned synchronously for each calls of pushLogsData() and on the other hand the ES library is doing everything it can to create its own batch of documents and bulk them all at once. Trying to synchronously batch all those documents at once overall feels like a bad idea and would impact networking costs as well as performance.

Is there a way to get a handle on the core exporter metric itself so we could update it directly in BulkIndexer OnError() closure ? WDYT ?

jleloup commented 3 months ago

Well actually I just saw PR #32359 and it seems it would change the game quite a lot, especially going for flushing data at each PushLogData() indeed.

sumwun1 commented 2 months ago

First, does this issue still need work? It looks pretty old.

I've never contributed to an open source project before (besides fixing typos in documentation), but I've read this repository's contributing.md. Is there anything I still need to do before someone can assign the issue to me?

jleloup commented 2 months ago

AFAK this issue still stands, we are recurringly having mapping issues with no impact on the metric.

Last time I checked, a refactoring was ongoing on the underlying lib used to bulk items in Elasticsearch. ,now that this has been merged, I assume it should allow for flushing item on-demand in the exporter, thus fetching a number of potential failed events during the bulk action and reporting it to OTEL.

Though I don't have time allocated to this right now, is there someone that could have a look at it ?

lalith47 commented 2 months ago

Hey, I can given an attempt to this issue. I am new to OTel collector but able to understand the issue. Will ask for help if unable to do this.

lalith47 commented 1 month ago

This PR #32632 address the actual problem.

carsonip commented 1 month ago

@lalith47 thanks for the link. I agree that this may be the same issue as #32377 and after merging #32632 I'll come back and check if it fixes this #32302

lalith47 commented 1 month ago

Sure @carsonip. Let me know once the PR is merged. I can help testing this, as I am able to replicate the issue.