jleloup opened this issue 4 months ago
Pinging code owners for exporter/elasticsearch: @JaredTan95 @ycombinator. See Adding Labels via Comments if you do not have permissions to add labels yourself.
Nice catch, it's indeed a bug.
As of now I am quite a beginner in both Go and the OTEL codebase, but I'm trying to get to the bottom of this in order to fix it.
I assume those cases in elasticsearch_bulk.go should return the error so it bubbles up. Then I assume it would be caught by a higher-level object handling exporters, which would increase this metric. Is that correct?
@jleloup You're on the right path here. The trick is going to be figuring out how to make `pushDocuments` return the error from those cases, since the `item.OnFailure` callback function is called asynchronously from the call to `pushDocuments`.
Sadly I won't have much time to work on this issue.
So we tried an approach to fix this using a `WaitGroup`, but we basically get `context deadline exceeded` at every occurrence.
The thing is, we seem to have a conflicting situation here: on one hand OTEL expects errors to be returned synchronously from each call to `pushLogsData()`, and on the other hand the ES library does everything it can to build its own batch of documents and bulk them all at once.
Trying to synchronously flush all those documents at once feels like a bad idea overall, and would hurt both networking costs and performance.
Is there a way to get a handle on the core exporter metric itself so we could update it directly in the `BulkIndexer` `OnError()` closure?
WDYT?
Well, actually I just saw PR #32359 and it seems it would change the game quite a lot, especially since it goes for flushing data at each `PushLogData()` call.
First, does this issue still need work? It looks pretty old.
I've never contributed to an open source project before (besides fixing typos in documentation), but I've read this repository's contributing.md. Is there anything I still need to do before someone can assign the issue to me?
AFAIK this issue still stands; we are regularly hitting mapping issues with no impact on the metric.
Last time I checked, a refactoring was ongoing on the underlying lib used to bulk items into Elasticsearch. Now that this has been merged, I assume it should allow flushing items on demand in the exporter, thus fetching the number of failed events during the bulk action and reporting it to OTEL.
I don't have time allocated to this right now, though. Is there someone who could have a look at it?
Hey, I can give this issue an attempt. I am new to the OTel collector but I'm able to understand the issue. I'll ask for help if I get stuck.
PR #32632 addresses the actual problem.
@lalith47 thanks for the link. I agree that this may be the same issue as #32377, and after merging #32632 I'll come back and check if it fixes #32302.
Sure @carsonip. Let me know once the PR is merged. I can help testing this, as I am able to replicate the issue.
Component(s)
exporter/elasticsearch
What happened?
Description
I think the Elasticsearch exporter does not report failures properly in the otelcol_exporter_send_failed_log_records metric. I am testing the behaviour of the Elasticsearch exporter in the context of various failures, e.g.:
During these tests I can see errors being reported in the logs, e.g. for a mapping conflict: see the log output below for an example.
However, I can't see such failures reflected in otelcol_exporter_send_failed_log_records.
I saw the very same behavior with another type of error: missing Elasticsearch permissions to create indices, which prevents the exporter from creating a new index when pushing if said index does not exist.
Steps to Reproduce
Trigger an Elasticsearch push failure:
Expected Result
The otelcol_exporter_send_failed_log_records counter is increased according to the failed log records.
Actual Result
otelcol_exporter_send_failed_log_records does not increase.
Collector version
v0.97.0
Environment information
Environment
Kubernetes 1.28
OpenTelemetry Collector configuration
Log output
Additional context
No response