Execute item failure callback execution on bulk request error

kellen-miller commented 17 hours ago

Description

Execute all bulk item failure callbacks (if they exist) in worker's buffer when a bulk request fails.

Issues Resolved

Fixes https://github.com/opensearch-project/opensearch-go/issues/626

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

codecov[bot] commented 17 hours ago

Codecov Report

Attention: Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.

Project coverage is 60.31%. Comparing base (06a6dc8) to head (f416719). Report is 70 commits behind head on main.

Files with missing lines	Patch %	Lines
opensearchutil/bulk_indexer.go	0.00%	10 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #627 +/- ## ========================================== + Coverage 57.29% 60.31% +3.01% ========================================== Files 315 374 +59 Lines 9823 8794 -1029 ========================================== - Hits 5628 5304 -324 + Misses 2902 2182 -720 - Partials 1293 1308 +15 ``` | [Flag](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | Coverage Δ | | |---|---|---| | [integration](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | `60.31% <0.00%> (+9.47%)` | :arrow_up: | | [unit](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | `?` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files with missing lines](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project) | Coverage Δ | | |---|---|---| | [opensearchutil/bulk\_indexer.go](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627?src=pr&el=tree&filepath=opensearchutil%2Fbulk_indexer.go&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project#diff-b3BlbnNlYXJjaHV0aWwvYnVsa19pbmRleGVyLmdv) | `59.41% <0.00%> (-14.40%)` | :arrow_down: | ... and [273 files with indirect coverage changes](https://app.codecov.io/gh/opensearch-project/opensearch-go/pull/627/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=opensearch-project)

kellen-miller commented 16 hours ago

I ordered the execution of the item OnFailure callbacks before the execution of the bulk indexer's OnError callback, but am open to feedback on whether that order should be reversed.

My thought process was if the OnError callback is executed first, a user may have some logic to shutdown/panic their application, which would result in the item OnFailure callbacks to not be executed inadvertently.

kellen-miller commented 15 hours ago

Correct that OnFailure below if the response is > 201 is called when the bulk request is successful and an individual item in the request fails, say OpenSearch had an indexing failure with the item.

But if the whole bulk request errors, for example if OpenSearch goes down while the request is in flight, or the context deadlines, the flush function returns without executing the item failure callbacks and the workers buffer is cleared. So those items are lost and their callbacks are never executed.

dblock commented 15 hours ago

Correct that OnFailure below if the response is > 201 is called when the bulk request is successful and an individual item in the request fails, say OpenSearch had an indexing failure with the item.

But if the whole bulk request errors, for example if OpenSearch goes down while the request is in flight, or the context deadlines, the flush function returns without executing the item failure callbacks and the workers buffer is cleared. So those items are lost and their callbacks are never executed.

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

kellen-miller commented 15 hours ago

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

Yep the failure callback will only be executed once. If the response is successful (the error here is nil), the code in this PR is never hit.

My unit test I added has a check to make sure that the number of times the failure callback is executed is what is expected i.e. nothing is called twice. https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L833

dblock commented 14 hours ago

Understood. And just to double-check, we don't call those callbacks twice in the successful response but some failed items case? Add a test for this?

Yep the failure callback will only be executed once. If the response is successful (the error here is nil), the code in this PR is never hit.

My unit test I added has a check to make sure that the number of times the failure callback is executed is what is expected i.e. nothing is called twice. https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L833

I think there's still a test missing for partial failure on some of the items, you want to check that if you have 5 items and 2 failures the callback is only called twice, for example?

kellen-miller commented 13 hours ago

I think there's still a test missing for partial failure on some of the items, you want to check that if you have 5 items and 2 failures the callback is only called twice, for example?

I think this unit test earlier in the file already covers that? https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L277

Specifically this check here in the unit test: https://github.com/kellen-miller/opensearch-go/blob/bugfix/execute-item-fail-callbacks-bulk-error/opensearchutil/bulk_indexer_internal_test.go#L391-L398

I might be misunderstanding though, I'm happy to add more if that's not the use case you're referring too.

dblock commented 13 hours ago

I might be misunderstanding though, I'm happy to add more if that's not the use case you're referring too.

First, thank you for hanging in here with me ;)

The old code did not call item.OnFailure(ctx, item, info, err) in this scenario at all, but did increment the number of fails (and then that was tested). So those tests with only some item callbacks also should check how many times item.OnFailure(ctx, item, info, err) has been invoked and that it was invoked with all the right items.

Without trying, some examples where I think all tests pass but the code is broken: call item.OnFailure(ctx, **the first item every time**, info, err). Or, with first 2/5 items failing call onFailure 5 times.

kellen-miller commented 12 hours ago

The old code did not call item.OnFailure(ctx, item, info, err) in this scenario at all, but did increment the number of fails (and then that was tested). So those tests with only some item callbacks also should check how many times item.OnFailure(ctx, item, info, err) has been invoked and that it was invoked with all the right items.

Without trying, some examples where I think all tests pass but the code is broken: call item.OnFailure(ctx, the first item every time, info, err). Or, with first 2/5 items failing call onFailure 5 times.

Ah gotcha gotcha, sorry for some much back and forth 😄. Let me update the other tests to check for this.

dblock commented 12 hours ago

Ah gotcha gotcha, sorry for some much back and forth 😄. Let me update the other tests to check for this.

No stress! I enjoy this type of conversations because I am super OCD about tests :)

opensearch-project / opensearch-go