opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
8.89k stars 1.63k forks source link

[Backport] Support for translog pruning based on retention leases #1226

Closed saikaranam-amazon closed 2 years ago

saikaranam-amazon commented 2 years ago

Description

Backport of https://github.com/opensearch-project/OpenSearch/pull/1038

Issues Resolved

N/A

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed c5ff5261b102d8cf6fe83568daf6ab86711c9cde

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success c5ff5261b102d8cf6fe83568daf6ab86711c9cde

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success c5ff5261b102d8cf6fe83568daf6ab86711c9cde

itiyamas commented 2 years ago

start gradle check

opensearch-ci-bot commented 2 years ago

:x:   Gradle Check failure c5ff5261b102d8cf6fe83568daf6ab86711c9cde Log 504

Reports 504

nknize commented 2 years ago

I've seen these issues before:

* What went wrong:
Execution failed for task ':distribution:bwc:minor:buildBwcLinuxTar'.
> Building 1.0.0 didn't generate expected file /var/CITOOL/workflow/OpenSearch_CI/PR_Checks/Gradle_Check/search/distribution/bwc/minor/build/bwc/checkout-1.0/distribution/archives/linux-tar/build/distributions/opensearch-1.0.0-SNAPSHOT-linux-x64.tar.gz

It usually means the checked out codebase that's being built for bwc testing generated a version artifact that did not match the expected version. In this case, the build expected a 1.0.0 build but got something else (my money is on 1.0.1 since 1.0 branch is now at a staged bugfix release).

I haven't had a chance to circle back on this so I'm speculating that the bwc builder is either looking for (or needs to look for) the release tag to check out and build the appropriate bwc version. We inherited that code and release tags upstream always looked like vM.m.p until OpenSearch decided to drop the github recommended v in favor of just the release number M.m.p. Perhaps the Bwc distribution builder is looking for v1.0.0 tag and not finding it so falling back to the 1.0 branch (which is now a staged 1.0.1 and :boom: ).

@adnapibar is familiar with the BWC builder so perhaps he can take a look?

dblock commented 2 years ago

We should re-add the v in the tag if this is the case, see https://github.com/opensearch-project/.github/issues/35.

nknize commented 2 years ago

Is this feature a blocker for the 1.1 release or can it hold and bake a while for 1.2?

saikaranam-amazon commented 2 years ago

Is this feature a blocker for the 1.1 release or can it hold and bake a while for 1.2?

This is blocker for 1.1 release

adnapibar commented 2 years ago

Can you rebase your branch onto https://github.com/opensearch-project/OpenSearch/tree/1.x

nknize commented 2 years ago

This is blocker for 1.1 release

What is it blocking?

saikaranam-amazon commented 2 years ago

Can you rebase your branch onto https://github.com/opensearch-project/OpenSearch/tree/1.x

Done. Pushed the change

What is it blocking?

The changes are needed for the replication plugin releasing with 1.1

nknize commented 2 years ago

The changes are needed for the replication plugin releasing with 1.1

Do you have a merged PR on the replication plugin repo that depends on this logic?

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 81f64c4d5271944b354d5698becc87579fdfbc5e

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed 81f64c4d5271944b354d5698becc87579fdfbc5e

saikaranam-amazon commented 2 years ago

Do you have a merged PR on the replication plugin repo that depends on this logic?

Yes, changes are merged here: link

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 81f64c4d5271944b354d5698becc87579fdfbc5e

nknize commented 2 years ago

Yes, changes are merged here:

That change was merged 2 months ago; and it depends on a setting that has not been in OpenSearch core until today? How is this possible? Release is scheduled for less than a week and we're trying to backport a new setting that has no time to bake.

Huge -1 for this.

saikaranam-amazon commented 2 years ago

That change was merged 2 months ago; and it depends on a setting that has not been in OpenSearch core until today? How is this possible? Release is scheduled for less than a week and we're trying to backport a new setting that has no time to bake.

The changes related to this PR on OpenSearch was raised more than 1 month back after thorough testing from our side. As Opensearch team was tracking multiple releases in between 1.1 and 1.0, changes to previous version were prioritized and reviewed before this.

nknize commented 2 years ago

The changes related to this PR on OpenSearch was raised more than 1 month back...

The issue was opened 22 days ago. The PR was opened before the issue. The changes were only recently merged to main, and a backport just now opened (2 hrs ago) against 1.x less than a week before the planned release. This is too late. I'm a fan of the proposed change but this didn't have enough time to bake before the 1.1 release. Huge recommendation to bump to 1.2.

after thorough testing from our side.

Unfortunately that's not the primary core CI; and we're now playing gradle check roulette in hopes that bwc picks a seed that passes all gates in order to wedge it in time for a release. This isn't a good approach as it only leads to knee jerk bugfix releases (which are not the easiest to do atm).

nknize commented 2 years ago

If we don't want to hold the feature for a 1.2 release then I recommend we hold the 1.1 release for at least a week to let this settle. I like the change, it's just too late to be wedging these in for a 1.1 release in less than a week. I'm going to mark this as a blocker for 1.1 so we have traceability. Let's hold on removing that blocker label until we're comfortable the changes have had time to bake.

adnapibar commented 2 years ago

start gradle check

CEHENKLE commented 2 years ago

@saikaranam-amazon Hi Sai! Can you tell us more about how your testing was done? If we're still seeing changes coming in against 1.1, how have you been able to validate the change against the full set?

Thanks! /C

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Check success 81f64c4d5271944b354d5698becc87579fdfbc5e Log 510

Reports 510

saikaranam-amazon commented 2 years ago

Can you tell us more about how your testing was done? If we're still seeing changes coming in against 1.1, how have you been able to validate the change against the full set?

These changes are tested on test clusters from our side. I understand that the 1.x branch is active and seeing changes, but the current changes are contained addressing some of the issues identified during replication plugin testing. These shouldn't affect existing OS 1.x code base.

nknize commented 2 years ago

I think we can continue to proactively work and merge these incremental improvements in main and let them bake while figuring out a best path to release. That's the nice thing about our branch strategy, no reason to perfect everything before merging, and no reason to jump the gun releasing a feature that needs a little more time and due diligence.

I'm wondering if we can leverage some of the OverrideablePlugin mechanisms that @setiah has been working on and implement these tlog settings through an extension in the CCR plugin? Alternatively we could look at adding a feature flag that only enables this when the flag is set? This would give more guardrails from users unknowingly setting something without fully understanding consequences.

nknize commented 2 years ago

Instead of implementing this IndexSetting in the core and baking it into the default TranslogDeletionPolicy can we refactor this PR to make the TranslogDeletionPolicy extendable (protected methods and variables) and implement the retention lease behavior in something like a CCRTranslogDeletionPolicy extends TranslogDeletionPolicy within the CCR plugin? I think this would help core not introduce a new IndexSetting that has to be supported long term (along w/ BWC) when we're trying to make translog logic become an extendable feature anyway to support segment level replication.

This extended DeletionPolicy can be produced by the TranslogConfig which has the IndexSettings available to it. The setting can then be implemented by the CCR plugin which implements a ReplicationEngine anyway (which can provide the custom DeletionPolicy). This totally decouples this logic into the one and only plugin that requires this setting allowing the plugin to implement its own data integrity tests (can you link that test plan here? maybe open a test plan issue in the cross-cluster-replication repo so it lives w/ the plugin code).

mch2 commented 2 years ago

Would we ever wire up CCR without this enabled? I'm wondering if we need to even bother with making this a setting. We can wire up a subclass of TranslogDeletionPolicy inside of ReplicationEngine, meaning it would be default behavior if CCR is installed.

nknize commented 2 years ago

Would we ever wire up CCR without this enabled?

That's a great question. I'm not all that familiar with the cross-cluster-replication plugin mechanics. I think you spin up a follower cluster as opposed to a follower index? If that's the case then logically it seems all indexes in the leader cluster are replicated to the follower cluster? And if so what's the logic in adding index level fidelity to change the deletion policy? It seems a reasonable assertion that retaining uncompressed tlog ops based on retention lease to avoid decompressing lucene stored fields significantly increases translog overhead? So maybe it makes sense to retain this index setting on high churn indexes?

mch2 commented 2 years ago

@nknize Are we looking at extracting changes from https://github.com/opensearch-project/OpenSearch/pull/1227 as well?

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 22ad35f997f955c09b5fabe9bb0ac776b7113ac7

opensearch-ci-bot commented 2 years ago

:x:   DCO Check Failed 22ad35f997f955c09b5fabe9bb0ac776b7113ac7 Run ./dev-tools/signoff-check.sh remotes/origin/1.x 22ad35f997f955c09b5fabe9bb0ac776b7113ac7 to check locally Use git commit with -s to add 'Signed-of-by: {EMAIL}' on impacted commits

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 3cb3577b2e552e812333c79ec5f66d2f0f0c407b

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed 3cb3577b2e552e812333c79ec5f66d2f0f0c407b

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 22ad35f997f955c09b5fabe9bb0ac776b7113ac7

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 3cb3577b2e552e812333c79ec5f66d2f0f0c407b

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 73235f85bca1ac9664eea5aeed142a1437a0bdd1

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed 73235f85bca1ac9664eea5aeed142a1437a0bdd1

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 73235f85bca1ac9664eea5aeed142a1437a0bdd1

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 18138a88e0eb26d4beaea38c32c5244ef7051916

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed 18138a88e0eb26d4beaea38c32c5244ef7051916

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 18138a88e0eb26d4beaea38c32c5244ef7051916

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed ed48381bfae62e0104617fcf0e0f79f1d5ef1c73

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success ed48381bfae62e0104617fcf0e0f79f1d5ef1c73

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success ed48381bfae62e0104617fcf0e0f79f1d5ef1c73

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 91be63ce4b67a1a3a2de9183d46f93273aea4803

opensearch-ci-bot commented 2 years ago

:x:   DCO Check Failed 91be63ce4b67a1a3a2de9183d46f93273aea4803 Run ./dev-tools/signoff-check.sh remotes/origin/1.x 91be63ce4b67a1a3a2de9183d46f93273aea4803 to check locally Use git commit with -s to add 'Signed-of-by: {EMAIL}' on impacted commits

opensearch-ci-bot commented 2 years ago

:white_check_mark:   DCO Check Passed 9d4b719a90091134b199b01246e53412d2c4b4fc

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Wrapper Validation success 9d4b719a90091134b199b01246e53412d2c4b4fc

opensearch-ci-bot commented 2 years ago

:white_check_mark:   Gradle Precommit success 91be63ce4b67a1a3a2de9183d46f93273aea4803