[BUG] Updating some search backpressure settings crashes the cluster

gaobinlong commented 2 months ago

Describe the bug

This issue comes from the forum: https://forum.opensearch.org/t/unable-to-start-opensearch-loop-failed-to-apply-settings-and-rate-must-be-greater-than-zero/20908.

When the setting search_backpressure.cancellation_burst (deprecated), search_backpressure.search_task.cancellation_burst or search_backpressure.search_shard_task.cancellation_burst is updated to a non-default value, the cluster fails to apply the settings and throws org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero. The cluster gets stuck in this state and all operations on the master node fail; even restarting the cluster doesn't help.

Related component

Cluster Manager

To Reproduce

Update the setting:
```
PUT _cluster/settings
{
  "persistent": {
    "search_backpressure.search_task.cancellation_burst": 11
  }
}
```
To avoid leaving your cluster unable to come back even after a restart, you can change persistent to transient.

Expected behavior

Fix the bug.

jainankitk commented 2 months ago

@kaushalmahi12 - Can you look into this? While @gaobinlong already has a PR for the cancellation_burst setting, let us also validate the other settings for Search Backpressure and Workload Management.

rajiv-kv commented 2 months ago

[Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces around as to why the cluster-manager fails to come back after restart ?

reta commented 2 months ago

@gaobinlong mind please updating the documentation for these settings [1], thank you

[1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

gaobinlong commented 1 month ago

> @gaobinlong mind please updating the documentation for these settings [1], thank you
>
> [1] https://github.com/opensearch-project/documentation-website/blob/main/_tuning-your-cluster/availability-and-recovery/search-backpressure.md

Thanks @reta, I've created a documentation PR for it: https://github.com/opensearch-project/documentation-website/pull/8555.

gaobinlong commented 1 month ago

> [Triage - attendees 1 2 3] - @jainankitk / @gaobinlong - Can we add more details/stacktraces around as to why the cluster-manager fails to come back after restart ?

Here are the stack traces:

[2024-08-20T09:22:27,818][INFO ][o.o.c.s.ClusterApplierService] [opensearch-master-data-node-33] cluster-manager node changed {previous [{opensearch-master-data-node-33}{yOd-Z9CZR82IUxPxee3KrQ}{ik8U02GyQfSyYwQd_JqNNw}{172.24.0.33}{172.24.0.33:9300}{dimr}{shard_indexing_pressure_enabled=true}], current []}, term: 21261, version: 96514, reason: becoming candidate: clusterApplier#onNewClusterState
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.metadata.perf_analyzer.state] from [] to [0]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.cluster_concurrent_rebalance] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [8]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_bytes_per_sec] from [41943040b] to [500mb]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_file_chunks] from [2] to [5]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [indices.recovery.max_concurrent_operations] from [1] to [4]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [cluster.max_shards_per_node] from [1000] to [3000]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [plugins.index_state_management.template_migration.control] from [0] to [-1]
[2024-08-20T09:22:27,819][INFO ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] updating [search_backpressure.cancellation_burst] from [10.0] to [10]
[2024-08-20T09:22:27,819][WARN ][o.o.c.s.ClusterSettings  ] [opensearch-master-data-node-33] failed to apply settings
org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: rate must be greater than zero
    at org.opensearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:209) ~[opensearch-core-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:275) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.setCancellationBurst(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.settings.Setting$Updater.apply(Setting.java:1254) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:696) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:232) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:558) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:486) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:188) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:283) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:246) [opensearch-2.12.0.jar:2.12.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalArgumentException: rate must be greater than zero
    at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:52) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.util.TokenBucket.<init>(TokenBucket.java:47) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.SearchBackpressureState.onRateChanged(SearchBackpressureState.java:95) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.SearchBackpressureState.onBurstChanged(SearchBackpressureState.java:101) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.lambda$setCancellationBurst$2(SearchShardTaskSettings.java:257) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.search.backpressure.settings.SearchShardTaskSettings.notifyListeners(SearchShardTaskSettings.java:269) ~[opensearch-2.12.0.jar:2.12.0]
    ... 13 more

The cluster manager is not able to apply the invalid settings because the cluster state is corrupt. After executing ./bin/opensearch-node remove-settings search_backpressure.cancellation_burst to remove the invalid settings from the cluster state, the cluster comes back.
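
For anyone reading the trace, here is a minimal sketch of the call chain it shows: the burst listener calls onBurstChanged, which calls onRateChanged, which builds a new TokenBucket whose constructor rejects a non-positive rate. The classes below are simplified stand-ins, not the real OpenSearch code. The trace does not say where the non-positive rate comes from; the sketch assumes, purely for illustration, a cached rate field that was never initialized. The values 0.003 and 10.0 are illustrative.

```java
// Simplified stand-ins, NOT the real OpenSearch classes. Only the call chain
// visible in the stack trace above is modeled.
final class BurstChangeSketch {

    // Stand-in for TokenBucket: models only the constructor check that throws.
    static final class TokenBucketStandIn {
        final double rate;
        final double burst;

        TokenBucketStandIn(double rate, double burst) {
            if (rate <= 0.0) {
                throw new IllegalArgumentException("rate must be greater than zero");
            }
            this.rate = rate;
            this.burst = burst;
        }
    }

    // Stand-in for SearchBackpressureState.
    static final class StateStandIn {
        // ASSUMPTION (not confirmed in this thread): the cached rate is never
        // stored at construction time, so it keeps the Java default of 0.0.
        private double cachedRate;
        private double cachedBurst;
        private TokenBucketStandIn rateLimiter;

        StateStandIn(double initialRate, double initialBurst) {
            this.cachedBurst = initialBurst;
            // The limiter starts out valid, but cachedRate is not assigned here.
            this.rateLimiter = new TokenBucketStandIn(initialRate, initialBurst);
        }

        void onRateChanged(double rate) {
            this.cachedRate = rate;
            this.rateLimiter = new TokenBucketStandIn(rate, cachedBurst);
        }

        // Mirrors the trace: a burst update rebuilds the limiter via onRateChanged.
        void onBurstChanged(double burst) {
            this.cachedBurst = burst;
            onRateChanged(cachedRate);
        }

        TokenBucketStandIn rateLimiter() {
            return rateLimiter;
        }
    }

    public static void main(String[] args) {
        StateStandIn state = new StateStandIn(0.003, 10.0);
        try {
            state.onBurstChanged(11.0); // analogous to setting cancellation_burst to 11
        } catch (IllegalArgumentException e) {
            System.out.println("failed to apply settings: " + e.getMessage());
        }
    }
}
```

In this sketch the exception is thrown every time the burst update is applied. That would be consistent with a persistent setting failing again after a restart, since it is re-applied from the cluster state, but treat that as an inference rather than a confirmed root cause.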
