Closed: mvanderlee closed this issue 6 months ago
I got the same on an empty cluster with the OpenSearch operator :\
@vchirikov The shards did eventually initialize overnight.
To recover all other shards immediately I had to manually increase the concurrency limit to be higher than the number of stuck shards.
curl -k -XPUT "${CLUSTER_ADDRESS}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient":{"cluster.routing.allocation.node_initial_primaries_recoveries":20}}'
You can replace `transient` with `persistent` to make it a permanent change.
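For reference, a minimal sketch of the persistent variant (same endpoint and setting as above; `CLUSTER_ADDRESS` is assumed to point at the cluster):
# Same call as above, but stored in persistent settings so it survives a full cluster restart
curl -k -XPUT "${CLUSTER_ADDRESS}/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent":{"cluster.routing.allocation.node_initial_primaries_recoveries":20}}'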
I found the stuck shards with this Python snippet:
import pandas as pd
from opensearchpy import OpenSearch

# cluster_option holds the connection kwargs (hosts, auth, TLS, ...)
client = OpenSearch(**cluster_option)

# List every shard as JSON, then keep only primaries that are not STARTED
shards = client.cat.shards(format='json')
shards_df = pd.DataFrame(shards)
shards_df[(shards_df['state'] != 'STARTED') & (shards_df['prirep'] == 'p')]
Yep, I tried this, but it didn't help:
PUT /_cluster/settings
{
  "transient": {
    "cluster": {
      "routing": {
        "allocation.cluster_concurrent_rebalance": 20,
        "allocation.node_concurrent_recoveries": 20,
        "allocation.enable": "all"
      }
    }
  }
}
I saw only the previous value in GET /_cluster/allocation/explain?pretty. I tried scaling the number of nodes up from 2 to 3, but after the cluster join I saw one shard in the initializing state and one unassigned :\
I gave up; since the cluster was empty, I just recreated it from scratch. After that, scale-up/scale-down worked fine.
BTW, to see shard status you can use `GET _cat/shards/<index_name>` (see the sketch below for a filtered example).
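For example, a small curl sketch (assuming the same `CLUSTER_ADDRESS` variable as in the earlier curl) that lists only shards that have not started yet:
# Show index, shard number, primary/replica flag, and state for every shard,
# keeping only the rows whose state is not STARTED.
curl -sk "${CLUSTER_ADDRESS}/_cat/shards?h=index,shard,prirep,state" | awk '$4 != "STARTED"'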
--
Update: oh, I didn't notice that you used `node_initial_primaries_recoveries`; if I see this again I will try it next time.
[Triage - attendees 1 2 3 4 5] @mvanderlee Thanks for creating this issue; however, it isn't being accepted because there isn't enough detail to reproduce the problem. Please consider searching the OpenSearch forums [1] or providing more detailed information about how to get into this state. Please feel free to open a new issue after addressing the reason.
@peternied Here is how I reached the state twice in a row (same cluster)
At a minimum I'd expect some dialog regarding things to try in order to figure out why it's stuck, or a "Can't reproduce, add more logging to shard recovery."
@mvanderlee Thanks for following up - we'd need more information to reproduce the issue. Can you share your operating system, how you are running the distribution, and then write out each action to get into this state?
@peternied
Amazon Linux 2 on an EC2 r5.4xlarge, dockerized deployment via docker-compose.
All we did was enable a Windows detector in Security Analytics.
We then noticed memory issues, so we increased the EC2 instance size.
When it stayed unavailable I investigated and ended up finding 4 `.opensearch-sap-windows-detectors-queries-xxxxx` indices stuck in `initializing`. After I increased `node_initial_primaries_recoveries` to 20, there were 13 stuck.
In total there were 66 `.opensearch-sap-windows-detectors-queries-xxxxx` indices, and the stuck ones were not in numerical order.
This happened twice in a row (we started at r5.xlarge, then r5.2xlarge, then r5.4xlarge).
@mvanderlee One last critical piece of information: please review opensearch.log, and if you see unhandled exceptions or errors, please include them with context. That would make it clear whether this is an OpenSearch issue and which area of the product is impacted.
Note: I'd recommend reviewing all log entries before publishing logs directly, as they could contain data you consider sensitive.
@peternied I really wish there was something useful in the logs, but as I mentioned there isn't.
The only exception/error in the logs is `all shards failed`, which isn't useful, as it's caused by the shards not recovering and only appears in the logs because the dashboard is trying to connect to a broken cluster.
{"type": "server", "timestamp": "2024-02-20T18:25:20,494Z", "level": "INFO", "component": "o.o.p.PluginsService", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node1", "message": "PluginService:onIndexModule index:[.opensearch-sap-windows-detectors-queries-
000053/o_yMIQC6RYC8Wai__ILFLw]", "cluster.uuid": "ExUx3KUNRmmYdFbXoROhJw", "node.id": "8ocPYpOtSwWVbXiyJtAPQA" }
{"type": "server", "timestamp": "2024-02-20T18:25:20,558Z", "level": "INFO", "component": "o.o.p.PluginsService", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node1", "message": "PluginService:onIndexModule index:[.opensearch-sap-windows-detectors-queries-
000055/a_KGOCToTcq4PQvKEVQbZA]", "cluster.uuid": "ExUx3KUNRmmYdFbXoROhJw", "node.id": "8ocPYpOtSwWVbXiyJtAPQA" }
{"type": "server", "timestamp": "2024-02-20T18:25:20,612Z", "level": "INFO", "component": "o.o.p.PluginsService", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node1", "message": "PluginService:onIndexModule index:[.opensearch-sap-windows-detectors-queries-
000054/piVrJCWjT8GG7OMEqpAomg]", "cluster.uuid": "ExUx3KUNRmmYdFbXoROhJw", "node.id": "8ocPYpOtSwWVbXiyJtAPQA" }
{"type": "server", "timestamp": "2024-02-20T18:25:20,706Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node1", "message": "path: /.kibana/_count, params: {index=.kibana}", "cluster.uuid": "ExUx3KUNRmmYdFbXoRO
hJw", "node.id": "8ocPYpOtSwWVbXiyJtAPQA" ,
"stacktrace": ["org.opensearch.action.search.SearchPhaseExecutionException: all shards failed",
"at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:706) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:379) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:745) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:503) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:280) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:357) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.threadpool.TaskAwareRunnable.doRun(TaskAwareRunnable.java:78) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:59) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-2.11.1.jar:2.11.1]",
"at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.11.1.jar:2.11.1]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
"at java.lang.Thread.run(Thread.java:833) [?:?]"] }
{"type": "server", "timestamp": "2024-02-20T18:25:20,741Z", "level": "INFO", "component": "o.o.a.u.d.DestinationMigrationCoordinator", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node1", "message": "Detected cluster change event for destination migration", "cluster.uuid": "ExUx3KUNRmmYdFbXoROhJw", "node.id": "8ocPYpOtSwWVbXiyJtAPQA" }
If you have a non-public way for me to share the entire logs, I'd gladly do so.
@mvanderlee Thanks for taking a look; sounds like an ugly issue. Reach out to me, Peter Nied, on our Slack instance (https://opensearch.org/slack.html) and we can discuss next steps.
@peternied I ran into this again with the OpenSearch operator; it looks like it restarts nodes too quickly, and this causes an unrecoverable split-brain problem. Currently I have a cluster with 2 nodes, but I saw this on a 3-node cluster as well.
3h+ stuck in recovery of a 5 KB index.
@vchirikov Could you open a new issue with your reproduction in the OpenSearch operator repository?
@mvanderlee Thanks for reaching out offline on Slack; I've taken a look at the logs.
I get a picture of the cluster state from the following exception: it looks like the Security Analytics plugin (see messages with SecurityAnalyticsException.java) and the Alerting plugin (see messages with DestinationMigrationCoordinator) are in a very tight request/retry loop, trying to fetch data and being rejected because there are too many requests in flight and too much memory in use (see messages with SearchBackpressureService). From the inner-cause rejection, the completed-tasks count is 25M items.
You can troubleshoot further by using the `_cat/tasks` API to find any long-running tasks, and the `_cat/pending_tasks` API to see cluster tasks that are still waiting to execute; maybe this can point to a specific operation that is stuck or hogging resources and can be cancelled/cleared. If you can afford to play around with the configuration, starting the service with some plugins disabled might help the startup phase succeed, and you can then onboard ISM, then Alerting, then the other plugins.
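For illustration, the two cat calls could look like this (curl sketch, with `CLUSTER_ADDRESS` assumed as in the earlier examples):
# Currently running tasks, with per-task detail
curl -sk "${CLUSTER_ADDRESS}/_cat/tasks?v&detailed"
# Cluster-state update tasks that are queued but not yet executed
curl -sk "${CLUSTER_ADDRESS}/_cat/pending_tasks?v"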
I'd recommend troubleshooting those errors by asking on forum.opensearch.org. It seems like there could be a 'task' explosion due to the index management plugin or one of the other plugins.
This is the bounds of my expertise - best of luck.
Okay, so it appears that the cluster reports itself as ready and starts to ingest before it's fully recovered. In our case, the number of events coming in caused so much backpressure that it interfered with the recovery phase. Until something like https://github.com/opensearch-project/OpenSearch/issues/11061 is implemented, we really need to ensure that we're not ingesting logs until the cluster is fully recovered.
This will require a custom sidecar application that does a detailed health check on the cluster and provides an HTTP endpoint for the proxy or load balancer to call.
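As a rough sketch of what that sidecar's check could do (assuming bash with curl and jq available, and that gating on cluster health plus shard-recovery counts is enough), the core logic might be:
#!/usr/bin/env bash
# Hypothetical readiness check: report healthy only when the cluster is green
# and no shards are still initializing or relocating.
set -euo pipefail

health="$(curl -sk "${CLUSTER_ADDRESS}/_cluster/health")"
status="$(echo "${health}" | jq -r '.status')"
initializing="$(echo "${health}" | jq -r '.initializing_shards')"
relocating="$(echo "${health}" | jq -r '.relocating_shards')"

if [ "${status}" = "green" ] && [ "${initializing}" -eq 0 ] && [ "${relocating}" -eq 0 ]; then
  exit 0   # ready: the load balancer can start routing ingest traffic
fi
echo "cluster not ready: status=${status} initializing=${initializing} relocating=${relocating}" >&2
exit 1     # not ready: keep ingest traffic away until recovery finishes
A thin HTTP wrapper around this exit code would give the proxy or load balancer the endpoint to poll.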
@mvanderlee `hot_threads` might be one way to check whether there are genuinely stuck threads, or whether something is missing in how the shard state is currently being updated.
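For instance, a one-liner against the nodes API (`CLUSTER_ADDRESS` assumed as in the earlier examples):
# Dump the hottest threads on each node; a thread stuck in shard recovery
# should keep showing up across repeated calls.
curl -sk "${CLUSTER_ADDRESS}/_nodes/hot_threads?threads=5"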
Describe the bug
13 shards are still stuck in 'initializing' 3 hours after a node restart (single-node cluster). Initially this caused all indices to be unavailable until I increased
`cluster.routing.allocation.node_initial_primaries_recoveries`
to 20. I cannot find any logs or other information to help debug this issue; any guidance would be appreciated.
Related component
Other
To Reproduce
I have no idea.
We have a single-node cluster and enabled Windows detectors yesterday. Nothing but trouble since then.
Expected behavior
The cluster should be able to restart and have all indices come back online. At a minimum there should be logging and/or timeouts when recovering indices.
Additional Details
OpenSearch 2.11.1