
[BUG] OpenSearch not starting - [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block]; #14791

DumboJetEngine opened 4 months ago

DumboJetEngine commented 4 months ago

Describe the bug

After indexing some documents, my OpenSearch (v2.9.0) no longer starts. In the opensearch.log file, I get this error:

org.opensearch.cluster.block.ClusterBlockException: index [.opensearch-observability] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
    at org.opensearch.cluster.block.ClusterBlocks.indicesBlockedException(ClusterBlocks.java:243) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.block.ClusterBlocks.indexBlockedException(ClusterBlocks.java:221) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:95) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:56) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction.doStart(TransportClusterManagerNodeAction.java:237) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.action.support.clustermanager.TransportClusterManagerNodeAction$AsyncSingleAction$2.onNewClusterState(TransportClusterManagerNodeAction.java:336) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.ClusterStateObserver$ContextPreservingListener.onNewClusterState(ClusterStateObserver.java:380) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:230) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:625) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:613) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:577) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) ~[opensearch-2.9.0.jar:2.9.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]

I have freed up some disk space (Win11 system) and now have 64 GB free, but the error didn't go away.

I've found some related ElasticSearch questions, but I don't think I can use any of the solutions, given that I can't even start OpenSearch: https://stackoverflow.com/questions/48155774/elasticsearch-read-only-allow-delete-auto-setting

BTW, my free disk space was at 16%. According to the link above, Elasticsearch requires at least 15% to be free.

Then I freed up more space, so my disk had 25% free. Nothing changed with OpenSearch...

Then I moved the OpenSearch folder to an external disk with 75% free space. Again, nothing changed with OpenSearch... This is insane! What is wrong with it?

Here are the complete logs: https://gist.github.com/DumboJetEngine/8cdeaf9159e9af14831a352b1f5a3128

This is the second time I've hit this problem. I set up a new installation from scratch, reindex my documents, and the next day, when I start OpenSearch, I get this issue.

There is also this error, which apparently is related to the security plugin:

Failure No shard available for [org.opensearch.action.get.MultiGetShardRequest@27795f5b] retrieving configuration for [INTERNALUSERS, ACTIONGROUPS, CONFIG, ROLES, ROLESMAPPING, TENANTS, NODESDN, WHITELIST, ALLOWLIST, AUDIT] (index=.opendistro_security)

...according to this.

Related component

Other

To Reproduce

I guess, indexing 2.5 million documents with OpenSearch 2.9.0.

Expected behavior

OpenSearch starts up.

Additional Details

Plugins
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-alerting]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-anomaly-detection]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-asynchronous-search]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-cross-cluster-replication]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-geospatial]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-index-management]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-job-scheduler]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-knn]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-ml]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-neural-search]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-notifications]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-notifications-core]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-observability]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-reports-scheduler]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-security]
[2024-07-17T13:57:57,391][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-security-analytics]
[2024-07-17T13:57:57,391][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-sql]

Host/Environment (please complete the following information):

peternied commented 4 months ago

[Triage - attendees 1 2 3] @DumboJetEngine Thanks for creating this issue and including the detailed logs. While you wait for an area expert to engage, I'd recommend visiting our forums [1], which are a better support channel in the event this isn't a bug.

dblock commented 4 months ago

This is likely not a bug, but it definitely feels like one. @DumboJetEngine maybe try to dig into which exact metrics caused TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block? It seems like you have enough disk space, but maybe you can dig through the logs to find the exact numbers and figure out why OpenSearch thinks otherwise.
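If it helps with the digging: once a node is reachable over REST, the effective watermark settings and the disk numbers the allocator sees can be checked directly. A minimal sketch, assuming default endpoints (whether the node stays up long enough to answer is of course the open question here):

    GET _cluster/settings?include_defaults=true
    (look for the cluster.routing.allocation.disk.* watermark values)

    GET _cat/allocation?v
    (shows disk.used / disk.avail / disk.percent per node, as the allocator sees them)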

DumboJetEngine commented 4 months ago

@dblock Any idea which log file to look into? I have already given you the complete opensearch.log file, because I thought this was the most relevant one. In there, it says: using [1] data paths, mounts [[Expansion (D:)]], net usable_space [5.5tb], net total_space [7.2tb], types [NTFS]. This is from the external drive I tested it on.

I don't know whether it's possible that it remembers having had less free space during an earlier run.

dblock commented 4 months ago

I really don't know. I would read the code backward from where the error is thrown to figure out which metrics it uses, modify the code to log those values (logging that seems to be missing), and then look at the numbers for those metrics. It's quite a bit of work :) I am sure there's a rational explanation for why this error happens.

DumboJetEngine commented 4 months ago

@dblock BTW, here is my config file, just in case I am doing something stupid that can cause this:

######## Start OpenSearch Security Demo Configuration ######## 
# WARNING: revise all the lines below before you go into production 
plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.pemcert_filepath: .\_certs\server_app.myapp.local.crt
plugins.security.ssl.transport.pemkey_filepath: .\_certs\server_app.myapp.local.key
plugins.security.ssl.transport.pemtrustedcas_filepath: .\_certs\app.myapp.local_CA.pem
plugins.security.ssl.transport.enabled_protocols:
  - "TLSv1.2" 
plugins.security.ssl.transport.enforce_hostname_verification: false

plugins.security.ssl.http.enabled: true 
plugins.security.ssl.http.pemcert_filepath: .\_certs\server_app.myapp.local.crt
plugins.security.ssl.http.pemkey_filepath:  .\_certs\server_app.myapp.local.key
plugins.security.ssl.http.pemtrustedcas_filepath: .\_certs\app.myapp.local_CA.pem
plugins.security.ssl.http.enabled_protocols:
  - "TLSv1.2" 

plugins.security.allow_unsafe_democertificates: true 
plugins.security.allow_default_init_securityindex: true 
plugins.security.authcz.admin_dn: 
  - CN=,O=,L=Greece,ST=Greece,C=GR
plugins.security.disabled: false
plugins.security.nodes_dn:
  - 'CN=opensearch.myapp.local'

plugins.security.audit.type: internal_opensearch 
plugins.security.enable_snapshot_restore_privilege: true 
plugins.security.check_snapshot_restore_write_privileges: true 
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"] 
plugins.security.system_indices.enabled: true 
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models"] 
node.max_local_storage_nodes: 3 
######## End OpenSearch Security Demo Configuration ######## 

network.host: 127.0.0.1
http.port: 9200

indices.query.bool.max_clause_count: 50000

The weird part is that it used to work fine for quite some time, and I can't figure out what has changed since then to break it. The amount of free storage is definitely one possibility, but that is restored now.

DumboJetEngine commented 4 months ago

@dblock If it would be of any help, I could probably upload the entire OpenSearch folder somewhere, together with the node data, for you to test locally (if you want to do so). It is 3 GB uncompressed; compressing it brings it down to 1.9 GB.

Or I could patch a JAR file, if you decide to modify the 2.9.0 code, to get more log information.

dblock commented 4 months ago

Appreciate your help. I hope someone can dig into this.

zane-neo commented 3 months ago

@DumboJetEngine @dblock I looked into this and I think this is a bug. When the nodes' disk usage exceeds the flood-stage threshold, all indices are marked with READ_ONLY_ALLOW_DELETE blocks, and these blocks also get persisted in the cluster state metadata. When the cluster spins up, the cluster state is recovered from disk and applied to the cluster. When the plugins create their system indices, the create-index transport action first checks whether the index has any block; if it does, the index creation fails with an error like the one in the posted log.
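For what it's worth, on a node that is reachable over REST, the blocks that get persisted and re-applied can be listed via the cluster state blocks metric. A minimal sketch, assuming default endpoints:

    GET _cluster/state/blocks

An index carrying the flood-stage block should show up under blocks.indices with block id 12 (the read-only-allow-delete block from the error above).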

An easier way to reproduce this issue is to spin up a small cluster or a single node and follow the steps below:

  1. Set disk-related cluster settings to a small percentage value, e.g.:
     PUT /_cluster/settings
     {
       "persistent": {
         "cluster.routing.allocation.disk.threshold_enabled": true,
         "cluster.routing.allocation.disk.watermark.low": "50%",
         "cluster.routing.allocation.disk.watermark.high": "50%",
         "cluster.routing.allocation.disk.watermark.flood_stage": "50%"
       }
     }
  2. Create an index.
  3. Wait around 30 seconds and create the index again; you should see an error like: blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
  4. Change the cluster settings to a big value, e.g. 99%, and update.
  5. Restart the cluster and create the same index immediately after the cluster has started; you should see the same error as above, which means the index blocks were recovered from disk (see the sketch after this list for clearing the block manually).
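For completeness, once a cluster is up and disk space is actually fine, the block itself can also be cleared manually, which is the usual remedy from the Elasticsearch thread linked earlier (a sketch only; it does not address the startup failure this issue is about):

    PUT /_all/_settings
    {
      "index.blocks.read_only_allow_delete": null
    }

Setting the value to null removes the block from the affected indices' settings.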

I have created this draft PR to fix this: https://github.com/opensearch-project/OpenSearch/pull/15258. @DumboJetEngine, if you would like, I can post an OpenSearch zip built with the changes, and you can try it with your data to see if it fixes the issue on your side.

Thanks.

danieltodea19 commented 2 months ago

Any news on this topic? I have the same issue on OpenSearch 2.11.1.

DumboJetEngine commented 2 months ago

@zane-neo Sorry for the late response. I can try out your ZIP. I hope it will be compatible with the data files that v2.9.0 uses.

zane-neo commented 1 month ago

@DumboJetEngine @danieltodea19 I investigated further and can reproduce the issue locally. After testing, I found that the initial investigation was not correct. The actual root cause is:

  1. The Observability plugin creates a system index during cluster startup: link. If the cluster has an index block persisted in its state, that block is applied to the cluster state during node startup, the index creation fails, and this exception is thrown.
  2. OpenSearch starts the cluster in this method, and the error keeps being thrown until the OpenSearch startup command execution has finished.
  3. At the moment the startup command completes, the JVM only has daemon threads and no non-daemon threads, so the JVM exits and the cluster shuts down. I printed the status of all threads at this point (pasting only part of them, otherwise this would be too long):
    [2024-09-29T11:28:47,433][INFO ][stdout                   ] [bcd07465d161] ----------------
    [2024-09-29T11:28:47,434][INFO ][stdout                   ] [bcd07465d161] Thread ID: 77
    [2024-09-29T11:28:47,434][INFO ][stdout                   ] [bcd07465d161] Name: opensearch[bcd07465d161][fetch_shard_started][T#4]
    [2024-09-29T11:28:47,434][INFO ][stdout                   ] [bcd07465d161] State: TIMED_WAITING
    [2024-09-29T11:28:47,434][INFO ][stdout                   ] [bcd07465d161] Is Daemon: true
    [2024-09-29T11:28:47,434][INFO ][stdout                   ] [bcd07465d161] ----------------
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Thread ID: 78
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Name: opensearch[bcd07465d161][fetch_shard_started][T#5]
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] State: TIMED_WAITING
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Is Daemon: true
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] ----------------
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Thread ID: 79
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Name: DefaultDispatcher-worker-1
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] State: RUNNABLE
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] Is Daemon: true
    [2024-09-29T11:28:47,435][INFO ][stdout                   ] [bcd07465d161] ----------------
    [2024-09-29T11:28:47,436][INFO ][stdout                   ] [bcd07465d161] Thread ID: 80
    [2024-09-29T11:28:47,436][INFO ][stdout                   ] [bcd07465d161] Name: DefaultDispatcher-worker-2
    [2024-09-29T11:28:47,436][INFO ][stdout                   ] [bcd07465d161] State: TIMED_WAITING
    [2024-09-29T11:28:47,436][INFO ][stdout                   ] [bcd07465d161] Is Daemon: true
    [2024-09-29T11:28:47,436][INFO ][stdout                   ] [bcd07465d161] ----------------

    I've made a code change to the cluster start path to start the keepAliveThread (a non-daemon thread) first, to make sure the JVM won't exit.

And basically, a quick and simple workaround for you is to remove the observability plugin (maybe back it up first) and retry starting the cluster.
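If it's unclear whether the plugin was really dropped after moving the folder away, a node that comes up far enough to answer REST calls will list what it actually loaded. A minimal sketch, assuming default endpoints:

    GET _cat/plugins?v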

DumboJet commented 1 month ago

@zane-neo I have removed the folder opensearch-2.9.0\plugins\opensearch-observability and started OpenSearch. This time I still got the error, but it didn't exit and I got this interesting part logged in the console (related to relocation of data):

[2024-09-30T10:37:45,065][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,118][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,177][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,251][INFO ][o.o.p.PluginsService     ] [MYPC] PluginService:onIndexModule index:[.opensearch-sap-log-types-config/8NbxeSfdTKiAd5JHlqlktA]
[2024-09-30T10:37:45,273][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,314][INFO ][o.o.c.r.a.AllocationService] [MYPC] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.opensearch-sap-log-types-config][0]]]).
[2024-09-30T10:37:45,346][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:53,578][INFO ][o.o.m.c.MLSyncUpCron     ] [MYPC] ML configuration already initialized, no action needed
[2024-09-30T10:38:13,560][WARN ][o.o.c.r.a.DiskThresholdMonitor] [MYPC] high disk watermark [90%] exceeded on [CxtRxAvFSTetihtxZgKYvg][MYPC][C:\Users\User1\Desktop\opensearch-2.9.0\data\nodes\0] free: 34.3gb[7.4%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
[2024-09-30T10:38:43,547][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Performing move cluster state metadata.
[2024-09-30T10:38:43,548][INFO ][o.o.i.i.MetadataService  ] [MYPC] ISM config index not exist, so we cancel the metadata migration job.
[2024-09-30T10:39:43,550][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Cancel background move metadata process.
[2024-09-30T10:39:43,551][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Performing move cluster state metadata.
[2024-09-30T10:39:43,551][INFO ][o.o.i.i.MetadataService  ] [MYPC] Move metadata has finished.
[2024-09-30T10:42:43,297][INFO ][o.o.j.s.JobSweeper       ] [MYPC] Running full sweep
[2024-09-30T10:42:43,562][INFO ][o.o.i.i.PluginVersionSweepCoordinator] [MYPC] Canceling sweep ism plugin version job
[2024-09-30T10:47:43,302][INFO ][o.o.j.s.JobSweeper       ] [MYPC] Running full sweep
[2024-09-30T10:52:43,306][INFO ][o.o.j.s.JobSweeper       ] [MYPC] Running full sweep
[2024-09-30T10:57:43,322][INFO ][o.o.j.s.JobSweeper       ] [MYPC] Running full sweep

Restoring the plugin folder made the app exit again, but that might be because I had filled up my free disk space once more. I haven't managed to fully verify yet that OpenSearch works fine without the plugin, because for some reason my SSL settings (and also some other settings related to searching) were ignored when the plugin was not present, and the application that uses OpenSearch could not connect to it. But OpenSearch Dashboards did connect, and I could run queries.

I am not sure why the settings get ignored, but I guess having OpenSearch at least work is an improvement, both in terms of functionality and in terms of troubleshooting. Thanks! :)

zane-neo commented 1 month ago

@DumboJet From the logs, it seems your node's disk usage is still higher than the threshold, so shards will be relocated away. A simple fix is, once your cluster starts up, to change the disk threshold cluster setting cluster.routing.allocation.disk.watermark.flood_stage to a higher value, e.g. 99%, and then reinstall the plugin and restart the cluster. BTW, removing the plugin doesn't affect OpenSearch's basic functionality (read, write, search, etc.); the only impact is that you can't use the features provided by that plugin.
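For reference, a minimal sketch of that settings change, with example values only (raising the high watermark as well would also quiet the "shards will be relocated away" warning from the log above):

    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.disk.watermark.high": "97%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "99%"
      }
    }

Once the real disk pressure is resolved, these should be set back to their defaults (setting them to null restores the default values) so the safety mechanism keeps working.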