Open: DumboJetEngine opened this issue 4 months ago
This is likely not a bug, but it definitely feels like one. @DumboJetEngine maybe try to dig into which exact metrics caused TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block? It seems like you have enough disk space, but maybe you can dig through the logs to find the exact numbers and figure out why OpenSearch thinks otherwise?
@dblock Any idea which log file to look into? I have already given you the complete opensearch.log file, because I thought it was the most relevant one. In there, it says: using [1] data paths, mounts [[Expansion (D:)]], net usable_space [5.5tb], net total_space [7.2tb], types [NTFS].
This is from the external drive I tested it on.
I don't know if it is possible that it remembers the fact that it had less free space during an older run.
I really don't know. I would read the code from where the error is thrown, work backwards to figure out which metrics it uses, modify the code to log those values (which seem to be missing), then look at the numbers of those metrics. It's quite a bit of work :) I am sure there's a rational explanation for why this error happens.
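If you can get a node to stay up even briefly, the cat allocation and node stats APIs should show the disk numbers OpenSearch itself is computing, which you could then compare against what Windows reports:

GET _cat/allocation?v
GET _nodes/stats/fs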
@dblock BTW, here is my config file, just in case I am doing something stupid that can cause this:
######## Start OpenSearch Security Demo Configuration ########
# WARNING: revise all the lines below before you go into production
plugins.security.ssl.transport.enabled: true
plugins.security.ssl.transport.pemcert_filepath: .\_certs\server_app.myapp.local.crt
plugins.security.ssl.transport.pemkey_filepath: .\_certs\server_app.myapp.local.key
plugins.security.ssl.transport.pemtrustedcas_filepath: .\_certs\app.myapp.local_CA.pem
plugins.security.ssl.transport.enabled_protocols:
- "TLSv1.2"
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: .\_certs\server_app.myapp.local.crt
plugins.security.ssl.http.pemkey_filepath: .\_certs\server_app.myapp.local.key
plugins.security.ssl.http.pemtrustedcas_filepath: .\_certs\app.myapp.local_CA.pem
plugins.security.ssl.http.enabled_protocols:
- "TLSv1.2"
plugins.security.allow_unsafe_democertificates: true
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
- CN=,O=,L=Greece,ST=Greece,C=GR
plugins.security.disabled: false
plugins.security.nodes_dn:
- 'CN=opensearch.myapp.local'
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models"]
node.max_local_storage_nodes: 3
######## End OpenSearch Security Demo Configuration ########
network.host: 127.0.0.1
http.port: 9200
indices.query.bool.max_clause_count: 50000
The weird part is that it used to work fine for quite some time, and I can't figure out what has changed since then to break it. The amount of free storage is definitely one possibility, but that is restored now.
@dblock If it would be of any help, I could probably upload the entire OpenSearch folder somewhere, together with the node data, for you to test locally (if you want to do so). I see it is 3GB uncompressed; compressing it brings it down to 1.9GB.
Or, I could patch a JAR file, if you decide to modify the 2.9.0 code, to get more log information.
Appreciate your help. I hope someone can dig into this.
@DumboJetEngine @dblock I looked into this and I think this is a bug. When the disk space on all nodes falls below the threshold, all the indices are marked with READ_ONLY_ALLOW_DELETE blocks, and these blocks also get persisted in the cluster state metadata.
When the cluster spins up, the cluster state is recovered from disk and applied to the cluster. When the plugins then create their system indices, the create-index transport action first checks whether the index has any block, and if it does, index creation fails with an error like the one in the posted log.
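On a cluster that is reachable, that block can normally be cleared (once enough disk space is available again) by resetting the index setting, e.g.:

PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}

The problem here is that the node never finishes starting, so this API isn't available.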
An easier way to reproduce this issue is to spin up a small cluster or a single node and follow the steps below:
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "50%",
    "cluster.routing.allocation.disk.watermark.high": "50%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "50%"
  }
}
Once the watermark is exceeded and the blocks are applied, restarting the node fails with:
blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];
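While the cluster is still responding, the persisted block can be seen on the affected indices through the index settings API, for example:

GET /_all/_settings/index.blocks.read_only_allow_delete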
I have created this draft PR to fix this: https://github.com/opensearch-project/OpenSearch/pull/15258. @DumboJetEngine, if you would like, I can post an OpenSearch zip built with the changes, and you can try it with your data to see if this fixes the issue on your side.
Thanks.
Any news about this topic? I have the same issue on opensearch 2.11.1
@zane-neo Sorry for the late response. I can try out your ZIP. I hope it will be compatible with the data files v2.9.0 uses.
@DumboJetEngine @danieltodea19 I investigated further and can reproduce the issue locally. After more testing, I found that the initial investigation was not correct. The actual root cause is:
[2024-09-29T11:28:47,433][INFO ][stdout ] [bcd07465d161] ----------------
[2024-09-29T11:28:47,434][INFO ][stdout ] [bcd07465d161] Thread ID: 77
[2024-09-29T11:28:47,434][INFO ][stdout ] [bcd07465d161] Name: opensearch[bcd07465d161][fetch_shard_started][T#4]
[2024-09-29T11:28:47,434][INFO ][stdout ] [bcd07465d161] State: TIMED_WAITING
[2024-09-29T11:28:47,434][INFO ][stdout ] [bcd07465d161] Is Daemon: true
[2024-09-29T11:28:47,434][INFO ][stdout ] [bcd07465d161] ----------------
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Thread ID: 78
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Name: opensearch[bcd07465d161][fetch_shard_started][T#5]
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] State: TIMED_WAITING
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Is Daemon: true
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] ----------------
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Thread ID: 79
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Name: DefaultDispatcher-worker-1
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] State: RUNNABLE
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] Is Daemon: true
[2024-09-29T11:28:47,435][INFO ][stdout ] [bcd07465d161] ----------------
[2024-09-29T11:28:47,436][INFO ][stdout ] [bcd07465d161] Thread ID: 80
[2024-09-29T11:28:47,436][INFO ][stdout ] [bcd07465d161] Name: DefaultDispatcher-worker-2
[2024-09-29T11:28:47,436][INFO ][stdout ] [bcd07465d161] State: TIMED_WAITING
[2024-09-29T11:28:47,436][INFO ][stdout ] [bcd07465d161] Is Daemon: true
[2024-09-29T11:28:47,436][INFO ][stdout ] [bcd07465d161] ----------------
As the thread dump above shows, all the remaining threads are daemon threads, so nothing keeps the JVM alive and it exits. I've made a code change in the cluster start path to first start the keepAliveThread, which is a non-daemon thread, to make sure the JVM won't exit.
A quick and simple workaround for you is to remove (maybe back up first) the observability plugin and retry starting the cluster.
@zane-neo I have removed the folder opensearch-2.9.0\plugins\opensearch-observability and started OpenSearch. This time I still got the error, but it didn't exit, and I got this interesting part logged in the console (related to relocation of data):
[2024-09-30T10:37:45,065][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,118][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,177][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,217][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,251][INFO ][o.o.p.PluginsService ] [MYPC] PluginService:onIndexModule index:[.opensearch-sap-log-types-config/8NbxeSfdTKiAd5JHlqlktA]
[2024-09-30T10:37:45,273][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:45,314][INFO ][o.o.c.r.a.AllocationService] [MYPC] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[.opensearch-sap-log-types-config][0]]]).
[2024-09-30T10:37:45,346][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [MYPC] Detected cluster change event for destination migration
[2024-09-30T10:37:53,578][INFO ][o.o.m.c.MLSyncUpCron ] [MYPC] ML configuration already initialized, no action needed
[2024-09-30T10:38:13,560][WARN ][o.o.c.r.a.DiskThresholdMonitor] [MYPC] high disk watermark [90%] exceeded on [CxtRxAvFSTetihtxZgKYvg][MYPC][C:\Users\User1\Desktop\opensearch-2.9.0\data\nodes\0] free: 34.3gb[7.4%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete
[2024-09-30T10:38:43,547][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Performing move cluster state metadata.
[2024-09-30T10:38:43,548][INFO ][o.o.i.i.MetadataService ] [MYPC] ISM config index not exist, so we cancel the metadata migration job.
[2024-09-30T10:39:43,550][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Cancel background move metadata process.
[2024-09-30T10:39:43,551][INFO ][o.o.i.i.ManagedIndexCoordinator] [MYPC] Performing move cluster state metadata.
[2024-09-30T10:39:43,551][INFO ][o.o.i.i.MetadataService ] [MYPC] Move metadata has finished.
[2024-09-30T10:42:43,297][INFO ][o.o.j.s.JobSweeper ] [MYPC] Running full sweep
[2024-09-30T10:42:43,562][INFO ][o.o.i.i.PluginVersionSweepCoordinator] [MYPC] Canceling sweep ism plugin version job
[2024-09-30T10:47:43,302][INFO ][o.o.j.s.JobSweeper ] [MYPC] Running full sweep
[2024-09-30T10:52:43,306][INFO ][o.o.j.s.JobSweeper ] [MYPC] Running full sweep
[2024-09-30T10:57:43,322][INFO ][o.o.j.s.JobSweeper ] [MYPC] Running full sweep
Restoring the plugin folder continued to make the app exit, but that might be because I had filled my free disk space once again. I haven't managed to fully verify yet that OpenSearch is working fine without the plugin, because for some reason my SSL settings (and also some other settings related to searching) got ignored when the plugin was not present, and the application that was using OpenSearch could not connect to it. But OpenSearch Dashboards did connect to it and I could run queries.
I am not sure why the settings get ignored, but I guess having OpenSearch at least work is an improvement, both in terms of functionality and in terms of troubleshooting. Thanks! :)
@DumboJetEngine From the logs, it seems your node's disk usage is still higher than the threshold, so shards will be relocated away. A simple fix is, once your cluster starts up, to change the disk threshold cluster setting cluster.routing.allocation.disk.watermark.flood_stage
to a higher value, e.g. 99%, then reinstall the plugin and restart the cluster. BTW, removing the plugin doesn't affect OpenSearch's basic functionality (read, write, search, etc.); the only impact is that you can't use the features provided by that plugin.
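For example, something like this should raise the flood-stage threshold once the cluster is up (adjust the percentage to your environment):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "99%"
  }
}

Setting the value back to null later restores the default (95%).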
Describe the bug
After indexing some documents, my OpenSearch (v2.9.0) is not starting. In the opensearch.log file, I get the TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block error. I have emptied my disk a bit (Win11 system) and I now have 64GB free, but the error didn't go away.
I've found some related Elasticsearch questions, but I don't think I can use any of the solutions, given that I can't even start OpenSearch: https://stackoverflow.com/questions/48155774/elasticsearch-read-only-allow-delete-auto-setting
BTW, my free disk space was at 16%. For Elasticsearch, the link I provided says that 15% is required to be free.
Then I freed up more space and my disk had 25% free. Nothing changed with OpenSearch...
Then I moved the OpenSearch folder to an external disk with 75% free space. Again, nothing changed with OpenSearch... This is insane! What is wrong with it?
Here are the complete logs: https://gist.github.com/DumboJetEngine/8cdeaf9159e9af14831a352b1f5a3128
This is the second time I've hit this problem. I start a new installation from scratch, reindex my documents, and the next day when I start OpenSearch I get this issue.
There is also another error, which is apparently related to the security plugin, according to a link I found.
Related component
Other
To Reproduce
I guess, indexing 2.5 million documents with OpenSearch 2.9.0.
Expected behavior
OpenSearch starts up.
Additional Details
Plugins
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-alerting]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-anomaly-detection]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-asynchronous-search]
[2024-07-17T13:57:57,386][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-cross-cluster-replication]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-geospatial]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-index-management]
[2024-07-17T13:57:57,388][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-job-scheduler]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-knn]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-ml]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-neural-search]
[2024-07-17T13:57:57,389][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-notifications]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-notifications-core]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-observability]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-reports-scheduler]
[2024-07-17T13:57:57,390][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-security]
[2024-07-17T13:57:57,391][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-security-analytics]
[2024-07-17T13:57:57,391][INFO ][o.o.p.PluginsService ] [PC123] loaded plugin [opensearch-sql]
Host/Environment (please complete the following information):