Closed sandervandegeijn closed 1 year ago
Hi @ict-one-nl, I'm blocked from downloading that video with this error:
Why have I been blocked?
This website is using a security service to protect itself from online attacks. The action you just performed triggered the security solution. There are several actions that could trigger this block including submitting a certain word or phrase, a SQL command or malformed data.
Taking a wild guess since I can't see the error, but the parameter value you need to pass to the restore API is remote_snapshot, not remote_storage.
Hi, I'm using Cloudflare to block traffic from outside the EU, but that's not viable here, so I removed the WAF rule. You should be able to download the video now.
Your comment is valid, but I made a typo in the report (not in the API calls) :)
Side effect: the index was marked red, which turned the cluster state red and stopped data ingest entirely. Removed the searchable snapshot just now: cluster green and everything is okay again.
Okay, got one step further, at node boot:
[2023-08-07T14:29:35,537][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-search-nodes-1] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: SettingsException[Unable to initialize the search-data node: Missing value for configuration node.search.cache.size]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.9.0.jar:2.9.0]
    at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.9.0.jar:2.9.0]
Caused by: org.opensearch.common.settings.SettingsException: Unable to initialize the search-data node: Missing value for configuration node.search.cache.size
    at org.opensearch.node.Node.initializeFileCache(Node.java:1745) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.node.Node.<init>(Node.java:632) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.node.Node.<init>(Node.java:389) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-2.9.0.jar:2.9.0]
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180) ~[opensearch-2.9.0.jar:2.9.0]
    ... 6 more
SettingsException[Unable to initialize the search-data node: Missing value for configuration node.search.cache.size]
    at org.opensearch.node.Node.initializeFileCache(Node.java:1745)
    at org.opensearch.node.Node.<init>(Node.java:632)
    at org.opensearch.node.Node.<init>(Node.java:389)
    at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242)
    at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242)
    at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404)
    at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:180)
    at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171)
    at org.opensearch.cli.Command.main(Command.java:101)
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137)
    at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103)
For complete error details, refer to the log at /usr/share/opensearch/logs/siem.log
Set this to 5000mb and the error disappears. I find this a bit cumbersome:
Yet, still:
@ict-one-nl Thank you. Definitely a bug here. The long overflow is coming from this line. Given that file sizes approaching Long.MAX_VALUE are implausibly large, I suspect we're encountering an exception and hitting this fallback path. It is unfortunate there is no debug logging there. Any chance you can enable assertions in your JVM (by passing -ea as a command-line parameter when launching the Java process) on the off chance it will hit one of those assert statements and we can get some more info here?
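If the cluster runs from the official Docker image, one way to pass the flag is through the OPENSEARCH_JAVA_OPTS environment variable, which the image's entrypoint appends to the JVM options. A sketch (single-node flags omitted; adjust to your deployment):

```
docker run -e OPENSEARCH_JAVA_OPTS="-ea" opensearchproject/opensearch:2.9.0
```

For a Kubernetes/Helm deployment the same variable can be set in the pod's `env` section.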
Alternatively, or in addition, is there any chance you could test this with OpenSearch 2.8? There were some changes in this code related to the upgrade to AWS SDK v2 that could be relevant, but those were only made in version 2.9.
Thanks, sure I'll try to provide some more details. Give me a day or so.
What do you think about the cache size setting? Should that be a separate issue?
Behavior on 2.8.0 is definitely different from 2.9.0. I'm able to restore the snapshot:
And it searches:
But index state is yellow and thus the cluster state is yellow as well. I'm guessing because it has 1 primary shard and 1 replica (the default), but it can't assign the replica. Is this default correct?
Update: Added a second search node: cluster state green. Should this be necessary and shouldn't this be documented? Will look into 2.9.0 now.
Update: created a small pull request on the docs to address the points I ran into.
Okay, I do have a debug trace of trying to restore a remote snapshot on 2.9.0. Strangely, it did seem to work once: on the cluster I upgraded from 2.8.0 to 2.9.0, a remote snapshot taken on 2.8.0 restored correctly. I'll have to see if I can reproduce it, but the amount of logging and the performance impact make that difficult.
Unfortunately the logs are quite big, we're talking several hundred megabytes in a matter of minutes when I put the whole node logging to debug. It also makes the cluster almost unresponsive.
status = error
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] [%node_name]%marker %m%n
rootLogger.level = error
rootLogger.appenderRef.console.ref = console
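One way to keep the volume down is to raise the log level for a single package at runtime instead of the whole root logger. A sketch via the cluster settings API (the logger name is an assumption; pick the package closest to the failing code):

```
PUT _cluster/settings
{
  "transient": {
    "logger.org.opensearch.repositories.s3": "TRACE"
  }
}
```

Setting it back to `null` removes the override again.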
I'm hesitant to share the logs publicly; I don't want to expose infrastructure details on a public forum. I would be open to sharing them privately with the dev team. Or I might be able to share more openly if the volume is smaller and I'm able to scrub details from the logs.
@ict-one-nl Thanks for working through this! Are you still able to reproduce the error? If so, I'd like to see the results of enabling assertions. My suspicion is that it is reaching this line, and there actually aren't any debug logging statements there. However, if you enable assertions, the JVM will crash when that assertion is hit and should print the details to stdout or stderr.
Regarding the cache size and replica behavior, you are absolutely right that we need better documentation. To quickly answer your questions:
- If a node has only the search role, then a default cache size will be chosen. Otherwise, you need to define a portion of the disk to use for this cache, which will be unavailable for use by normal "local" indexes.
- Restored remote indexes get one replica by default, so you need at least two search nodes to allocate them, as you have discovered. However, for many cases a replica may not be necessary, and this setting can be changed at restore time.

I'm consistently getting the error when using 2.9.0 for snapshot + restore. It works consistently on 2.8.0. Strangely enough, a snapshot taken on 2.8.0 restored correctly as remote_snapshot on 2.9.0. While typing this comment and trying to enable assertions we hit a production problem on our Kubernetes cluster; I'll come back to that later.
I'm hoping we can solve this before the 2.10 release; otherwise I'll be stuck with this error for at least two more months :) The explanation you've given in this comment would be very useful in the docs, even if it's just a copy/paste from this ticket :) This info is missing from the docs at this time.
Update: you are right about the default setting btw, I triggered the error when I added the ingest role to the search nodes for a test.
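For reference, a minimal sketch of the opensearch.yml settings involved (the size value is illustrative, matching the workaround above):

```yaml
# Dedicated search node: a default cache size is chosen automatically
node.roles: [ search ]

# Mixed-role node (e.g. search + ingest): the cache size must be set
# explicitly, or the node fails to start with the SettingsException above
# node.roles: [ search, ingest ]
# node.search.cache.size: 5000mb
```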
Okay another round of testing, another result...
Situation:
POST _snapshot/s3-archive/testindex/_restore
{
"indices": "testindex",
"storage_type" : "remote_snapshot"
}
Cluster state yellow, index state yellow (recovery; different from before, then it was red).
GET _cluster/allocation/explain?pretty
{
}
{
"index": "testindex",
"shard": 0,
"primary": false,
"current_state": "unassigned",
"unassigned_info": {
"reason": "NEW_INDEX_RESTORED",
"at": "2023-08-09T17:47:23.947Z",
"details": "restore_source[s3-archive/testindex]",
"last_allocation_status": "no_attempt"
},
"can_allocate": "no",
"allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions": [
{
"node_id": "16KaBF9FSx6CV9YqqcqRew",
"node_name": "opensearch-search-nodes-1",
"transport_address": "10.244.20.234:9300",
"node_attributes": {
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "replica_after_primary_active",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node [[testindex][0], node[16KaBF9FSx6CV9YqqcqRew], [P], recovery_source[snapshot recovery [PGVR1bqrTOe683wkzxCPjQ] from s3-archive:testindex/JJq6xAsKQvGTm_q8u3H9Rg], s[INITIALIZING], a[id=DEE-ADOzTpahNJuqqP-Gug], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2023-08-09T17:47:23.947Z], delayed=false, details[restore_source[s3-archive/testindex]], allocation_status[fetching_shard_data]]]"
},
{
"decider": "throttling",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
}
]
},
{
"node_id": "1bTLmPKITraN6XfFZNznPA",
"node_name": "opensearch-search-nodes-0",
"transport_address": "10.244.39.13:9300",
"node_attributes": {
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "replica_after_primary_active",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "throttling",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
}
]
},
{
"node_id": "FggOXN-BRuKZatmU4kI-sw",
"node_name": "opensearch-data-nodes-hot-1",
"transport_address": "10.244.17.100:9300",
"node_attributes": {
"temp": "hot",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "replica_after_primary_active",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "throttling",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "target_pool",
"decision": "NO",
"explanation": "Routing pools are incompatible. Shard pool: [REMOTE_CAPABLE], Node Pool: [LOCAL_ONLY]"
}
]
},
{
"node_id": "e7pkrWmwRdCs21GnDupJxA",
"node_name": "opensearch-data-nodes-hot-2",
"transport_address": "10.244.45.149:9300",
"node_attributes": {
"temp": "hot",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "replica_after_primary_active",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "throttling",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "target_pool",
"decision": "NO",
"explanation": "Routing pools are incompatible. Shard pool: [REMOTE_CAPABLE], Node Pool: [LOCAL_ONLY]"
}
]
},
{
"node_id": "hgEZ0haURGiRajhrWawHFw",
"node_name": "opensearch-data-nodes-hot-0",
"transport_address": "10.244.46.163:9300",
"node_attributes": {
"temp": "hot",
"shard_indexing_pressure_enabled": "true"
},
"node_decision": "no",
"deciders": [
{
"decider": "replica_after_primary_active",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "throttling",
"decision": "NO",
"explanation": "primary shard for this replica is not yet active"
},
{
"decider": "target_pool",
"decision": "NO",
"explanation": "Routing pools are incompatible. Shard pool: [REMOTE_CAPABLE], Node Pool: [LOCAL_ONLY]"
}
]
}
]
}
Cluster config: 3 master nodes, 3 data nodes, 2 nodes with search + ingest roles.
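To double-check which nodes carry which roles (and therefore which routing pool a remote shard can land in), the cat nodes API is handy; a sketch:

```
GET _cat/nodes?v&h=name,node.role
```

Only nodes listing the search role belong to the REMOTE_CAPABLE pool referenced in the allocation explain output above.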
Will ask the storage guys to drop everything from the S3 container tomorrow to get a totally fresh environment for further testing.
Okay, I need more time to get consistent scenarios. I do know 2.8.0 works like a charm, but I can't pinpoint the problem in 2.9.0, though I'm quite sure something is amiss. At this time we're not sure if it's:
Unless I'm mistaken, there is no functional or technical need for the new SDK; it's more a matter of preventing technical debt. That is a valid concern, but not really a time-sensitive one. Would it be an idea to:
This would buy us some time and prevent a potentially confusing point release for users (Do I need to upgrade? Why is there a point release? Is there a security issue?). If there is an interoperability problem between the S3 appliance and the new SDK, it's likely more people will report it to the SDK devs so fixes can be made.
In the meantime I'm more than willing to test further and help with debugging. I could look into exposing our S3 to the OpenSearch devs (but I need to talk to our storage team for that; security policies and such). That would make debugging the issue much easier, I suppose.
The SDK was upgraded as a prerequisite for performance improvements in repository-s3, using features only available in the newer version. Reverting the change will be a lot of work, so let's only do it if we're 100% confident we can't fix a problem while moving forward? I suggest we debug the problem first. Btw, there's another bug report that's potentially a side effect of the upgrade, https://github.com/opensearch-project/OpenSearch/issues/9265.
@raghuvanshraj any ideas?
Ah didn't know that, thank you. What can I do to help? Are there new requirements for the S3 endpoint? Can I do some low level testing with a command line client based on the new SDK to rule out opensearch itself and pinpoint the problem?
I think @andrross correctly called out where we have a problem (https://github.com/opensearch-project/OpenSearch/issues/9125#issuecomment-1668829058), but the only helpful thing to do would be to continue to narrow this down to the smallest repro possible.
Short of that, if we can't get an independent repro, and since you can reproduce fairly consistently, are you able to build from source (see the dev guide)? I would check out the 2.9 branch and add more debug logging around that error to find out what caused it.
Will try. I tried running the docker image with the -ea option in the OPENSEARCH_JAVA_OPTS env var, but I didn't hit it; I'm not sure the -ea flag was picked up properly, though. Running OpenSearch on my Mac is somewhat troublesome, so I need to revive my Windows laptop. Will try to look into it ASAP.
Update: assertions did work, but did not get hit. Created a totally fresh cluster and tested again; the result is a different error than in the first post. I made a full capture of the process with logging from all nodes. Contacted Andrew on Slack to discuss whether I can share the capture with the dev team under TLP:AMBER.
With Andrew talking me through it and giving me some pointers, plus the help of our storage team, I think I finally nailed down the problem. A hard one to diagnose, as the exceptions OpenSearch threw didn't set me on the right path.
The culprit seems to be this setting on the S3 bucket: --object-lock-enabled-for-bucket | --no-object-lock-enabled-for-bucket (boolean)
Correct use: aws s3api create-bucket --bucket BUCKETNAME --object-lock-enabled-for-bucket
We had an inconsistency between different buckets; this was the one difference we found. Enabled it, and now it's working on the cluster that was consistently malfunctioning. Many thanks to our storage team and Andrew for their patience.
I suppose the docs could be improved to note that this is required for searchable snapshots to function, and the exception handling could be improved so it doesn't throw these odd exceptions. What do you think?
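To check whether an existing bucket already has object lock enabled, the standard AWS CLI offers a lookup (an S3 appliance's output may differ from AWS's):

```
aws s3api get-object-lock-configuration --bucket BUCKETNAME
```

A bucket without object lock returns an ObjectLockConfigurationNotFoundError, which makes spotting the inconsistency between buckets straightforward.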
Edit: test scenario. Prerequisite: the data snapshot repository exists:
DELETE testindex
POST testindex/_doc
{
"data" : "test"
}
GET testindex/_search
{
"query": {
"match_all": {}
}
}
PUT /_snapshot/data/testindex
{
"indices": "testindex"
}
DELETE testindex
POST _snapshot/data/testindex/_restore
{
"indices": "testindex",
"storage_type" : "remote_snapshot"
}
GET _cluster/health
GET testindex/_search
{
"query": {
"match_all": {}
}
}
GET testindex/_settings
{
}
DELETE testindex
This has been fixed. The 2.9.1 patch release did not happen but this fix will be included in the upcoming 2.10 release.
@andrross, @bbarani do we know the blockers or overhead in doing patch release. Ideally all bug fixes should make it to a patch release. What happens to users who have upgraded to 2.9(we don't support downgrades natively). Does that mean users are supposed to stay stuck with this bug till we get to 2.10
@Bukhtawar @bbarani There are no blockers to doing a patch release. Ultimately it was decided to move forward with the 2.10 release in lieu of a patch release. You can track the conversation here.
Good news is I've been able to confirm the fix :)
Describe the bug I'm trying to use the searchable snapshots feature, but restoring the snapshot as remote_snapshot fails. A picture is worth a thousand words, so I made a video detailing the steps and effects:
https://download.ict-one.nl/searchable_snapshots.mp4
Config:
Exception:
To Reproduce Steps to reproduce the behavior: See video
Expected behavior Snapshot restored and available for searching
Plugins s3 repository plugin
Host/Environment: Default OpenSearch 2.9.0 docker image with the S3 plugin enabled. S3 storage is an S3 appliance, but since a normal restore works I don't suspect that's the issue?