Closed soosinha closed 1 year ago
Able to reproduce this locally
curl -XPOST http://${LEADER}/fruit-1/_close
curl -u 'admin:admin' -XPUT "http://${LEADER}/fruit-1/_settings" -H 'Content-Type: application/json' -d \
'{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
}
}'
curl -XPUT "http://${LEADER}/fruit-1/_mapping?pretty" -H 'Content-type: application/json' \
-d '
{
"properties": {
"my_text": {
"type": "text",
"analyzer": "std_folded"
}
}
}'
curl -XPOST http://${LEADER}/fruit-1/_open
Then index documents and check the replication status on the follower:
curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"status" : "PAUSED",
"reason" : "AutoPaused: + [[fruit-1][0] - org.opensearch.OpenSearchException - \"analyzer [std_folded] has not been configured in mappings\"], ",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1"
}
Hi @soosinha, a user can update the static settings of the leader index only after closing it. When cross-cluster replication is set up for an index and the index is closed on the leader, any getChanges request that arrives while the index is closed puts the replication into the auto-paused state. For example:
curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"status" : "SYNCING",
"reason" : "User initiated",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1",
"syncing_details" : {
"leader_checkpoint" : 0,
"follower_checkpoint" : 0,
"seq_no" : 1
}
}
❯ curl -XPOST http://localhost:9200/fruit-1/_close
{"acknowledged":true,"shards_acknowledged":true,"indices":{"fruit-1":{"closed":true}}}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"error" : {
"root_cause" : [
{
"type" : "replication_exception",
"reason" : "failed to fetch replication status"
}
],
"type" : "replication_exception",
"reason" : "failed to fetch replication status"
},
"status" : 500
}
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"status" : "PAUSED",
"reason" : "AutoPaused: + [[fruit-1][0] - org.opensearch.indices.IndexClosedException - \"closed\"], ",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1"
}
If the user closes the index and then reopens it, a getChanges request may arrive while the index is closed and auto-pause the replication. However, it is also possible that the close and open happen so quickly that no getChanges request arrives between them, leaving the replication in the SYNCING state.
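Because the outcome depends on timing, a script that closes and reopens the leader index cannot assume which state the follower ends up in. A minimal sketch for checking it, assuming the follower address and credentials used in this thread; the helper names `replication_state` and `wait_for_state` are made up for illustration and are not part of the plugin:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; FOLLOWER and INDEX default to the values used in this thread.
FOLLOWER="${FOLLOWER:-http://localhost:9201}"
INDEX="${INDEX:-fruit-1}"

# Print the string-valued "status" field of a _status response body.
# Prints nothing for error responses, where "status" is a number such as 500.
replication_state() {
  printf '%s\n' "$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' |
    head -n 1
}

# Poll the follower until replication reports the wanted state
# or the attempts run out (default: 30 attempts, 2 seconds apart).
wait_for_state() {
  local want="$1" attempts="${2:-30}" i=0 body state=""
  while [ "$i" -lt "$attempts" ]; do
    body="$(curl -s -k -u 'admin:admin' \
      "${FOLLOWER}/_plugins/_replication/${INDEX}/_status")"
    state="$(replication_state "$body")"
    [ "$state" = "$want" ] && return 0
    i=$((i + 1))
    sleep 2
  done
  echo "replication state is '${state}', wanted '${want}'" >&2
  return 1
}
```

For example, `wait_for_state SYNCING || echo "replication did not reach SYNCING"` after reopening the leader index. The transient 500 response shown above is treated as "no state yet" and simply retried.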
If a user closes the leader index, updates the settings and mapping, and reopens the index in quick succession (the steps from the reproduction above), the replication auto-pauses with a different reason:
curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"status" : "PAUSED",
"reason" : "AutoPaused: + [[fruit-1][0] - org.opensearch.OpenSearchException - \"analyzer [std_folded] has not been configured in mappings\"], ",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1"
}
To avoid this, the user should pause the replication, update the index settings on the leader index, and then resume the replication. On resume, the leader index settings are replicated to the follower index.
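The pause, update, resume sequence can be sketched as a small script. This is only an illustration using the endpoints that appear in this thread; the function names, the `CURL` dry-run switch, and the default addresses are assumptions, not something the plugin provides:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the workaround; defaults match this thread's setup.
LEADER="${LEADER:-http://localhost:9200}"
FOLLOWER="${FOLLOWER:-http://localhost:9201}"
INDEX="${INDEX:-fruit-1}"
CURL="${CURL:-curl}"   # set CURL=echo for a dry run that only prints the calls

# call METHOD URL [BODY]: send one request and fail on HTTP errors (-f).
call() {
  local method="$1" url="$2" body="${3:-}"
  if [ -n "$body" ]; then
    "$CURL" -sS -f -k -u 'admin:admin' -X "$method" "$url" \
      -H 'Content-Type: application/json' -d "$body"
  else
    "$CURL" -sS -f -k -u 'admin:admin' -X "$method" "$url"
  fi
}

# update_static_settings SETTINGS_JSON MAPPING_JSON
# Pauses replication on the follower, applies the static settings and the
# mapping on the closed leader index, reopens it, then resumes replication.
# The chain stops at the first failing step.
update_static_settings() {
  local settings="$1" mapping="$2"
  call POST "${FOLLOWER}/_plugins/_replication/${INDEX}/_pause" '{}' &&
  call POST "${LEADER}/${INDEX}/_close" &&
  call PUT  "${LEADER}/${INDEX}/_settings" "$settings" &&
  call PUT  "${LEADER}/${INDEX}/_mapping" "$mapping" &&
  call POST "${LEADER}/${INDEX}/_open" &&
  call POST "${FOLLOWER}/_plugins/_replication/${INDEX}/_resume" '{}'
}
```

Called with the analyzer settings and mapping bodies from the reproduction, this performs the same six requests as the manual steps. If a step fails after the pause, replication stays paused and has to be resumed by hand.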
Testing details
When resume replication is triggered, new persistent tasks are spun up and the leader index settings are synced by IndexReplicationTask.
Adding testing details below:
{
"status" : "SYNCING",
"reason" : "User initiated",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1",
"syncing_details" : {
"leader_checkpoint" : 0,
"follower_checkpoint" : 0,
"seq_no" : 1
}
}
❯ chmod 777 pause_resume.sh
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_pause" -H 'Content-Type: application/json' -d '{}'
{"acknowledged":true}%
❯ curl -XPOST http://localhost:9200/fruit-1/_close
{"acknowledged":true,"shards_acknowledged":true,"indices":{"fruit-1":{"closed":true}}}%
❯ curl -u 'admin:admin' -XPUT "http://localhost:9200/fruit-1/_settings" -H 'Content-Type: application/json' -d \
'{
"settings": {
"analysis": {
"analyzer": {
"std_folded": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase"
]
}
}
}
}
}'
curl -XPUT "http://localhost:9200/fruit-1/_mapping?pretty" -H 'Content-type: application/json' \
-d '
{
"properties": {
"my_text": {
"type": "text",
"analyzer": "std_folded"
}
}
}'
{"acknowledged":true}{
"acknowledged" : true
}
❯ curl -XPOST http://localhost:9200/fruit-1/_open
{"acknowledged":true,"shards_acknowledged":true}%
❯ curl -XPOST "http://localhost:9200/fruit-1/_doc/99" -H 'Content-Type: application/json' -d '{"value" : "data99", "my_text": "monu singh"}'
{"_index":"fruit-1","_id":"99","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":1,"_primary_term":3}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_resume" -H 'Content-Type: application/json' -d '{}'
{"acknowledged":true}%
❯ curl -XPOST "http://localhost:9200/fruit-1/_doc/98" -H 'Content-Type: application/json' -d '{"value" : "data98", "my_text": "monu singh"}'
{"_index":"fruit-1","_id":"98","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":2,"_primary_term":3}%
❯ curl -k -u 'admin:admin' "http://localhost:9201/_plugins/_replication/fruit-1/_status?pretty"
{
"status" : "SYNCING",
"reason" : "User initiated",
"leader_alias" : "leader-cluster",
"leader_index" : "fruit-1",
"follower_index" : "fruit-1",
"syncing_details" : {
"leader_checkpoint" : 2,
"follower_checkpoint" : 2,
"seq_no" : 3
}
}
❯ curl "localhost:9201/fruit-1/_settings?include_defaults=true"
{"fruit-1":{"settings":{"index":{"replication":{"type":"DOCUMENT"},"number_of_shards":"1","translog":{"generation_threshold_size":"32mb"},"plugins":{"replication":{"follower":{"leader_index":"leader-cluster:fruit-1"}}},"provided_name":"fruit-1","creation_date":"1692334252284","analysis":{"analyzer":{"std_folded":{"filter":["lowercase"],"type":"custom","tokenizer":"standard"}}},"number_of_replicas":"1","uuid":"8SlRCbGTQ5mfaq_YAcgq2A","version":{"created":"137217827"}}},"defaults":{"index":{"flush_after_merge":"512mb","plugins":{"replication":{"translog":{"retention_size":"536870912b","retention_lease":{"pruning":{"enabled":"false"}}}}},"final_pipeline":"_none","max_inner_result_window":"100","unassigned":{"node_left":{"delayed_timeout":"1m"}},"max_terms_count":"65536","routing_partition_size":"1","force_memory_term_dictionary":"false","max_docvalue_fields_search":"100","merge":{"scheduler":{"max_thread_count":"4","auto_throttle":"true","max_merge_count":"9"},"policy":{"reclaim_deletes_weight":"2.0","floor_segment":"2097152b","max_merge_at_once":"10","max_merged_segment":"5368709120b","expunge_deletes_allowed":"10.0","segments_per_tier":"10.0","deletes_pct_allowed":"20.0"}},"max_refresh_listeners":"1000","max_regex_length":"1000","load_fixed_bitset_filters_eagerly":"true","number_of_routing_shards":"1","write":{"wait_for_active_shards":"1"},"verified_before_close":"false","mapping":{"coerce":"false","nested_fields":{"limit":"50"},"depth":{"limit":"20"},"field_name_length":{"limit":"9223372036854775807"},"total_fields":{"limit":"1000"},"nested_objects":{"limit":"10000"},"ignore_malformed":"false"},"soft_deletes":{"enabled":"true","retention":{"operations":"0"},"retention_lease":{"period":"12h"}},"max_script_fields":"32","query":{"default_field":["*"],"parse":{"allow_unmapped_fields":"true"}},"format":"0","sort":{"missing":[],"mode":[],"field":[],"order":[]},"priority":"1","codec":"default","max_rescore_window":"10000","max_adjacency_matrix_filters":"100","analyze":{"max_token_count":"10000"},"gc_deletes":"60s","searchable_snapshot":{"repository":"","index":{"id":""},"snapshot_id":{"name":"","uuid":""}},"optimize_auto_generated_id":"true","max_ngram_diff":"1","hidden":"false","translog":{"flush_threshold_size":"512mb","sync_interval":"5s","retention":{"size":"-1","age":"-1"},"durability":"REQUEST"},"auto_expand_replicas":"false","mapper":{"dynamic":"true"},"recovery":{"type":""},"requests":{"cache":{"enable":"true"}},"data_path":"","merge_on_flush":{"enabled":"true","max_full_flush_merge_wait_time":"10s","policy":"default"},"highlight":{"max_analyzed_offset":"1000000"},"routing":{"rebalance":{"enable":"all"},"allocation":{"enable":"all","total_shards_per_node":"-1"}},"search":{"slowlog":{"level":"TRACE","threshold":{"fetch":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"},"query":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"}}},"default_pipeline":"_none","idle":{"after":"30s"},"throttled":"false"},"fielddata":{"cache":"node"},"codec.compression_level":"3","default_pipeline":"_none","max_slices_per_scroll":"1024","shard":{"check_on_startup":"false"},"max_slices_per_pit":"1024","allocation":{"max_retries":"5","existing_shards_allocator":"gateway_allocator"},"refresh_interval":"1s","indexing":{"slowlog":{"reformat":"true","threshold":{"index":{"warn":"-1","trace":"-1","debug":"-1","info":"-1"}},"source":"1000","level":"TRACE"}},"remote_store":{"translog":{"buffer_interval":"650ms"}},"compound_format":"0.1","blocks":{"metadata":"false","read":"false","read_only_allow_delete":"false","read_only":"false","write":"false"},"max_result_window":"10000","store":{"hybrid":{"mmap":{"extensions":["nvd","dvd","tim","tip","dim","kdd","kdi","cfs","doc"]},"nio":{"extensions":["segments_N","write.lock","si","cfe","fnm","fdx","fdt","pos","pay","nvm","dvm","tvx","tvd","liv","dii","vec","vem"]}},"stats_refresh_interval":"10s","type":"","fs":{"fs_lock":"native"},"preload":[]},"queries":{"cache":{"enabled":"true"}},"warmer":{"enabled":"true"},"max_shingle_diff":"3","query_string":{"lenient":"false"}}}}}%
As the last output shows, replication is in the SYNCING state and the std_folded analyzer defined on the leader index is now synced to the follower index.
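Scanning the full `include_defaults=true` dump for one analyzer is tedious. A narrower check is sketched below; it assumes the get-settings API accepts a setting-name pattern in the path and the `flat_settings=true` query parameter (worth confirming against your OpenSearch version), and the helper names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; FOLLOWER and INDEX default to the values used in this thread.
FOLLOWER="${FOLLOWER:-http://localhost:9201}"
INDEX="${INDEX:-fruit-1}"

# Fetch only the analysis settings of the follower index, flattened to dotted
# keys such as "index.analysis.analyzer.std_folded.type".
follower_analysis_settings() {
  curl -s -k -u 'admin:admin' \
    "${FOLLOWER}/${INDEX}/_settings/index.analysis.*?flat_settings=true"
}

# has_analyzer FLAT_SETTINGS_JSON ANALYZER_NAME
# Succeeds if the flattened settings contain any key for the named analyzer.
has_analyzer() {
  printf '%s' "$1" | grep -qF "\"index.analysis.analyzer.$2."
}
```

Usage: `has_analyzer "$(follower_analysis_settings)" std_folded && echo "analyzer present on follower"`.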
Thanks @monusingh-1 for working on this and verifying the behavior. This would have been a bug only if the analyzer settings were dynamic. Since the index has to be closed before the analyzer settings can be updated, auto-pause of replication is the expected behavior.
What is the bug? When a customer adds new mappings to the leader index and these mappings depend on analyzers that are newly defined in the settings, the replay fails on the follower side. As per this logic, the follower tries to apply the operations directly. If the operations need a mapping update, it then tries to sync the remote mapping. But syncing the remote mapping fails if the settings have not yet been synced by the metadata polling task, which runs every 1 minute.
How can one reproduce the bug? Steps to reproduce the behavior:
What is the expected behavior? The replication should work successfully by syncing all the settings and mappings.
Do you have any additional context? This problem can be solved by syncing the remote settings before syncing the mappings here