Looking into it
Hey, I'm not sure if it's related, but I have been running a segment replication experiment on an OpenSearch 2.4.1 cluster with 9 nodes:
OpenSearchException[Segment Replication failed]; nested: IllegalStateException[Shard [opensearch_segment_replication_test_index-2023-01-26][40] has local copies of segments that differ from the primary]
I started facing it after adding 6 nodes to the cluster (15 in total) and restarting the OpenSearch processes on all nodes to add some static configuration. Since then the cluster has been performing relocations and, I suppose, failing due to this error. All of this was done under production load.
Thank you @mhoffm-aiven for sharing this issue. This is really important for making the feature more robust under production workloads. So, thank you again!
In order to reproduce this issue, can you please share the cluster configuration, index settings, type of operations (index/update/delete), and workload size?
Hey @dreamer-89,
sorry, I already deleted the cluster. From memory though, the only cluster configuration that differed from our defaults was
cluster.routing.allocation.cluster_concurrent_rebalance = 16
The indexes were created using this template:
PUT /_index_template/segment_replication_test
{
  "index_patterns": [
    "opensearch_segment_replication_test_index*"
  ],
  "template": {
    "settings": {
      "replication.type": "SEGMENT"
    }
  }
}
using these settings:
"settings": {
  "index": {
    "refresh_interval": "5s",
    "sort.field": "timestamp",
    "sort.order": "asc"
  },
  "number_of_replicas": 1,
  "number_of_shards": 52
}
I noticed that the service was relocating shards constantly after adding the 6 nodes and accidentally restarting the cluster (I had mistakenly configured cluster.routing.allocation.cluster_concurrent_rebalance as static config). When I checked after a while (2 days) why it was still not done relocating, I saw that this bug appeared very close in time to my messing with the cluster by adding nodes and restarting.
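For reference, here is a minimal sketch (assuming a local cluster at localhost:9200) of applying cluster.routing.allocation.cluster_concurrent_rebalance dynamically through the cluster settings API, so it can be adjusted or reset without a node restart instead of being baked into static configuration; the host, port, and value are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UpdateRebalanceSetting {
    public static void main(String[] args) throws Exception {
        // Apply the rebalance setting via the cluster settings API instead of a
        // static opensearch.yml entry, so it can be changed without a restart.
        String body = """
            {
              "persistent": {
                "cluster.routing.allocation.cluster_concurrent_rebalance": 16
              }
            }""";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_cluster/settings")) // assumed local endpoint
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```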
The workload was _bulk with only index operations, plus occasional deletion of old indexes.
@mhoffm-aiven, were _bulk index and "occasional index deletes" happening concurrently?
Curious, what was the total number of documents?
...occasional deletion of old indexes.
Old segrep indexes? Or just old random indexes?
Also, I noticed these are time-based indexes using index sorting. @dreamer-89, do we have a test for this configuration?
There is also a ShuffleForcedMergePolicy that interleaves old and new segments in a new combined segment to make searching both old and recent docs more efficient for time-based sorted indexes. I'm sure you weren't using it, but I'm making a note that we should add a segrep test using this merge policy to ensure no unexpected re-allocations.
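As a rough sketch, using plain Lucene APIs, of what a time-based index with index sorting looks like at the writer level: the sort mirrors the sort.field/sort.order settings above, and the merge policy line only marks where a wrapping policy such as ShuffleForcedMergePolicy could be configured; the paths and field values are illustrative assumptions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class SortedTimeBasedIndex {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Lucene-level equivalent of "sort.field": "timestamp", "sort.order": "asc":
        // each segment is written with its documents ordered by timestamp.
        iwc.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
        // The merge policy is configured here; a wrapper like ShuffleForcedMergePolicy
        // (which delegates to a base policy) would be plugged in at this point.
        iwc.setMergePolicy(new TieredMergePolicy());

        try (FSDirectory dir = FSDirectory.open(Path.of("/tmp/sorted-index")); // assumed path
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            Document doc = new Document();
            doc.add(new StringField("message", "log line", Field.Store.YES));
            // The sort field must be present as doc values with a matching type.
            doc.add(new NumericDocValuesField("timestamp", System.currentTimeMillis()));
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```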
Thank you @nknize for the comment. No, there is no test for this configuration. We will add one, thanks for bringing this up.
Regarding the issue, I suspect it happens because of shard re-balancing after the addition of new nodes. Adding new nodes results in shard relocation, where segment files conflict after primary promotion, which is a known issue. The fix for that issue was recently merged into 2.x but did not make it into 2.4.1. I am trying to write an integration test that mimics this scenario to confirm.
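To make the suspected conflict concrete, here is a minimal Lucene-only sketch (not the integration test itself; directory names and field names are illustrative): starting from one shared commit, two copies of the same index each allocate their next segment name from the same counter, so they end up with same-named segment files whose contents and ids differ, which is the kind of mismatch a replica reports as having local copies of segments that differ from the primary.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.StringHelper;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class DivergingCopiesDemo {

    // Open the index at the given path, add one document, and commit.
    static void addDoc(Path path, String text) throws IOException {
        try (FSDirectory dir = FSDirectory.open(path);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("message", text, Field.Store.YES));
            writer.addDocument(doc);
            writer.commit();
        }
    }

    // Print the segment names and unique ids of the latest commit.
    static void printSegments(Path path) throws IOException {
        try (FSDirectory dir = FSDirectory.open(path)) {
            for (SegmentCommitInfo sci : SegmentInfos.readLatestCommit(dir)) {
                System.out.println(path + " -> " + sci.info.name
                        + " id=" + StringHelper.idToString(sci.info.getId()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path primary = Files.createTempDirectory("primary");
        Path replica = Files.createTempDirectory("replica");

        // Shared starting point: one committed segment (_0).
        addDoc(primary, "doc before divergence");
        try (var files = Files.list(primary)) {
            for (Path f : files.toList()) {
                Files.copy(f, replica.resolve(f.getFileName()));
            }
        }

        // Both copies now independently write their next segment, reusing the
        // same segment name but with different contents and ids.
        addDoc(primary, "written by the new primary");
        addDoc(replica, "written by the old, behind copy");

        printSegments(primary);
        printSegments(replica);
    }
}
```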
Hey @nknize,
There were _bulk index operations happening concurrently, but not on the indexes that were about to be deleted (those had gone 2 days with no new documents); and yes, the deleted indexes were old segrep indexes.
@mhoffm-aiven this is helpful. Do you have any data characteristics we could use to repro a scenario w/ a similar synthetic test data set? Number of docs, number of fields? Did you change any of the default _bulk parameters?
and restarting the cluster by accident
Did you perform a full cluster or rolling restart?
Hey @nknize,
sorry for the late response, I was a bit swamped. I checked: no _bulk parameters were changed. The _bulk payloads are one megabyte in size; I don't know how many documents that is, and it varies since they are log lines. I collected them into one-megabyte payloads and then flushed by issuing the accumulated _bulk request. The mapping is basically a message text field and a couple of auxiliary keywords. Throughput as a whole was roughly 50-60 MB/s.
I did restart all OpenSearch processes (by accident) while in the process of draining and adding nodes, IIRC.
Do you need any more detail?
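For reference, a minimal sketch of the accumulate-and-flush pattern described above: append index actions to an NDJSON buffer and POST it to _bulk once it reaches roughly one megabyte. The index name, endpoint, and single-field document are illustrative assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class BulkAccumulator {
    private static final int FLUSH_BYTES = 1_000_000; // ~1 MB per _bulk request
    private final HttpClient client = HttpClient.newHttpClient();
    private final StringBuilder buffer = new StringBuilder();

    // Append one log line as an index action; flush once the payload reaches ~1 MB.
    void add(String indexName, String logLine) throws Exception {
        buffer.append("{\"index\":{\"_index\":\"").append(indexName).append("\"}}\n");
        buffer.append("{\"message\":").append(quote(logLine)).append("}\n");
        if (buffer.toString().getBytes(StandardCharsets.UTF_8).length >= FLUSH_BYTES) {
            flush();
        }
    }

    // Send the accumulated NDJSON payload as one _bulk request and reset the buffer.
    void flush() throws Exception {
        if (buffer.length() == 0) {
            return;
        }
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_bulk")) // assumed local endpoint
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(buffer.toString()))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("_bulk flush: HTTP " + response.statusCode());
        buffer.setLength(0);
    }

    // Minimal JSON string escaping, sufficient for this example only.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) throws Exception {
        BulkAccumulator acc = new BulkAccumulator();
        for (String line : List.of("first log line", "second log line")) {
            acc.add("opensearch_segment_replication_test_index-2023-01-26", line);
        }
        acc.flush(); // flush whatever is left below the threshold
    }
}
```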
We've since added better logging to print the segments in conflict when this is caught again.
In the meantime, I have done some digging on these gen fields.
SegmentCommitInfo#fieldInfosGen
- this file name is derived from segmentInfo.name, which is bumped with a counter. - link
SegmentCommitInfo#docValuesGen
- Also derived from segmentInfo.name - link
SegmentCommitInfo#delGen
- Yes, .liv is derived from the delGen and is an issue. Our InternalEngine is only using soft deletes with indexWriter#softUpdateDocument, and in my unit tests I'm not able to generate .liv; will take a closer look here... The other issue is that advancing this counter does not look possible given it's package-private in SegmentCommitInfos, nor do I think it's wise to depend on this behavior going forward.
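As a minimal Lucene-only illustration of the soft-delete point above (the soft-deletes field name and temp directory are illustrative, not our engine's actual configuration): an update through softUpdateDocument records the delete in a doc-values field, so no .liv live-docs file is written, whereas .liv files come from hard deletes and are named from delGen.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;

public class SoftDeleteDemo {
    public static void main(String[] args) throws Exception {
        var path = Files.createTempDirectory("soft-deletes");
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer())
                // Deletes are recorded in this doc-values field rather than in .liv files.
                .setSoftDeletesField("__soft_deletes");

        try (FSDirectory dir = FSDirectory.open(path);
             IndexWriter writer = new IndexWriter(dir, iwc)) {
            Document doc = new Document();
            doc.add(new StringField("id", "1", Field.Store.YES));
            writer.addDocument(doc);
            writer.commit();

            // Update the document; the old version is soft-deleted via doc values.
            Document updated = new Document();
            updated.add(new StringField("id", "1", Field.Store.YES));
            writer.softUpdateDocument(new Term("id", "1"), updated,
                    new NumericDocValuesField("__soft_deletes", 1));
            writer.commit();
        }

        // List the index files: no .liv file appears, only doc-values generation files.
        try (var files = Files.list(path)) {
            files.forEach(f -> System.out.println(f.getFileName()));
        }
    }
}
```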
Will take a look at https://github.com/opensearch-project/OpenSearch/issues/6369 in an effort to better select a new primary. The original counter increase was an extra guardrail to prevent behind replicas from getting selected as primary and re-writing segments under the same name. We could easily identify this case when syncing replicas by considering the primary term, but that would clobber the file and break ongoing queries.
The .liv files are named using the delGen values. However, in our current implementation, InternalEngine is only using soft deletes, so we would not be creating any .liv files that would have to be copied over to the replica.
Since there are no .liv files, this scenario does not present a problem in the current implementation, so I am closing it. We can revisit it if the implementation of deletes changes.
Coming from https://github.com/opensearch-project/OpenSearch/pull/4365#issuecomment-1371087457, we need to take another look at the different SegmentCommitInfo generations and update them if needed. As part of this bug, identify the uses of the different generation fields in SegmentCommitInfo and possibly bump them as well.