[FEATURE] Data Streams support for cross cluster replication

soosinha commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, cross-cluster replication is supported for indices only, not for data streams. Data streams were released in OpenSearch 1.0 (reference: https://opensearch.org/docs/latest/opensearch/data-streams/), so cross-cluster replication should support them as well.

Describe the solution you'd like

We should be able to trigger the start replication API with a data stream, the same way we do with an index.

Describe alternatives you've considered

One workaround is to set up auto-follow on the backing indices of the data stream. However, that creates one replication task per backing index, so as the number of backing indices grows with each rollover, the number of replication tasks grows too and might affect cluster performance.
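To make the workaround concrete, here is a minimal Python sketch that builds the request for the existing auto-follow API. It assumes backing indices are named `.ds-<data stream>-<generation>` and that the auto-follow pattern matches them. The rule name, the connection alias `my-connection-alias`, the data stream name `datastream-logs`, and the `all_access` role are placeholders, not values from a real cluster.

```python
import json


def build_autofollow_request(leader_alias, data_stream, role="all_access"):
    """Return (method, path, body) for an auto-follow rule on a data stream's backing indices."""
    # Assumption: backing indices follow the ".ds-<data stream>-<generation>" naming.
    pattern = f".ds-{data_stream}-*"
    body = {
        "leader_alias": leader_alias,
        "name": f"{data_stream}-backing-indices",  # hypothetical rule name
        "pattern": pattern,
        "use_roles": {
            "leader_cluster_role": role,
            "follower_cluster_role": role,
        },
    }
    return "POST", "/_plugins/_replication/_autofollow", body


method, path, body = build_autofollow_request("my-connection-alias", "datastream-logs")
print(method, path)
print(json.dumps(body, indent=2))
```

Every backing index matched by the pattern still gets its own replication task, which is the scaling concern described above.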

Additional context

NA

tmanninger commented 2 months ago

Is there any plan for this feature?

shwetathareja commented 1 month ago

@tmanninger As of now, this may not be prioritized. We would be happy to take contributions from the community, so please feel free to start a discussion on the proposal.

tmanninger commented 1 month ago

I can try it...

My idea (the data stream name is datastream-logs):

Start replication:

curl -XPUT -H 'Content-Type: application/json' 'https://localhost:9200/_plugins/_replication/datastream-logs/_start?pretty' -d '
{
   "leader_alias": "my-connection-alias",
   "leader_index": "datastream-logs",
   "use_roles":{
      "leader_cluster_role": "all_access",
      "follower_cluster_role": "all_access"
   }
}'

When leader_index is a data stream, create a data stream config entry (in .replication-metadata-store):

id: datastream-logs
{
  "connection_name": "pay-prod-search-a",
  "metadata_type": "DATASTREAM",
 ....
}

Internally, create a task that monitors the leader for new backend indices (config option plugins.replication.follower.datastream_fetch_interval: 30s). When a new leader backend index is created, create a replication task for it: .ds-datastream-logs.00000XX -> .ds-datastream-logs.00000XX
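The core of such a monitoring task is a comparison between the leader's backend indices and the ones that already have a replication task. Here is a minimal Python sketch of that comparison. It is only an illustration, not plugin code: the function names are hypothetical, and it assumes index names end in `-<generation>`.

```python
def generation(index_name):
    """Extract the numeric generation from a name like '.ds-datastream-logs-000002'."""
    # Assumption: the generation is the part after the last hyphen.
    return int(index_name.rsplit("-", 1)[-1])


def new_backing_indices(leader_indices, replicated_indices):
    """Return leader backend indices that have no replication task yet, oldest first."""
    known = set(replicated_indices)
    missing = [name for name in leader_indices if name not in known]
    return sorted(missing, key=generation)


leader = [
    ".ds-datastream-logs-000003",
    ".ds-datastream-logs-000001",
    ".ds-datastream-logs-000002",
]
replicated = [".ds-datastream-logs-000001"]

# Each poll (for example every datastream_fetch_interval) would start one task per result.
to_start = new_backing_indices(leader, replicated)
print(to_start)
```

Returning the oldest generation first keeps the follower's backend indices in the same order as the leader's.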

GET _plugins/_replication/follower-01/_status?pretty returns the status of all backend indices
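A data-stream-level status could be assembled from the per-index statuses. The Python sketch below shows one possible shape. The response layout and the `DEGRADED` value are assumptions for illustration only, and per-index status strings such as `SYNCING` or `PAUSED` are treated as opaque values.

```python
def summarize_data_stream_status(data_stream, index_statuses):
    """Combine per-index replication statuses into one data stream summary."""
    not_syncing = sorted(
        name for name, status in index_statuses.items() if status != "SYNCING"
    )
    all_syncing = bool(index_statuses) and not not_syncing
    return {
        "data_stream": data_stream,
        # "DEGRADED" is a hypothetical aggregate value, not an existing plugin status.
        "status": "SYNCING" if all_syncing else "DEGRADED",
        "backing_indices": dict(sorted(index_statuses.items())),
        "not_syncing": not_syncing,
    }


summary = summarize_data_stream_status(
    "datastream-logs",
    {
        ".ds-datastream-logs-000001": "SYNCING",
        ".ds-datastream-logs-000002": "PAUSED",
    },
)
print(summary)
```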

Feedback?

soosinha commented 4 weeks ago

Can you provide a more detailed proposal? The proposal should address the following:

Data streams have multiple backing indices, but only one of them is the write index while the others are read indices. So we may need active replication on only one index. How do we address this when starting replication for a pre-existing data stream that already has some rolled-over read indices?
How do we create a data stream on the follower with a pre-defined configuration (already existing read and write indices)? Is there a pre-existing API or do we need a new API?
How do we monitor when a data stream is rolled over? Should there be an auto-follow task?
Data stream indices are created using index templates. Are we going to replicate these templates as well?

tmanninger commented 4 weeks ago

Data streams have multiple backing indices, but only one of them is the write index while the others are read indices. So we may need active replication on only one index. How do we address this when starting replication for a pre-existing data stream that already has some rolled-over read indices?

That's not correct. You can update data streams with the "_update_by_query" API, which can update documents in ALL backend indices (I just tested it in my test environment). Therefore, we need to replicate all existing backend indices.

How do we create a data stream on the follower with a pre-defined configuration (already existing read and write indices)? Is there a pre-existing API or do we need a new API?

How do we monitor when a data stream is rolled over? Should there be an auto-follow task?

We should create an auto-follow task that monitors the backend indices of the leader data stream.

Initial data stream sync (the leader has 3 backend indices):
1.) Copy settings and mappings from the leader's first backend index and create the data stream.
2.) Copy settings and mappings from the leader's second index to the follower data stream and trigger a rollover.
3.) Copy settings and mappings from the leader's third index to the follower data stream and trigger a rollover.
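The steps above generalize to any number of leader backend indices. Here is a minimal Python sketch that only plans the sequence of actions. The action names are hypothetical labels, not existing plugin or OpenSearch API names, and nothing is sent to a cluster.

```python
def plan_initial_sync(data_stream, leader_backing_indices):
    """Return the ordered actions needed to rebuild the leader's backend indices on the follower."""
    if not leader_backing_indices:
        return []
    first, *rest = leader_backing_indices
    steps = [
        {"action": "copy_settings_and_mappings", "source": first},
        {"action": "create_data_stream", "name": data_stream},
    ]
    # Every further leader index needs its settings copied and then one rollover.
    for source in rest:
        steps.append({"action": "copy_settings_and_mappings", "source": source})
        steps.append({"action": "rollover", "name": data_stream})
    return steps


plan = plan_initial_sync(
    "datastream-logs",
    [
        ".ds-datastream-logs-000001",
        ".ds-datastream-logs-000002",
        ".ds-datastream-logs-000003",
    ],
)
for step in plan:
    print(step)
```

For three leader indices this yields one data stream creation and two rollovers, matching the three steps listed above.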

I don't know if there is a better way to create backend indices with the same mapping as the leader backend indices.

We need an autofollower-datastream task, which creates the backend indices in the initial phase and then monitors for newly created backend indices.

What should happen when a backend index is deleted from the leader? Should the follower also delete the index? (I think yes.)
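If the answer is yes, the monitoring task would also need to detect follower indices whose leader index is gone. Here is a minimal Python sketch of that check, assuming follower backend indices keep the same names as on the leader. The function name is hypothetical.

```python
def indices_to_delete(leader_indices, follower_indices):
    """Return follower backend indices that no longer exist on the leader."""
    leader = set(leader_indices)
    return sorted(name for name in follower_indices if name not in leader)


leader_now = [".ds-datastream-logs-000002", ".ds-datastream-logs-000003"]
follower_now = [
    ".ds-datastream-logs-000001",
    ".ds-datastream-logs-000002",
    ".ds-datastream-logs-000003",
]

# The follower still holds generation 1, which was deleted on the leader.
stale = indices_to_delete(leader_now, follower_now)
print(stale)
```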

Data stream indices are created using index templates. Are we going to replicate these templates as well ?

Leader and follower need to be in a consistent state with the same mapping. Therefore, we need to copy the mapping.

opensearch-project/cross-cluster-replication #357