What are you proposing?
This describes a platform to set up an OpenSearch cluster that replays traffic from a source cluster, which may run a different version, and compares the traffic between the two clusters for performance and content differences. Setting up a "shadow cluster" like this creates several benefits for data migration, validation, and stress testing.
Clusters can stay in sync so that the downtime when switching from the source to a target cluster is minimized; users can be made aware of differences that will affect their workloads; and customers can compare the performance and costs of clusters under different but relevant traffic scenarios.
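To make the comparison step concrete, here is a minimal sketch of what a response comparator might do, given the same request sent to both clusters. The field names (`status`, `body`, `latency_ms`) and the ignored-keys default are illustrative assumptions, not a fixed schema from this proposal.

```python
def compare(source, target, ignore_keys=("took",)):
    """Summarize content and latency differences between two cluster responses.

    Each response is a dict with hypothetical fields: "status" (HTTP code),
    "body" (parsed JSON body), and "latency_ms" (observed latency).
    Keys in ignore_keys (e.g. timing fields) are excluded from body diffs.
    """
    diffs = {}
    if source["status"] != target["status"]:
        diffs["status"] = (source["status"], target["status"])
    # Strip keys expected to differ (like query timing) before comparing bodies.
    src_body = {k: v for k, v in source["body"].items() if k not in ignore_keys}
    tgt_body = {k: v for k, v in target["body"].items() if k not in ignore_keys}
    if src_body != tgt_body:
        diffs["body"] = {k for k in set(src_body) | set(tgt_body)
                         if src_body.get(k) != tgt_body.get(k)}
    # Positive delta means the target cluster was slower for this request.
    diffs["latency_delta_ms"] = target["latency_ms"] - source["latency_ms"]
    return diffs
```

A real comparator would also need to handle nested documents, result ordering, and pagination, but the shape of the output (which fields differ, and by how much the latency changed) is the core of the user-facing report.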
Which users have asked for this feature?
Some users do shadow testing on their own already, though they're managing the entire process. Consultants assisting users have asked for ways to do "apples to apples" comparisons, especially for hardware differences.
What problems are you trying to solve?
When users want to migrate to a different version of OpenSearch or use a different hosting environment (node hardware), they want to be sure that they understand all of the differences in performance and results. Whether they decide to proceed with the migration or not, they want the process to be as smooth and inexpensive as possible.
What is the developer experience going to be?
This should not impact existing OpenSearch projects. Instead, it will create new tooling to interface with clusters. Plugins may be vended to enable traffic capture from the source cluster.
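As a sketch of what such a capture plugin might do, the snippet below tees each request onto a durable channel while the source cluster handles it normally. The function name and record fields are hypothetical, and a local queue stands in for the Kafka/Kinesis channel mentioned below.

```python
import json
import queue
import time

# Local stand-in for a durable stream (Kafka/Kinesis) in production.
capture_channel = queue.Queue()

def capture_request(method, path, body=None):
    """Record a request so a replayer can later send it to the target cluster."""
    record = {
        "ts": time.time(),  # capture time, used later to pace the replay
        "method": method,
        "path": path,
        "body": body,
    }
    # Mutations (PUT, POST, DELETE) would be captured synchronously, at a
    # small latency cost; reads could be captured asynchronously.
    capture_channel.put(json.dumps(record))

capture_request("PUT", "/my-index/_doc/1", '{"title": "hello"}')
```

The key design point is that capture is a tee, not a proxy decision: the source cluster's behavior is unchanged, and the replayer consumes the channel independently.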
This solution will involve a number of related but independent components. These should use the most appropriate environment for each task (Python, Java, Node, etc.). They should not make assumptions about the implementations of the other layers. The components should be programs/commands packaged as Docker containers that can be composed together to create an end-to-end (E2E) solution for testing and development. These containers may be deployed locally or to cloud infrastructure, such as AWS, where managed services like Kafka or Kinesis may replace the simpler implementations used for local development.
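A composition of these containers might look something like the following sketch. All service and image names here are placeholders for illustration, not the project's actual artifacts.

```yaml
# Illustrative docker-compose sketch; image names are hypothetical.
services:
  capture-proxy:
    image: example/capture-proxy:latest    # sits in front of the source cluster
    ports: ["9201:9201"]
  kafka:
    image: bitnami/kafka:latest            # local stand-in; managed Kafka/Kinesis in the cloud
  replayer:
    image: example/traffic-replayer:latest # consumes captured traffic, sends to target
    depends_on: [kafka]
  comparator:
    image: example/traffic-comparator:latest
    depends_on: [replayer]
```

Because each component only depends on the channel between layers (here, Kafka), any one of them can be swapped for a cloud-managed equivalent without changing the others.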
Are there any security considerations?
There are no new APIs, but this proposal propagates data from a source cluster to a new cluster over several different channels. TLS will be used for all network communication, and all systems persisting data will use encryption at rest. Granular authorization controls and strong network isolation should be supported for all components that handle data.
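As an example of the TLS posture described above, a component such as the replayer might build its client context as follows. This is a sketch using Python's standard `ssl` module; the function name and the optional CA-bundle parameter are assumptions.

```python
import ssl

def make_replayer_ssl_context(ca_path=None):
    """Build a strict TLS client context for talking to a cluster.

    ca_path optionally points at a private CA bundle; by default the
    system trust store is used.
    """
    ctx = ssl.create_default_context(cafile=ca_path)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocols
    ctx.check_hostname = True                      # verify the server's identity
    ctx.verify_mode = ssl.CERT_REQUIRED            # never accept unverified peers
    return ctx
```

Equivalent settings would apply at every hop: capture proxy to stream, stream to replayer, and replayer to target cluster.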
Are there any breaking changes to the API?
This should not break anything in the existing clusters, though there may be a minor to moderate performance hit against the source cluster when moving historical data and when capturing traffic.
What is the user experience going to be?
Are there breaking changes to the User Experience?
This creates a new user experience. Existing ones should not be affected during a migration. Post-migration, many things could be different; surfacing those differences is the point of these tools.
Why should it be built? Any reason not to?
This fills a common gap for users who would like to have a better understanding of differences between their original and prospective environments.
Reasons not to build this all circle around difficulty; some of the concerns are listed in the Architecture section.
What will it take to execute?
There will be some performance and risk impact on source clusters that are not already logging traffic. That impact may be around 10 ms of added latency for modifications (PUT, DELETE, POST) and negligible (asynchronous) for all other requests. For this to be useful, users will also need to invest considerable resources and attention in understanding differences and managing a clean migration process.
Any remaining open questions?
Many. The hope is that iterative development will help close them.