opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.75k stars 1.81k forks source link

Safety/correctness checker for OpenSearch software #3866

Open Bukhtawar opened 2 years ago

Bukhtawar commented 2 years ago

Is your feature request related to a problem? Please describe.

High complexity increases the probability of human error in design, code, and operations. Errors in the system could cause loss or corruption of data, or violate other interface contracts that our customers depend on. We should also have formal methods to ensure the correctness of our replication protocols. TLA+ model for verification is an industry standard and maybe with segrep we could probably introduce one

With deep changes being made to core espl in the Engine with the new segment based replication strategy and it's integration with remote segment store and translog, it becomes imperative that we re-evaluate safety, correctness and data integrity properties of the system.

Describe the solution you'd like Mechanism to inject failures and verify system behaviour for correctness, safety and data integrity

Describe alternatives you've considered Jepsen is one of the various softwares(not AL2.0) that provides similar capabilities.

Additional context Add any other context or screenshots about the feature request here.

Bukhtawar commented 2 years ago

cc: @mch2 @sachinpkale @ashking94 @kartg @andrross @reta @nknize Looking for thoughts on adding this framework.

reta commented 2 years ago

+1 to look into Jepsen, Elastic has formal TLA+ models published (AL 2.0) [1] which could be quite helpful, although no updates since 2019.

[1] https://github.com/elastic/elasticsearch-formal-models

andrross commented 2 years ago

I think this is great idea @Bukhtawar. What do you think the best way to get started is? We could start from the TLA+ models referenced above and modify as necessary (if anything is out of date from 2019). Or we could start fresh and build on top of that.

sachinpkale commented 2 years ago

I completely agree with the proposal to add correctness checker. We started listing down failure scenarios of remote segment store and realized that there are many corner cases and almost impossible to cover them manually.

Bukhtawar commented 2 years ago

@andrross I guess we could start from the existing models that are there and see if they need an update or need more comprehensive cases for remote store and recent engine changes that we are making