Develop a tool to generate "evil"/edge case datasets for OpenSearch

This proposal has been significantly modified as of 11/5. The original proposal can be expanded at the bottom of this post. Note that comments before 11/5 refer to the original proposal.

Is your feature request related to a problem? Please describe.

The behavior of ElasticSearch/OpenSearch changes between versions in ways both intentional (new features or datatypes, deprecated features) and unintentional (bugs). These changes impact upgrades and the decisions users make around them.

It would be useful to have a dataset that intentionally probed edge cases and behavior changes and could be used to verify the behavior of various versions and in various situations. As an example of how we would use this dataset: we could upload it to a cluster, run a series of queries, migrate/upgrade the cluster, and re-run the queries to ensure that the behavior has stayed the same (or changed only in expected ways).

This would be a living dataset—there will be many cases relevant to future versions of OpenSearch that we're not aware of today.

Describe alternatives you've considered

The backwards compatibility tests (especially the Full Cluster Restart tests) use some randomized testing data (source), including a few that seem targeted towards specific edge cases (e.g. the "field.with.dots" field name), but they're quite small and limited. Additionally, they largely don't focus on edge cases and there's no concept of behavior changing across versions.

The other related material I've found is the OpenSearch Benchmark Workloads. Per my understanding, this a collection of datasets with accompanying operations—index, update, aggregations to run, etc. The datasets seem to cover a broad list of realistic usecase scenarios, and are therefore interesting, but tailored to a different purpose. None of them seem to intentionally target the edge cases of interest in this case.

A previous version of this proposal suggested that datasets should be randomly generated with the option to test scale or performance related limits. In this version, the initial suggestion has been scaled down to focus specifically on edge cases that are specifically reproducible.

Describe the solution you'd like

I'd like to create a library of data "points" that can be used independently or together to

Each datapoint (not necessarily a single document) would be a directory that contains one or more of:

a bulk-upload formatted JSON file that contains the relevant data and is used to index it
a query (TODO: details on formatting here?) that exercises the behavior in question
an expected result for the query (the answer key). There could be multiple expected results for different versions or other situations.
a README.md that explains in human-friendly terms what's being tested and under what conditions the expected result is actually expected.

Datapoints would generally create and use their own index to prevent interference between various tests. This also allows it to be hard-coded into the bulk json file and the query.

Phase 0: Data points are created with data & queries to illustrate known bugs, features, and API changes. They are run manually by a user (documentation provided) for each use case the user is interested in and the actual query result can be compared to the expected result.

Phase 1: A "runner" script is added that can take a list of test cases, run them all, and show which did not give the expected result.

Phase 2: Test cases can be tagged with specific versions or areas of interest (e.g. test cases for a specific plugin) and the runner script can select all datapoints meeting a specific use case.

After phase 1, this has a large amount of potential overlap with the future of #24 and the validation framework, so I haven't attempted to extrapolate too far down the path of what comes next.

Original Proposal

**Is your feature request related to a problem? Please describe.** I (and my team) would like to make use of a consistent dataset for testing on OpenSearch that emphasizes edge cases--in our case, this would be very helpful for testing migrations and upgrades. While there are plenty of sample datasets out there (some mentioned below), our hope for this one is that it's a fairly comprehensive dataset that can capture intentional or unintentional differences in behavior in various settings, such as different versions. A few categories we're aware of wanting to test: all currently existing data types, cases where dynamic field mapping behavior has changed, cases where new data formats were added, cases that approach the size limits for each field type, anywhere bugs have been fixed in various versions for ingestion or storage of specific field types. We're expecting to find more as we go and would love suggestions. This would be a living dataset—there will be cases relevant to future versions of OpenSearch that we're not aware of today. As an example of how we would use this dataset: we could upload it to a cluster, run a series of queries, migrate/upgrade the cluster, and re-run the queries to ensure that the behavior has stayed the same (or changed only in expected ways). **Describe alternatives you've considered** The backwards compatibility tests (especially the Full Cluster Restart tests) use some randomized testing data ([source](https://github.com/opensearch-project/OpenSearch/blob/main/qa/full-cluster-restart/src/test/java/org/opensearch/upgrades/FullClusterRestartIT.java#L112#L172)), including a few that seem targeted towards specific edge cases (e.g. the "field.with.dots" field name). The other related material I've found is the [OpenSearch Benchmark Workloads](https://github.com/opensearch-project/opensearch-benchmark-workloads). Per my understanding, this a collection of datasets with accompanying operations—index, update, aggregations to run, etc. The datasets seem to cover a broad list of realistic usecase scenarios, and are therefore interesting, but tailored to a different purpose. None of them seem to intentionally target the edge cases of interest in this case. I haven't come across other similar datasets, but would love to be pointed in their direction if they exist. **Describe the solution you'd like** Requirements for the dataset: 1. The dataset covers a large—approaching comprehensive—set of data types and test cases. 2. The datset is available in multiple sizes—in many cases a small number of documents (~1k?) is enough to use as testing data, but there are particularly migration/upgrade use-cases where we'll be curious about the performance with large sizes (hundreds of gigabytes into terabytes). 3. There is a set of queries and expected responses matching the dataset. Some of these will be along the lines of "how many times does 'elephant' occur in a given field" and others will be more like "what is the inferred field mapping for field X". 4. (optional) The "width" of the dataset (as opposed to the number of documents) can be reduced — a user can request a dataset that only has "basic" fields, without testing all edge cases. Or, similarly, a user can request only cases that test dynamic field mapping behavior or those that are relevant to a specific plugin. 5. (optional) The data is available in multiple export formats—csv, a json-like document ready for bulk upload, or piped directly to a cluster. 6. It is fairly simple to add a new field to the generator. Given the requirements outlined above, it seems more feasible to create a script to randomly generate appropriate data on demand than a fixed dataset. With this approach in place, there's a 7th requirement: 7. Following the pattern of the OpenSearch tests, a predictable dataset can be generated by providing a seed value. If not provided, the data will be random and the seed will be returned to the user. The user can provide (likely via a CLI) their requirements. For an MVP, this is probably just the number of documents (or total size of data) and an optional seed. Future iterations could accept the set of fields to include (requirement 4 above) and the output format (requirement 5). Setting aside input and export related functionality, the core of this script would be very similar to libraries like [faker.js](https://github.com/faker-js/faker)/[python faker](https://github.com/joke2k/faker) that generate realistic fake data, and looking into their architecture may be helpful. For some specialized fields, it's possible that leveraging one of these libraries could be useful. In the code, there needs to be a mapping between fields and functions to generate appropriate data. Many of these will be very basic—random alphanumeric string, random int, etc. with some more complicated ones (e.g. ip ranges or data that satisfies a specific edge case). Adding a new field to the dataset will require creating the generator function and adding it to the mapping with the field name. For each field that's added, there also may (or may not) be 1/ one or more queries associated with the field (and their expected values), and 2/ an index field mapping entry. It's possible that for some types of queries ("how many times does 'elephant' occur"), the randomized data is a poor match. As we encounter these cases, I think having a second, static dataset would be helpful. Adopting the benchmark workloads might be a good fit for this use case. **Specific Questions** 1. Does this (or something substantially similar) already exist? 2. Would you, as a potential user of this script/dataset, have additional requirements for it? 3. Pointers/feedback/thoughts on proposed solution?

I like this idea. The verification in our current backwards compatibility tests is closer to a shallow health check (since the focus of the test is typically to see if the cluster is healthy) rather than a deep test that vets the tricky edge cases - which seems to be what is proposing being here 👍

Some high-level thoughts:

Firstly, it seems to me that there are several sub-ideas in this proposal:

A static data set that is mean to trigger known edge cases in a given target OpenSearch version
An "answer key" that tells the user what the expected output is for a given target OpenSearch version
A "generator" that can create randomized data which triggers the edge cases above

It's probably better to track these in their own meta-issues since each of them are composed of several facets that change in applicability and expected behavior based on the version of OpenSearch being tested. Wdyt?

Secondly, I see that the proposal describes the data set as "emphasizes edge cases" but also "approaches a comprehensive set of data types and use-cases". Given that we've already got test data spread out across the BWC tests and the benchmarks suite, I'm concerned that trying to be comprehensive will end with this being "yet another standard". Instead, could we keep the data set just focused on edge cases?

On a similar note, I think we should also be conservative about adding performance-related use-cases to this data set - edge cases that are consistently reproducible are probably the only good candidates to include. Everything else should be covered by the benchmarks suite, IMO. I would also prefer to include such "evil performance" tests after we've fleshed out the "evil functionality" data set, since the performance tests will ideally require tooling to execute them against a cluster.

I want to make sure I’m understanding the objective correctly before I go too far. This tool would be able to generate a dataset (with particularly “evil” data) that we would expect to potentially cause a change in behavior/output when upgrading with a particular current OS version to a target OS version. This dataset could then be provided for users to test on their own test cluster. After this, I am not sure I follow point 3 concerning a set of queries and expected responses, is this something that would be valuable to a user or would it be mainly for say our own automated testing in OS? I imagine having a query a user could run on their initial cluster with “evil” data and then run again on their upgraded cluster to visually see some differences could be helpful but I’m curious for anyone’s thoughts here as I may be missing the intent.

I do share similar concerns about making the divide between BWC tests / Benchmark tests / “Evil” tests clear and not leaving ambiguity as to where a test case should belong. I am not aware of any comprehensive edge case testing that is done in OS, especially with respect to functional changes between versions, so this seems like a good unique area to target that would probably have more overlap with Lucene tests then anything currently in OS. As a starting point it may be useful to focus here even with static data as this gets rounded out.

This is a good beginning to what could be a very long work stream. Breaking this up into separate but related (cited) work streams, as kartg mentions, will likely provide better visibility.

My high level comments would be to carefully consider the complexity of the surfaces that you need to test. If the “devil’s in the details”, you’ll have to consider lots of different ways to tweak various details. Trying to do that in a single monolithic/shared test suite may turn into a maintenance nightmare - making sure that each new case doesn’t cause other cases to regress. Setting up ways to manage, reuse/compose the details and complexity will let others contribute additional tests and data as well as allow for tests that are cheaper to run - meaning that we can run them more often, which is always a good thing. In addition to the composability, modeling the data and validations’ metadata can provide super valuable feedback into other automated systems to provide differences across environments.

I understand kartg's concern that we have some datasets and you’re creating yet another one. However, if we want to find specific edge cases that triggered issues, it’s probably going to be easiest if we have a solid palette of stock contexts that we can pull from and then add one detail to create a breaking edge case. I’d recommend looking into how BWC are specified and what they’re testing. Long term, they should be unified and in such a way that minimizes the total effort. Meaning that if BWC tests are specified and have data already ready - use that. If there ARE additional things that you need to add, go through that refactoring exercise and start to prepare for a back port.

There are a couple key reasons that these edge tests can’t just lift the BWC tests wholesale. 1) Migration tests are expected to break, BWC tests are expected to work. 2) When tests break - or show differences - will those differences be expected. If they are, what metadata do we want to keep track of to let our tools reason about what the differences are. 3) Migration test scenarios will have many more contexts than the BWC tests - including many cases that we don’t know about a priori.

Smaller specific bits of feedback are as follows…

I would consider datasets as small as possible/necessary. You could vend some tools to slice and dice “kitchen sinks” of data, but when it comes time to actually running each test, any time that there was an issue, that should be its own test in perpetuity. Imagine that you start w/ a test w/ 25 different types of field mappings that have differences across some versions. I would expect to have at least 25 different tests for just single fields alone.

Some specific thoughts on modeling:

I’d consider data set generators in a granular & composable fashion.
Datasets should include not just values and documents for fields, but also configurations (plugins, clusters, infrastructure?, etc).
Tests should be able to specify what kinds of datasets they’re looking for. You have a couple options for this. A layer of indirection for types of data and types of tests will let you automatically join combinations together without needing to directly edit either test or data definitions.
An intermediary between data and tests can also provide metadata for use cases and other human understandable commentary. The human readable data should be able to be grown over time as the tests make new discoveries (find new issues).
Metadata will become a knowledge base. This is where you’ll want to put things like - before version X to after versions Y, the data from C class of data, will be modified such that queries Q will only be equivalent when both results are transformed via T_x and T_y. Let me know if that isn’t specific. There will be a lot of variables (and probably code too) - but the concept isn’t sophisticated.
Be mindful to not strongly couple any of the higher order metadata stuff to JUST data! Configurations will benefit from the same support environment.
I’m not sure how valuable faker will be. I don’t think there’s a problem in using it, but I see faker’s value in either exhaustive tests (which you could probably set up on your own w/ more control [I have no familiarity w/ faker]) OR for data that is meant to look real for humans to grok (don’t think that we’ll need to care there, but that would probably be nice to have if it works).
You’ll need to be mindful of licensing and copyright of any data that you’re pulling from 3Ps. Consider that licenses could change if you’re ever trying to download them on the fly (probably a bad idea for a number of reasons). Polluting the git repo w/ big files is a bad idea too though.

Thank you all for your feedback, this has been really helpful.

Addressing one clarification point first from @lewijacn 's comment:

This tool would be able to generate a dataset (with particularly “evil” data) that we would expect to potentially cause a change in behavior/output when upgrading with a particular current OS version to a target OS version. This dataset could then be provided for users to test on their own test cluster. [...] Concerning a set of queries and expected responses, is this something that would be valuable to a user or would it be mainly for say our own automated testing in OS?

Yes, I think the queries are going to be an integral part. Some issues might pop up from just indexing the dataset, but a lot more of the issues are going to be queries that break or data that is interpreted as a different type, for instance. Users won't be aware of these issues unless 1/ they're already running extensive tests and very proficient at looking into this (in which case, this might not benefit them much) or 2/ we provide them with the queries and the "answer key" to know what they should expect to see.

On to some of the other points, this is my summary of some of the common threads I see in the feedback.

This proposal attempts to solve a lot of problems at once through multiple mechanisms. Break both the problems and the solutions down into smaller pieces that can be designed, tracked, built, tested and used independently.
1. One of these is the matter of being a library of edge cases vs. a comprehensive dataset, with strong feedback towards being a library of edge cases.
2. Another is performance-related use cases vs. pure edge cases. I like Kartik's point here about focusing on consistently reproducible cases.
Carefully consider the overlap with the backwards compatibility tests from the perspective of not duplicating what they're already doing.
There's a lot more work to be done in terms of defining the data model that allows:
1. each edge case/issue to be its own test (necessary data + query), and
2. datasets/tests to be granular and composable.

Do you feel like that captures the bulk of your points?

I'm working on an updated proposal that slims this down to a much narrower first deliverable that emphasizes providing the minimum necessary data and queries to illustrate each issue, while focusing purely on edge cases. This will likely start as a purely static dataset, where the only axis of configurability is selecting which tests to include.

I've updated the proposal in the top comment to reflect the feedback and suggest a significantly different course. Please take a look and provide feedback on the new proposal.

opensearch-project / opensearch-migrations

Develop a tool to generate "evil"/edge case datasets for OpenSearch #9

This proposal has been significantly modified as of 11/5. The original proposal can be expanded at the bottom of this post. Note that comments before 11/5 refer to the original proposal.