sourcenetwork / defradb

DefraDB is a Peer-to-Peer Edge Database. It's the core data storage system for the Source Network Ecosystem, built with IPLD, LibP2P, CRDTs, and Semantic open web properties.

Introduce proactive test suite #2545

Open AndrewSisley opened 5 months ago

AndrewSisley commented 5 months ago

Proactive tests

At the moment all our tests are reactive: they are written by devs guessing at ways in which the database can fail. We should add some fancy magic to generate tests that actively seek out bugs - kind of like an advanced form of fuzz testing.

At least at first, it will be limited to comparing results between two branches (e.g. a refactor branch and develop) - this would largely remove the need to predict (or interact with) the results of actions.

Project structure

The project is split into four distinct parts:

1. the set of supported action types;
2. the general-purpose generator(s);
3. the engine responsible for orchestrating everything related to generating tests;
4. the test executor(s) responsible for executing the generated tests.

Actions

Each supported action type will implement a common interface with two functions.

The first function (called `Entropy` for now) takes in a set of parameters (potentially an actual `[]any`; the first parameter would likely be the DefraDB instance, to allow accounting for existing state) and returns the number of possible valid and relevant action-values that could be generated.

For example, given an `AddSchema` action-type, and assuming an empty database for simplicity, there would be two parameters - the number of schemas to generate (`c`) and the maximum number of fields per schema (`n`). The number of potential scalar field types is a constant (`NFTypes`). Field names are not relevant/low-value to our testing and can be ignored as a variable, and we will not worry about directives yet. Since each of the `n` fields can independently take one of the `NFTypes` scalar types or one of `2c` relation field types (one for the object and one for the id field, per schema), this gives us the function `(NFTypes + 2c) ^ n` per schema.
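
As a purely illustrative example of scale: with `NFTypes = 16` (the Base16 figure mentioned below), `c = 2` schemas, and `n = 3` fields per schema, each field can take one of `16 + 2*2 = 20` types, giving `20^3 = 8000` possible field-type combinations per schema.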

This entropy value allows the generator to encode any possible combination of exclusively useful, valid values into binary format without needing to know the specifics of the action - as long as the generated value is equal to or smaller than the entropy, it will be valid and unique (uniqueness being desirable to avoid unwanted bias in action generation).

In some cases entropy may become unreasonably large (e.g. generating docs containing lots of large fields), in which case the value returned from `Entropy` would need to represent a sampling of all potentially useful test values, not the actual true entropy.

The second function (called `ConstructValue` for now) takes the encoded value generated by the generator (using `Entropy`) and decodes it into a valid test action.
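
Pulled together, the interface might look something like the following minimal sketch - every name in it (`ActionType`, `Action`, the package, and both methods) is taken from or invented for this write-up, not an existing DefraDB API:

```go
package proactive

import "math/big"

// ActionType describes one kind of generatable test action (e.g. adding a
// schema). Hypothetical interface; names are from this write-up only.
type ActionType interface {
	// Entropy returns the number of possible valid and relevant
	// action-values, given the current state (params[0] would likely be
	// the DefraDB instance).
	Entropy(params ...any) *big.Int

	// ConstructValue decodes an encoded value produced by a generator
	// (bounded by Entropy) into a concrete, valid test action.
	ConstructValue(encoded *big.Int) (Action, error)
}

// Action is a single executable test step; its concrete shape is yet to be
// designed.
type Action interface {
	Name() string
}
```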

The encoding would likely be something similar to a BaseX numeric system, where each digit is treated as a bitmap item. The X in BaseX is variable depending on the location (e.g. when representing a field type on a single collection it would currently be Base16 (`NFTypes`)). The X does not need to be an even number and can be as large as we like.

It is likely that encoding/decoding values will require us to support mathematical operations on arbitrarily large integers. Go's standard library provides `math/big` for this; if it proves unsuitable we may have to write our own, in which case we should bear in mind the potential usefulness of such a package to production DefraDB and build accordingly.
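
As a sketch of the arbitrarily-large-integer side of this using `math/big`, here is a hypothetical mixed-radix ("BaseX") decoder - the bases in `main` are illustrative only:

```go
package main

import (
	"fmt"
	"math/big"
)

// decodeMixedRadix splits an encoded value into digits, where each digit
// position may use a different base (the variable X in BaseX). Digits are
// produced least-significant first.
func decodeMixedRadix(encoded *big.Int, bases []int64) []int64 {
	digits := make([]int64, len(bases))
	n := new(big.Int).Set(encoded)
	rem := new(big.Int)
	for i, base := range bases {
		n.QuoRem(n, big.NewInt(base), rem) // n, rem = n / base, n % base
		digits[i] = rem.Int64()
	}
	return digits
}

func main() {
	// E.g. two field-type digits (Base16 each) followed by a Base4 digit.
	bases := []int64{16, 16, 4}
	fmt.Println(decodeMixedRadix(big.NewInt(500), bases)) // [4 15 1]
}
```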

Note

I think it would theoretically be possible to generate the entropy for the entire test case given a set of initial limiting params; however, that seems like it would be an unreasonably large value, and it would make encoding/decoding quite a lot more complicated. I might be wrong though, and it would be amazing if we could do that.

Generators

By removing correctness and bias as problems from value generation, selecting values becomes much simpler, and the code should be abstracted away, allowing multiple generators to be implemented and swapped out.

Initially we can likely get away with using a simple random `[]byte` generator; however, long term we can experiment with replacing this with fancier generators such as neural networks - potentially allowing us to predict the actions most likely to catch bugs.
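
A minimal sketch of that simplest generator, assuming a hypothetical `Generator` interface that yields an unbiased value below a given entropy (returning a `*big.Int` rather than raw `[]byte`, for simplicity):

```go
package proactive

import (
	"crypto/rand"
	"math/big"
)

// Generator produces encoded action-values; implementations could range
// from purely random selection to learned models.
type Generator interface {
	// Next returns a value in [0, entropy).
	Next(entropy *big.Int) (*big.Int, error)
}

// randomGenerator is the simplest possible implementation: uniform,
// unbiased random selection.
type randomGenerator struct{}

func (randomGenerator) Next(entropy *big.Int) (*big.Int, error) {
	return rand.Int(rand.Reader, entropy) // uniform in [0, entropy)
}
```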

Engine

The action generation engine would coordinate the active generator and the action types. It will likely take very few (if any) parameters, other than perhaps the action-types and generators to use, and output test action sets.

It should not be responsible for executing the actions - by splitting test generation and execution we allow the generation side to be used for multiple purposes; initially that would just mean refactoring tests, but in the future it could include benchmarking, the change detector, and other interesting cases.
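
Reusing the hypothetical interfaces sketched above, the engine's core loop could be as small as the following (the round-robin choice of action type is a placeholder - the real engine could fold that choice into the generated value itself):

```go
// GenerateActions produces a set of test actions without executing any of
// them. Hypothetical sketch, building on the ActionType and Generator
// interfaces above.
func GenerateActions(
	gen Generator,
	actionTypes []ActionType,
	count int,
	params ...any,
) ([]Action, error) {
	actions := make([]Action, 0, count)
	for i := 0; i < count; i++ {
		actionType := actionTypes[i%len(actionTypes)]
		encoded, err := gen.Next(actionType.Entropy(params...))
		if err != nil {
			return nil, err
		}
		action, err := actionType.ConstructValue(encoded)
		if err != nil {
			return nil, err
		}
		actions = append(actions, action)
	}
	return actions, nil
}
```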

Executors

The generated tests will need to be executed in order to be useful. Initially, I suggest we focus on comparative testing between branches to ensure that existing behaviour is preserved.

It is important that the tests executed are recorded somewhere (initially just in the terminal perhaps) so that they can be replayed and examined outside of this action-generation system - otherwise the usefulness of the system would be greatly reduced.
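
One cheap way to get that replayability, continuing the hypothetical package above: record the encoded values rather than the constructed actions, since `ConstructValue` can rebuild each action deterministically from them (given the same initial state and parameters):

```go
// recordingGenerator wraps any Generator and logs every encoded value it
// hands out, so that a failing sequence can be printed and later replayed
// through ConstructValue. Hypothetical sketch; lives alongside the
// Generator interface above.
type recordingGenerator struct {
	inner Generator
	log   []*big.Int
}

func (r *recordingGenerator) Next(entropy *big.Int) (*big.Int, error) {
	v, err := r.inner.Next(entropy)
	if err == nil {
		r.log = append(r.log, new(big.Int).Set(v))
	}
	return v, err
}
```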

It seems likely that building the first executor will involve extracting shared components from the core integration test suite so that they can be shared between the two. This will likely also benefit the benchmark suite.

Warning

I am uncertain as to how useful this will end up being - it could prove to be too unreliable or costly to run, but it could also be amazingly useful to both the core DefraDB team and other use cases (Source and non-Source). The long-term scope of the project is fairly unbounded, and it will likely be very fun to work on. We should be cautious in how much time we invest in this.

Useful links

- https://www.cuemath.com/numbers/binary-multiplication/
- https://stackoverflow.com/questions/30025185/karatsuba-multiplication-optimization-on-byte-array?rq=4
- https://github.com/IBM/graphql-query-generator

islamaliev commented 5 months ago

That is an interesting task.

We could also use the tool with our random doc generator.

But generating random tests is one thing and a challenge on its own. Another is writing correct test expectations for the generated tests. One option might be running an analysis over the generated Arrange and Act parts (with AI?) to determine the Assert part. There are a few options there that we can discuss.

AndrewSisley commented 5 months ago

Note before disappearing for the weekend and potentially getting caught up in bug bashing and normal dev next week:

I am closing in on an initial plan for this in my branch 2545-proactive-tests. The comments in add_schema_test.go are the most valuable thing in that branch so far. It may not be particularly readable to others (or to my medium-to-long-term future self) atm, so I should try to tidy up my thoughts at the start of the next sprint if I don't get enough time to do so before.

AndrewSisley commented 4 months ago

I've updated the issue description with my thoughts from during the cooldown; the write-up was also posted in this comment to preserve the history.
