replikativ / datahike

A fast, immutable, distributed & compositional Datalog engine for everyone.
https://datahike.io
Eclipse Public License 1.0

Separate unit tests for index types #580

Closed jsmassa closed 1 year ago

jsmassa commented 1 year ago

Addresses #520

Turns all default values for a database into dynamic vars, so they can be set via tests.edn to test specific configurations.

The idea is to use the default configuration for all tests, so each test has to be written only once.
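
A rough sketch of what I mean, with hypothetical var and key names (the real defaults would live next to the existing config code):

```clojure
(ns datahike.test.config-sketch
  (:require [clojure.edn :as edn]))

;; Hypothetical: the database defaults become dynamic vars.
(def ^:dynamic *schema-flexibility* :write)
(def ^:dynamic *keep-history?*      true)
(def ^:dynamic *attribute-refs?*    false)

(defn base-cfg
  "Config used by every test, built from the currently bound defaults."
  [db-id]
  {:store              {:backend :mem :id db-id}
   :schema-flexibility *schema-flexibility*
   :keep-history?      *keep-history?*
   :attribute-refs?    *attribute-refs?*})

(defn run-with-settings
  "Rebind the defaults from one tests.edn entry and run the given suite fn."
  [settings run-suite]
  (binding [*schema-flexibility* (:schema-flexibility settings *schema-flexibility*)
            *keep-history?*      (:keep-history? settings *keep-history?*)
            *attribute-refs?*    (:attribute-refs? settings *attribute-refs?*)]
    (run-suite)))

;; e.g. (run-with-settings (edn/read-string (slurp "tests.edn")) run-all-tests)
```

Every test would then build its config from something like `base-cfg`, and the whole suite could be run once per configuration entry in tests.edn.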

Problematic with this strategy are

For these cases, we would need some kind of translation, similar to the way it is done in the test/datahike/test/attribute-refs test namespaces. If possible, I would like to get rid of these test namespaces, as they only contain a subset of the tests and are cumbersome to maintain. Also, we would need to rewrite a lot of the tests to not check for the entity ID but, for example, run queries instead, as is already done in some updated tests.

Unproblematic, however, are config options such as

Please, tell me if this strategy makes sense or if there is some problem with it that I have overlooked.

I am also interested to hear which config options you think should be tested, or which combinations you think would make sense.

whilo commented 1 year ago

The strategy makes sense, and normalizing the tests like this must be the key strategy to avoid duplication and arbitrary symmetry breaking (i.e. having different tests for different configurations). Thanks for addressing this, @jsmassa!

I am not familiar enough with the attribute-ref mismatches. For schema settings I would suggest always assuming the stricter requirements (schema-on-write) and then potentially skipping those tests automatically for schema-on-read. This can be done with a function in the test namespaces that dispatches on the schema configuration. Do you think this would work? Would this also be possible for the attribute-refs?
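
Roughly something like this (just a sketch, names hypothetical, reusing the dynamic var from the config sketch above):

```clojure
;; As in the earlier sketch: the currently active schema flexibility.
(def ^:dynamic *schema-flexibility* :write)

(defmacro when-schema-on-write
  "Run body only when the stricter schema-on-write setting is active;
   under schema-on-read the body is skipped."
  [& body]
  `(when (= :write *schema-flexibility*)
     ~@body))

;; (deftest unique-identity-test
;;   (when-schema-on-write
;;     ;; assertions that rely on schema enforcement
;;     ))
```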

Also, is there still a reason to keep attribute-refs optional? Edit: Yes, of course, because they depend on schema-on-write. But does this mean that, when we always provide a schema, all tests can be run with attribute-refs?

whilo commented 1 year ago

A related serious problem I just stumbled over in #332 is that many tests use the same database, which has worked so far for the memory database only because the connections were not (properly) shared. It would be good if the db :id for the memory database could be derived automatically from the test name. Can we address this here as well?

Edit: I have solved this there for now by manually duplicating the connections for the tests.
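
A hypothetical helper for deriving the store id could look like this (just a sketch):

```clojure
;; Hypothetical: derive the :mem store id from the test var so that each
;; test gets its own in-memory database and connection.
(defn test-db-config [test-var]
  (let [{:keys [ns name]} (meta test-var)]
    {:store              {:backend :mem
                          :id      (str ns "/" name)}
     :schema-flexibility :write
     :keep-history?      true}))

;; (test-db-config #'datahike.test.query-test/basic-query-test)
;; ;=> {:store {:backend :mem :id "datahike.test.query-test/basic-query-test"} ...}
```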

jsmassa commented 1 year ago

The schema isn't really a problem. Yes, we can always add a schema, but we could also add it only when required; that doesn't make much of a difference, I think. The most problematic thing is still the entity ID, which is checked in a lot of the tests, so there is a lot of rewriting to do. With attribute references, entity IDs for user entities are shifted, because system attributes like :db/ident or :db/valueType also need entity IDs, and those are added to the database before the first user entities can be transacted.

whilo commented 1 year ago

Could we just grab max-eid at the beginning of the tests and use this as the offset?

jsmassa commented 1 year ago

Could we just grab max-eid at the beginning of the tests and use this as the offset?

This is approximately how I did it for the attribute reference tests. Some of the tests used specific input datoms, though, in addition to the expected output datoms, so the input had to be manipulated as well. This led me to write some helper functions that shift collections of datoms by the number of system entity datoms. So far, there are also separate functions that shift the value, which is needed for attributes with a reference value type. Of course, this could be simplified with a single helper function that detects when the value is a reference and has to be shifted.
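
Such a combined helper could look roughly like this (a sketch, names hypothetical):

```clojure
;; Shift every datom's entity id by the system offset, and also shift the
;; value when the attribute has a reference value type (ref-attr? is a
;; hypothetical predicate answered from the schema).
(defn shift-datoms
  [offset ref-attr? datoms]
  (mapv (fn [[e a v tx]]
          [(+ offset e)
           a
           (if (ref-attr? a) (+ offset v) v)
           tx])
        datoms))

;; (shift-datoms 10 #{:friend}
;;               [[1 :name "Ann" 536870913]
;;                [2 :friend 1   536870913]])
;; ;=> [[11 :name "Ann" 536870913] [12 :friend 11 536870913]]
```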

In some instances there might be no way around using functions to shift the entity IDs to the correct values, but I would say that, in the end, a database user will not care which ID an entity is allocated inside the database or whether the attributes are stored as references or keywords. This is why I think that, ideally, we should avoid checking those details for high-level functions like transact, q or pull.
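
For example, instead of asserting on concrete datoms, a test against the public API could look roughly like this (config and names hypothetical):

```clojure
(ns datahike.test.readable-example-sketch
  (:require [clojure.test :refer [deftest is]]
            [datahike.api :as d]))

;; Hypothetical config; in the proposal it would come from the bound defaults.
(def cfg {:store              {:backend :mem :id "readable-example"}
          :schema-flexibility :write
          :keep-history?      true})

(deftest transact-and-query
  (d/create-database cfg)
  (let [conn (d/connect cfg)]
    (d/transact conn [{:db/ident       :name
                       :db/valueType   :db.type/string
                       :db/cardinality :db.cardinality/one}])
    (d/transact conn [{:name "Alice"}])
    ;; assert on the queried fact, not on a concrete entity id
    (is (= #{["Alice"]}
           (d/q '[:find ?n :where [_ :name ?n]] @conn)))
    (d/delete-database cfg)))
```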

I would say that the strategies I used for the attribute reference tests in test/datahike/test/attribute_refs/ are okay and, with some improvements, can be used in a similar way to refactor most tests to work with all configurations. I would hope, though, that we can reformulate a lot of the tests so that they focus on the actual fact they are checking and read better than they would with a mechanical shifting of numbers.

whilo commented 1 year ago

I agree that we should not overfit to specific eids, and that rewriting tests to not rely on them is in general the right way forward.