Binary Reproduceability

kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.

https://kuzudb.com/

MIT License

1.42k stars 98 forks source link

Binary Reproduceability #3484

Open benjaminwinger opened 6 months ago

benjaminwinger commented 6 months ago

In addition to testing databases created on one platform on different platforms, it would be useful to do direct binary comparisons between the databases (when produced in single threaded mode). That will help cover gaps in the coverage of the binary database tests, and can also enforce that kuzu's output is deterministic and the data is being properly initialized.

It could be implemented by creating a new database alongside the binary database tests, and comparing them with something like diff -r from GNU diffutils. An older version of diffutils is available for windows here (or via chocolatey).

mewim commented 6 months ago

If the database is created under different environment, is the binary directly comparable? Could it depend on parallelism, etc?

benjaminwinger commented 6 months ago

It can depend on parallelism when copying, but if we always create databases in single-threaded mode when doing such comparisons it should be able to be directly comparable. The only differences (except for endianness, but in that case the databases won't be compatible at all) should be the order that the input data is processed when copying, but if we do it single-threaded that should be consistent. Or should be able to be at least. Maybe it turns out that we're doing something inherently non-deterministic, but the databases seemed to be almost identical between the different platforms when I was debugging kuzudb/explorer#129, and it shouldn't be very complicated to set up so I think we should see if it works.

benjaminwinger commented 6 months ago

As noted in #3501, the catalog stores and serializes catalog entries using the order produced by an std::unordered_map, which is not consistent across stdlibs. If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.

ray6080 commented 6 months ago

If we want to make them consistent, we could either use an ordered map, or sort the entries by name before serializing them.

Serialize catalog entries in a certain order makes more sense to me.

benjaminwinger commented 6 months ago

I've found that one way of ensuring that directly writing a struct to disk works consistently is to assert that it has a unique object representation via std::has_unique_object_representations_v. Among other things it ensures there is no padding (which isn't guaranteed to be zeroed and particularly in release mode will usually be uninitialized). Unfortunately it doesn't allow floats/doubles because their spec allows for multiple representations of equal values, which limits its usefulness. I suspect the float/double ambiguity is generally fine though, as I doubt that we will encounter issues with, for example, zero being stored as negative zero on a different platform and breaking binary equality.