Open phritz opened 4 years ago
numbers: Why don't we use -(2 ** 52 - 1) < x < 2 ** 52 - 1, which is the range of integers that can be safely expressed in float64?
arrays: _"For arrays, we could include each item's ordinal, eg f=h("0"||f_array[0]||"1"||farray[1]...). Unclear if this is worthwhile." Another reason to include the ordinal is that A xor A is 0 so if an an array has repeated values those gets "ignored".
But including the index has a problem when an item is inserted in the middle of the array, which shifts everything which would not work because the indexes changes for the tail of the array.
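A tiny sketch of the cancellation problem (the fingerprint values are made up):

```ts
// Made-up 32-bit fingerprints for two array items.
const fpA = 0x1234abcd;
const fpB = 0x0f0f0f0f;

// Without ordinals, the composite is just the xor of the member fingerprints,
// so a repeated member cancels itself out: [A, A, B] fingerprints the same as [B].
const withRepeats = (fpA ^ fpA ^ fpB) >>> 0;
const withoutRepeats = fpB >>> 0;
console.log(withRepeats === withoutRepeats); // true

// Mixing the ordinal into each member's hash (eg h("0"||fpA), h("1"||fpA), ...)
// breaks this symmetry, but then inserting an item in the middle changes the
// ordinal, and therefore the fingerprint, of every item after it.
```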
This is awesome.
Strings
I am concerned about the expense of unicode normalization. I don't have a good sense of how expensive it is. Does it require allocation? Could this end up meaning we have to copy the entire client view when we do a full sync (maybe that's ok because we do it infrequently)? Anyway, could we use a similar trick to what you did with numbers, where we return a constant (or maybe the prefix up to the first non-normal char) for strings that aren't already normalized? I did a quick check and it seems like it is usually possible to quickly check whether a string is NFD: https://docs.rs/unicode-normalization/0.1.13/unicode_normalization/fn.is_nfd_quick.html
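A rough sketch of the "constant for non-normalized strings" idea (the sentinel value and hash signature are assumptions; `normalize()` here allocates, so this only illustrates the behavior, not the cheap quick-check):

```ts
// Illustrative only: fingerprint a string, falling back to a fixed sentinel
// when the input is not already in the chosen normal form (NFD here). A real
// implementation would want a non-allocating check like is_nfd_quick from the
// Rust unicode-normalization crate linked above.
const NOT_NORMALIZED = 0xfffffffe; // assumed sentinel, not from the proposal

function fingerprintString(s: string, hash: (bytes: Uint8Array) => number): number {
  if (s.normalize("NFD") !== s) {
    return NOT_NORMALIZED;
  }
  return hash(new TextEncoder().encode(s));
}
```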
Numbers
I think what you have done is actually really nice and strikes the right balance between complexity and sensitivity to change. If we can do incrementally better over time (e.g. with @arv's suggestion) then great, but otherwise... meh
Patch Points
Some observations:
I like the idea of restricting to fields, and further only top-level values to start. If users require more granular patches we can come up with some way for them to express that. I really like that this mechanism is flexible for the future but we don't have to actually take advantage of its power right now.
Unclear if this is worthwhile.
For structs I think it is, but for arrays it doesn't seem to work.
Status as of Oct 16, 2020: general agreement we should do this, just not clear when
Background/Rationale
A replicache client view is a key/value map of UTF-8 strings to json bytes. During pull, the diffserver requests a fresh, full client view from the customer's data layer. It diffs the new client view against a local copy of the client's current client view, and returns the update in the form of a delta "patch" to the client. The diffserver also computes a checksum of the new client view and returns it with the patch to the client so the client can verify that the patch was correctly applied.
Computing a checksum over JSON bytes is challenging because it is legal for encoders to encode the same data in different ways. Potential differences include (more detail here):
{"foo":true}
vs{ "foo": true }
)The diffserver solves this problem by encoding json to a canonical form before generating the diff and computing the checksum.
However, some customers don't need or want to return their full client view to the client via the diffserver; for example, they might be able to easily generate patches themselves. We would like to enable these customers to return a patch directly from their data layer to the client. Further, we would like to support field-level json patches of values, as opposed to the key-level replacement currently provided. In doing so, we need to ensure that replicache keeps its promise that the client faithfully tracks the server's state.
In order to keep this promise, we would like to check that, after the client applies the (potentially field-level) patch, the client and the server have the same map. We cannot use the current checksumming scheme used by the diffserver because (A) it operates at the level of keys and values and therefore does not support json patch without requiring a copy of the full map being patched, and (B) it requires json canonicalization, which is neither standardized nor universally available and requires a decent chunk of code to implement.
Problem statement
Devise an efficient way to determine, over the network, whether the contents of two maps from strings to JSON bytes are the same, such that it:
Concretely, we want to enable the customer to keep track of the contents of a map without having to store it, and to be able to communicate the expected change to the map given a json patch.
The solution does NOT have to be perfect: it's acceptable that the solution detects important classes of differences between two maps (eg, differing keys) with reasonable probability.
Proposal
Define a JSON fingerprint algorithm that generates a short content fingerprint given JSON bytes. The fingerprint function f should be incrementally computable, that is, given a map M with fingerprint F we can compute the fingerprint of patch p applied to M from p and F alone. That is, there exists an f' such that f(p(M)) == f'(p, F). Note that f' does not require the map as input: it is a function of the pre-patched fingerprint and the patch.
If we have such a fingerprint algorithm then using it is straightforward. The customer's data layer keeps the current fingerprint of the map. When it receives or generates a patch to the map it computes the expected post-patch fingerprint from the current fingerprint and the patch. It returns this expected fingerprint along with the patch to the client and sets the expected fingerprint as its current value. The client applies the patch and computes the fingerprint of the resulting map, checking that it has the expected value.
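As a sketch of that round trip (all of the helper names and types below are assumptions made for illustration, not an existing Replicache API):

```ts
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonPatchOp = { op: "add" | "replace" | "remove"; path: string; value?: Json };

// Assumed helpers, not part of any real API:
// f' from the proposal: next fingerprint from the current one plus the patch.
declare function incrementalFingerprint(current: number, patch: JsonPatchOp[]): number;
// f from the proposal: fingerprint of a full map.
declare function fingerprintOf(map: { [key: string]: Json }): number;
declare function applyPatch(map: { [key: string]: Json }, patch: JsonPatchOp[]): void;

// Customer data layer: it only stores the current fingerprint, never the map.
function handlePull(state: { fingerprint: number }, patch: JsonPatchOp[]) {
  const expected = incrementalFingerprint(state.fingerprint, patch);
  state.fingerprint = expected;
  return { patch, expectedFingerprint: expected };
}

// Client: apply the patch, recompute the fingerprint over the whole local map,
// and verify it matches what the server expected.
function applyPull(map: { [key: string]: Json }, resp: ReturnType<typeof handlePull>) {
  applyPatch(map, resp.patch);
  if (fingerprintOf(map) !== resp.expectedFingerprint) {
    throw new Error("client view fingerprint mismatch");
  }
}
```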
Note that replicache's client view map is not JSON, being a map from string keys to JSON bytes. However it's a simple matter to treat the keys as properties in a top-level JSON object, so from here on out we treat the client view map as JSON.
There are two problems to be solved in devising the fingerprint:
Fingerprint of plain old data
We'd like to fingerprint plain old data such that important classes of difference in the data likely result in a difference in its fingerprint. We fingerprint as follows, given a fast, universally available, byte-wise "hash" function h like crc32:
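A minimal sketch of what such a function could look like (the type tags, sentinel constants, and the use of FNV-1a as a stand-in for crc32 are all assumptions for illustration; the real per-type rules are a design choice):

```ts
// Stand-in for h: any fast byte-wise hash; the proposal suggests crc32,
// 32-bit FNV-1a is used here only to keep the sketch self-contained.
function h(bytes: Uint8Array): number {
  let hash = 0x811c9dc5;
  for (const b of bytes) {
    hash = Math.imul(hash ^ b, 0x01000193);
  }
  return hash >>> 0;
}

const enc = new TextEncoder();

// Illustrative per-type rules only.
function fingerprintPlain(v: null | boolean | number | string): number {
  if (v === null) return h(enc.encode("null"));
  if (typeof v === "boolean") return h(enc.encode(v ? "true" : "false"));
  if (typeof v === "number") {
    // "Return a constant for hard cases" trick: restrict to safe integers and
    // collapse everything else to a sentinel rather than trying to canonicalize
    // the many possible float encodings.
    if (!Number.isSafeInteger(v)) return 0xffffffff;
    return h(enc.encode("n" + String(v)));
  }
  // Strings: same trick, using a sentinel for not-already-normalized input.
  if (v.normalize("NFD") !== v) return 0xfffffffe;
  return h(enc.encode("s" + v));
}
```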
Fingerprint of composite data
Assume fingerprints are fixed size (eg, 4 bytes if we use crc32 for h above). We would like to define the fingerprint of a composite type (array, object) to be a function of the fingerprints of its members. Members might be removed, added, or modified as part of a patch, so we'd like to define the composite fingerprint in such a way that it is easy to compute the fingerprint resulting from removing, adding, or modifying a member. A simple function that makes this easy is xor, which is both its own inverse and commutative:
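For example (the member fingerprints are made-up values):

```ts
// Composite fingerprint = xor of the member fingerprints. Because x ^ x === 0
// and xor is commutative, removing a member is the same operation as adding it.
const members = [0x1111aaaa, 0x2222bbbb, 0x3333cccc];
const composite = members.reduce((acc, fp) => (acc ^ fp) >>> 0, 0);

// Replace member 1: xor out its old fingerprint, xor in the new one.
// No other member fingerprints are needed.
const newMember = 0x4444dddd;
const updated = (composite ^ members[1] ^ newMember) >>> 0;

// Same result as recomputing from scratch.
const recomputed = [members[0], newMember, members[2]].reduce((acc, fp) => (acc ^ fp) >>> 0, 0);
console.log(updated === recomputed); // true
```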
There are a few things we could add to the above to help us distinguish between composites that have the same members but in a different arrangement (eg, different property names but the same values or an array with the same elements in a different order). For objects, we could include the property name in the fingerprint calculation, eg f=h(property_name||value_fingerprint) where || is concatenation. For arrays, we could include each item's ordinal, eg f=h("0"||f_array[0]||"1"||f_array[1]...). Unclear if this is worthwhile.
Incremental fingerprinting
One of the primary problems we're trying to solve is to enable the customer to compute the fingerprint of the map once a patch has been applied, without having to keep the whole map. This is how we do it.
Let's call the points in JSON where a value can be added, modified, or replaced patch points. Patch points for json pointer are named fields in objects and positions in arrays. For each patch point we need to keep the existing fingerprint of its member (value) and the fingerprint of the composite that it belongs to (because its member might change). When we add or remove the value at a patch point we add or subtract its fingerprint from its composite's. However, since the composite's fingerprint changed, this necessarily changes the fingerprint of the composite, if any, that it belongs to, and so on up the tree to the root (the top-level object, aka the map).
In order to compute the incremental fingerprint for the patch, you need the fingerprints from the patch point up through the root, as well as the fingerprint of the patched value if adding or replacing. For example, to patch
/foo/1/bar/3/baz
we would have to keep the following records, where the 4-byte value is the fingerprint. Let's call this map the patch table.
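For illustration (the fingerprint values are placeholders), the patch table for that pointer might look like:

```
""                  -> 0x4a1b2c3d   (root, ie the top-level map)
"/foo"              -> 0x9e8f7a6b
"/foo/1"            -> 0x13572468
"/foo/1/bar"        -> 0x0badcafe
"/foo/1/bar/3"      -> 0xdeadbeef
"/foo/1/bar/3/baz"  -> 0x00c0ffee
```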
To replace
/foo/1/bar/3/baz
we'd compute its new value from the patch, set the fingerprint of the new value in the patch table, subtract its old fingerprint from its parent's, and then add its new fingerprint to its parent's. We'd do this all the way up to the root.

When a value is removed from a patch point, we can remove the patch point from the patch table.
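A concrete sketch of the replace case, assuming the plain xor combine described above and the placeholder patch table from the example (if property names or ordinals were hashed in at each level, each step would need to re-hash its parent rather than just xor):

```ts
// Hypothetical patch table: JSON pointer -> 32-bit fingerprint ("" is the root map).
const patchTable = new Map<string, number>([
  ["", 0x4a1b2c3d],
  ["/foo", 0x9e8f7a6b],
  ["/foo/1", 0x13572468],
  ["/foo/1/bar", 0x0badcafe],
  ["/foo/1/bar/3", 0xdeadbeef],
  ["/foo/1/bar/3/baz", 0x00c0ffee],
]);

// Replace the value at a patch point: at every ancestor, xor out the old child
// fingerprint and xor in the new one, walking all the way up to the root.
function replaceAt(pointer: string, newFp: number): void {
  let childOld = patchTable.get(pointer)!;
  let childNew = newFp;
  patchTable.set(pointer, childNew);
  for (let p = pointer; p !== ""; ) {
    p = p.slice(0, p.lastIndexOf("/")); // parent pointer; "" for the top level
    const parentOld = patchTable.get(p)!;
    const parentNew = (parentOld ^ childOld ^ childNew) >>> 0;
    patchTable.set(p, parentNew);
    childOld = parentOld;
    childNew = parentNew;
  }
}

replaceAt("/foo/1/bar/3/baz", 0x12345678);
console.log(patchTable.get("")); // expected fingerprint of the whole map after the patch
```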
Scale of the patch table
The customer's data layer has to persist a patch table for each client. This is roughly 10-20 bytes per entry times the number of patch points. Since with full json pointer syntax any array index is a potential patch point, the number of patch points could grow very quickly. At the very least we should probably limit patch points to fields of objects, or perhaps we should limit depth, or require that the patch points be declared. Thoughts?
To get an idea of the sizes we're talking about here, 1,000 patch points would be 10-20KB uncompressed.
With even just a little engineering (eg a huffman-coding-like approach where we keep a dictionary of common paths and substitute them with $1, $2, etc) we can probably get the size down to 20% of the above numbers, meaning a few kilobytes per 1,000 points, which could then be compressed.
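A toy version of that substitution (the dictionary entries and paths are invented for illustration):

```ts
// Substitute common path prefixes with short tokens before storing/transmitting
// the patch table; the dictionary itself only needs to be stored once.
const dictionary: [string, string][] = [
  ["/todos", "$1"],
  ["/settings/user", "$2"],
];

function compressPath(path: string): string {
  for (const [prefix, token] of dictionary) {
    if (path === prefix || path.startsWith(prefix + "/")) {
      return token + path.slice(prefix.length);
    }
  }
  return path;
}

console.log(compressPath("/todos/42/description")); // "$1/42/description"
```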
Pseudocode sketches