RFC: incremental json fingerprinting

phritz commented 4 years ago

Status as of Oct 16, 2020: general agreement we should do this, just not clear when

Background/Rationale

A replicache client view is a key/value map of UTF-8 strings to json bytes. During pull, the diffserver requests a fresh, full client view from the customer's data layer. It diffs the new client view against a local copy of the client's current client view, and returns the update in the form of a delta "patch" to the client. The diffserver also computes a checksum of the new client view and returns it with the patch to the client so the client can verify that the patch was correctly applied.

Computing a checksum over JSON bytes is challenging because it is legal for encoders to encode the same data in different ways. Potential differences include (more detail here):

whitespace (eg, {"foo":true} vs { "foo": true })
order of fields in an object
number encoding (eg, "1e3" vs "1000")
string encoding (eg, "\n" vs "\u000A")
differences in UTF-8 encoding

The diffserver solves this problem by encoding json to a canonical form before generating the diff and computing the checksum.

However, some customers don't need or want to return their full client view to the client via the diffserver, for example the client might be able to easily generate patches of their own. We would like to enable these customers to return a patch directly from their data layer to the client. Further, we would like to support field-level json patches of values, as opposed to the key-level replacement currently provided. In doing so, we need to ensure that replicache keeps its promise that the client faithfully tracks the server's state.

In order to keep this promise, we would like to check that after the client applies the (potentially field-level) patch, that the client and the server have the same map. We cannot use the current checksumming scheme used by the diffserver because (A) it operates at the level of keys and values and therefore does not support json patch without requiring a copy of the full map being patched and (B) it requires json canonicalization which is neither standardized nor universally available, and requires a decent chunk of code to implement.

Problem statement

Devise an efficient way to determine whether the contents of two maps from strings to JSON bytes are the same over the network such that it:

is tolerant of differences in JSON encoding
is incrementally computable from a json patch
does not require keeping a copy of the entire map
the implementation is small, say ~100 lines of code or less, so that it is easy for us or our customers to implement in their language of choice

Concretely, we want to enable the customer to keep track of the contents of a map without having to store it, and to be able to communicate the expected change to the map given a json patch.

The solution does NOT have to be perfect: it's acceptable that the solution detects important classes of differences between two maps (eg, differing keys) with reasonable probability.

Proposal

Define a JSON fingerprint algorithm that generates a short content fingerprint given JSON bytes. The fingerprint function f should be incrementally computable, that is, given a map M with fingerprint F we can compute the fingerprint of patch p applied to M from p and F alone. That is, there exists an f' such that f(p(M)) == f'(p, F). Note that f' does not require the map as input: it is a function of the pre-patched fingerprint and the patch.

If we have such a fingerprint algorithm then using it is straightforward. The customer's data layer keeps the current fingerprint of the map. When it receives or generates a patch to the map it computes the expected post-patch fingerprint from the current fingerprint and the patch. It returns this expected fingerprint along with the patch to the client and sets the expected fingerprint as its current value. The client applies the patch and computes the fingerprint of the resulting map, checking that it has the expected value.

Note that replicache's client view map is not JSON, being a map from string keys to JSON bytes. However it's a simple matter to treat the keys as properties in a top-level JSON object, so from here on out we treat the client view map as JSON.

There are two problems to be solved in devising the fingerprint:

How to generate a fingerprint of plain old JSON data (null, boolean, number, string) in a way that is tolerant of differences in encoding.
How to generate a fingerprint of composite JSON data (object, array) such that given a patch to it we can efficiently compute the map's expected fingerprint from only the patch and the current fingerprint.

Fingerprint of plain old data

We'd like to fingerprint plain old data such that important classes of difference in the data likely results in a difference in its fingerprint. We fingerprint as follows, given a fast, universally available, byte-wise "hash" function h like crc32:

null, true, false: for these values f has constant fingerprints f=h("null"), h("true"), and h("false") respectively
string: f=h(NFD(string))
- JSON strings are already UTF-8 encoded as are the map's keys. However between two implementations strings could be escaped in different but equivalent ways, or expressed as different but equivalent UTF-8. Thankfully, there exists a normalized unicode form that is standardized, fast, and widely available. We therefore canonicalize the string to NFD before applying h.
number: f=h(little_endian(int32(ensure_exact_represenation(n))))
- ensure_exact_represenation(n): if -16777216 <= n <= 16777216 (the range that can be exactly represented as an integer with a single precision float) return n; if n > 16777216 return 16777216; if n < -16777216 return -16777216.
- int32(n): do a truncating conversation to an int32 (ie, drop the exponent and round down)
  - numbers are tricky. They might be written in exponential notation or not, and with the decimal placed arbitrarily. There might be different results when printing them in different languages and there is no guarantee that numbers parse to the same binary representation in different languages. We don't want to do too much with a number we have parsed because of differences in floating point settings on different machines/platforms. The suggested fingerprint above ignores fractional values and values larger than 2^24 :( I would love to have a better strategy here -- suggestions?

Fingerprint of composite data

Assume fingerprints are fixed size (eg, 4 bytes if we use crc32 for h above). We would like define the fingerprint of a composite type (array, object) to be a function of the fingerprints of its members. Members might be removed, added, or modified as part of patch, so we'd like to define the composite fingerprint in such a way that it is easy to compute the fingerprint resulting from removing, adding, or modifying a member. A simple function that makes this easy is xor, which is both its own inverse and commutative:

a composite's fingerprint is the xor of the fingerprints of all its members: f_composite = f_member_1 xor f_member_2 ...
to add a member xor its fingerprint into the composite's: f_composite = f_composite xor f_added_member
to subtract a member xor its fingerprint into the composite's: f_composite = f_composite xor f_removed_member
to modify a member subtract is old fingerprint and add its new fingerprint: f_composite = f_composite xor f_old_member xor f_new_member

There are a few things we could add to the above to help us distinguish between composites that have the same members but in a different arrangement (eg, different property names but the same values or an array with the same elements in a different order). For objects, we could include the property name in the fingerprint calculation, eg f=h(property_name||value_fingerprint) where || is concatenation. For arrays, we could include each item's ordinal, eg f=h("0"||f_array[0]||"1"||f_array[1]...). Unclear if this is worthwhile.

Incremental fingerprinting

One of the primary problems we're trying solve is to enable the customer to compute the fingerprint of the map once a patch has been applied, without having to keep the whole map. This is how we do it.

Let's call the points in JSON where a value can be added, modified, or replaced to be patch points. Patch points for json pointer are named fields in objects and positions in arrays. For each patch point we need to keep the existing fingerprint of its member (value) and the fingerprint of the composite that it belongs to (because its member might change). When we add or remove the value at the patch point we add or subtract its fingerprint from its composite's. However since the composite's fingerprint changed, this necessarily changes the fingerprint of the composite if any that it belongs to, and so up the tree to the root (top-level object aka map).

In order to compute the incremental fingerprint for the patch you need the fingerprints of the patch point up through the root, as well as the fingerprint of the patched value if adding or replacing. For example to patch /foo/1/bar/3/baz we would have to keep the records:

{
   "/": <4 byte value>,
   "/foo": <4 byte value>,
   "/foo/1": <4 byte value>,
   "/foo/1/bar": <4 byte value>,
   "/foo/bar/3": <4 byte value>,
   "/foo/1/bar/3/baz": <4 byte value>,
}

where the 4-byte-value is the fingerprint. Let's call this map the patch table.

To replace /foo/1/bar/3/baz we'd compute its new value from the patch, set the fingerprint of the new value in the patch table, subtract its old fingerprint from its parent's, and then add its new fingerprint to its parent's. We'd do this all the way up to the root.

When a value is removed from a patch point, we can remove the patch point from the patch table.

Scale of the patch table

The customer's data layer has to persist a patch table for each client. This is roughly 10-20 bytes per entry times the number of patch points. Since with full json pointer syntax any array index is a potential patch point, this number of patch points could grow very quickly. At the very least we should probably limit patch points to be fields of objects, or perhaps we should limit depth, or require that the patch points be declared. Thoughts?

To get an idea of the sizes we're talking about here 1,000 patch points would be 10-20KB uncompressed.

With even just a little engineering (eg a huffman-coding-like approach where we keep a dictionary of common paths and substitute them for $1 $2 etc) we can probably get the size down to 20% of the above numbers, meaning a few kilobytes per 1000 points which could then be compressed.

Pseudocode sketches

import library // These functions should be already available on all platforms

type fingerprint uint32

// On a full sync or to verify the full table call this. The map returned is from json pointer path
// to fingerprint. The full map fingerprint is stored at key "".
func compute_fingerprint_from_scratch(value: JSON) -> map[string => fingerprint] {
  var patch_table: map[string => fingerprint]
  var (_, _) = compute_fingerprint("", patch_table, value)
  return patch_table
}

func apply_patch(op: string, path: string, value: JSON, patch_table: map[string => fingerprint]) {
  switch op:
    case "add":
      var (_, fp: fingerprint) = compute_fingerprint(path, patch_table, value)
      patch_table[path] = fp
      do while path != ""
        path = path[0..path.lastIndexOf("/")]   // up a level
        patch_table[path] = patch_table[path] xor fp
        fp = patch_table[path]

    case "remove":
      var fp: fingerprint = patch_table[path]
      del(patch_table[path])
      do while path != ""
        path = path[0..path.lastIndexOf("/")]   // up a level
        patch_table[path] = patch_table[path] xor fp
        fp = patch_table[path]
        // TODO might want to del patch table entries going up if there is no patch point
        //      at this level anymore

    case "replace":
      var old_fp: fingerprint = patch_table[path]
      var (_, new_fp: fingerprint) = compute_fingerprint(path, patch_table, value)
      patch_table[path] = new_fp
      do while path != ""
        path = path[0..path.lastIndexOf("/")]   // up a level
        var temp: fingerprint = patch_table[path]
        patch_table[path] = path_table[path] xor old_fp xor new_fp
        old_fp = temp
        new_fp = patch_table[path]
}

// h is our "hash" function
func h(buf: []byte) -> fingerprint {
   return library.crc32(buf)
}

// - path is the json pointer path to this value.
// - patch_table is a map from json pointer path to fingerprint. Both patch points 
//    and their containing composites are included. Not every path is included in 
//    this table: a path is included in the table if one of its children has a patch point.
// - value is the JSON value to compute the fingerprint for
// 
// - bool return value indicates whether the value contains patch points, in which 
//    case the caller should add patch table entries all the way up the path as it returns
// - fingerprint returned is the fingerprint of the value passed in
func compute_fingerprint(path: string, patch_table: map[string => fingerprint], value: JSON) -> (bool, fingerprint) {
  var fp: uint32
  var add_to_table: bool = false

  switch value.type:
    case null:
      fp = h("null")

    case bool:
      fp = h(value ? "true" : "false")

    case number:
      fp = h(as_bytes_little_endian(small_integer(value)))

    case string:
      fp = h(library.NFD(value))

    case array: 
      // Assume we don't insert patch points for each index.
      for i = 0...len(value):
        var member_path: string = path + "/{i}"
        var (has_patch_points: bool, member_fp: fingerprint) = compute_fingerprint(member_path, patch_table, value[i])
        fp = fp xor member_fp
        // If it has patch points we want to add this specific member's fingerprint, but also the composite itself.
        if has_patch_points 
          patch_table[member_path] = member_fp 
          add_to_table = true 

    case object: 
      for (property: string, member: JSON) = value.properties()
        var member_path: string = path + "/{property}"
        var (has_patch_points: bool, member_fp: fingerprint) = compute_fingerprint(member_path, patch_table, value)
        fp = fp xor member_fp
        // Assuming we want to add patch points for each property
        patch_table[member_path] = member_fp 
        add_to_table = true 

  if add_to_table || path == ""
    patch_table[path] = fp

  return (add_to_table, fp)
}

func small_integer(n: float) -> uint32 {
  if n > 16777216
    return 16777216
  if n < -16777216
    return -16777216
  n = round_down(n)
  return (uint32) n
}

arv commented 4 years ago

numbers: Why don't we use -(2 ** 52 - 1) > x > 2 ** 52 - 1 which is the range of integers that can be safely expressed in float64

arrays: _"For arrays, we could include each item's ordinal, eg f=h("0"||f_array[0]||"1"||farray[1]...). Unclear if this is worthwhile." Another reason to include the ordinal is that A xor A is 0 so if an an array has repeated values those gets "ignored".

But including the index has a problem when an item is inserted in the middle of the array, which shifts everything which would not work because the indexes changes for the tail of the array.

aboodman commented 4 years ago

This is awesome.

Strings

I am concerned about expense of unicode normalization. I don't have an idea how expensive it is. Does it require allocation? Could this end up meaning we have to copy the entire client view when we do a full sync (maybe that's ok because we do it infrequently?). Anyway, could we use a similar trick to what you did with number, where we return a constant (or maybe the prefix up to the the first non-normal char) for strings that aren't already normalized? I did a quick check and it seems like it is usually possible to quickly check whether a string is nfd: https://docs.rs/unicode-normalization/0.1.13/unicode_normalization/fn.is_nfd_quick.html

Numbers

I think what you have done is actually really nice and strikes the right balance between complexity and sensitivity to change. If we can do incrementally better over time (e.g. with @arv suggestion) then great, but otherwise .. meh

Patch Points

Some observations:

This approach reminds me of prolly trees and we could almost use prolly trees, which is tempting. Prolly trees turn sequences into a tree so you'd only need store log(n) hashes, not n. However, in order to incrementally modify a prolly tree, you do need access to the original prolly tree because if you insert at an arbitrary position, you need the data around the insertion point up to the next chunk boundary on either side. I don't see any way around that.
I really like how flexible this system is to choosing different level of patch detail.
The patch table never has to be transmitted it is simply stored on either side -- so it sort of doesn't matter how big it is
Unfortunately customer must store multiple patch tables right? Or maybe one per-client?
It's possible, theoretically for the server to return a patch that client cannot apply because it doesn't have the right patch table, not because the patch itself doesn't apply! In this case client must request full sync.

I like the idea of restricting to fields, and further only top-level values to start. If users require more granular patches we can come up with some way for them to express that. I really like that this mechanism is flexible for the future but we don't have to actually take advantage of its power right now.

aboodman commented 4 years ago

Unclear if this is worthwhile.

For structs I think it is, but for arrays it doesn't seem to work.

rocicorp / repc