linkedin / goavro

Apache License 2.0
983 stars, 219 forks

Generating a 64-bit Rabin Fingerprint (as recommended in the Avro spec) of a byte string #35

Closed astrekalova closed 6 years ago

astrekalova commented 9 years ago

Avro specification contains description of generating schema fingerprints. Is this functionality supported at all in the goavro package? Can you recommend the best way to generate a schema fingerprint?

karrick commented 8 years ago

SHA-256 is my personal favorite: IIRC it's more performant than SHA-512, assuming you want to store a fingerprint this long.

fingerprint := fmt.Sprintf("%x", sha256.Sum256([]byte(someSchemaString)))

Here are a few more examples displaying the MD5, SHA1, and SHA256 fingerprints of the same string.

import (
    "crypto/md5"
    "crypto/sha1"
    "crypto/sha256"
    "fmt"
)

func example(someSchemaString string) {
    // The hash functions operate on []byte, not string, so convert first.
    schemaBytes := []byte(someSchemaString)
    fmt.Printf("MD5:\t%x\n", md5.Sum(schemaBytes))
    fmt.Printf("SHA1:\t%x\n", sha1.Sum(schemaBytes))
    fmt.Printf("SHA256:\t%x\n", sha256.Sum256(schemaBytes))
}

strangesast commented 6 years ago

Do you have an example of using one of these alternative fingerprint methods with a kafka serializer/deserializer? I am currently using the avro-cli generated classes but it appears the fingerprint method can't be configured (hard-coded here).

I would just use the default "CRC-64-AVRO" method. It's described pretty well in the Avro spec, but I've had trouble implementing it in Go, even though the fingerprinting function used by the generated classes (described here) looks trivial to implement. This is particularly useful when not using a schema registry.
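For what it's worth, a Go transliteration of the CRC-64-AVRO (Rabin) function is short. This is a sketch that follows the table-based pseudocode in the Avro spec; the `EMPTY` seed constant and the table-build loop come straight from there, and the input is expected to be the Parsing Canonical Form of the schema as UTF-8 bytes:

```go
package main

import "fmt"

// empty is the EMPTY seed value given in the Avro specification
// for the CRC-64-AVRO fingerprint.
const empty uint64 = 0xc15d213aa4d7a795

// fpTable is the per-byte lookup table, built the same way as the
// FP_TABLE initializer in the spec's Java pseudocode.
var fpTable = func() [256]uint64 {
	var table [256]uint64
	for i := range table {
		fp := uint64(i)
		for j := 0; j < 8; j++ {
			// In Go, negating a uint64 wraps modulo 2^64, so
			// -(fp & 1) is all-ones when the low bit is set.
			fp = (fp >> 1) ^ (empty & -(fp & 1))
		}
		table[i] = fp
	}
	return table
}()

// fingerprint64 returns the CRC-64-AVRO fingerprint of buf.
func fingerprint64(buf []byte) uint64 {
	fp := empty
	for _, b := range buf {
		fp = (fp >> 8) ^ fpTable[byte(fp)^b]
	}
	return fp
}

func main() {
	fmt.Printf("%016x\n", fingerprint64([]byte(`"int"`)))
}
```

The spec's sample fingerprints (linked later in this thread) are the right way to verify an implementation like this.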

karrick commented 6 years ago

The fingerprint function referenced in the comment above is one of many hash functions that can be used to convert a long JSON string to a fixed-size byte array that hopefully uniquely identifies the schema.

The difficulty is not so much which hash function is used, but the process of converting a non-canonical form of the JSON schema to its canonical form.

https://avro.apache.org/docs/current/spec.html#Transforming+into+Parsing+Canonical+Form

This work has not been done yet for this particular library.
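To give a sense of what the transformation involves, here is a sketch of just the [WHITESPACE] step of Parsing Canonical Form, using the standard library's `json.Compact`. This is deliberately partial: the full transformation in the spec also expands names to fullnames, strips attributes such as `doc` and `aliases`, normalizes string and integer representations, and reorders each object's fields into the spec-mandated order, none of which is shown here:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// compactSchema strips insignificant whitespace from a JSON schema document.
// This covers only the [WHITESPACE] rule of Parsing Canonical Form; the
// remaining rules (fullnames, attribute stripping, field ordering, etc.)
// still need their own passes over the parsed schema.
func compactSchema(schema string) (string, error) {
	var buf bytes.Buffer
	if err := json.Compact(&buf, []byte(schema)); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := compactSchema(`{ "type" : "int" }`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"type":"int"}
}
```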

karrick commented 6 years ago

I'm taking a look at adding a feature to output the canonicalized JSON version of the schema. The goal is for the output of this upcoming function or method to be usable as the input to any desired hashing algorithm to derive a schema ID.

karrick commented 6 years ago

I have merged in some work from a contributor to convert a valid Avro schema to Parsing Canonical Form, and I implemented the CRC-64-AVRO hash suggested in the Avro specification. Sadly, I was unable to find any sample fingerprints for the custom hash function to add to the test suite, so that code has not been merged into the mainline branch yet.

While looking for the sample fingerprints I did find the Avro Resolution Canonical Form project which aims to complement Parsing Canonical Form by adding a few necessary rules required to fully disambiguate schemas based also on their default and alias properties.

I suppose the proper course of action is to consider adding the ability to export a schema's Avro Resolution Canonical Form.

strangesast commented 6 years ago

Is the solution as simple as rewriting the following (Java class) in Go? https://github.com/apache/avro/blob/2bbb99602e9e925058ead86fc8ac4e27055b05d6/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java#L35-L45

I made use of that in this project to manually generate the parsing canonical form for the Avro schemas I need.

I could write a Go implementation myself by copying from the Java source, but it'd be ugly.

Oh, and here are a few sample schemas and their fingerprints https://github.com/apache/avro/blob/17f2d75132021fafeca29edbdcade40df960fdc9/share/test/data/schema-tests.txt