SHA256 is my personal favorite, because IIRC it's more performant than SHA512, and assuming you don't mind storing a fingerprint that long.
fingerprint := fmt.Sprintf("%x", sha256.Sum256([]byte(someSchemaString)))
Here are a few more examples displaying the MD5, SHA1, and SHA256 fingerprints of the same string.
import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"fmt"
)

// example prints three different fingerprints of the same schema string.
// Note the hash functions take a []byte, not a string.
func example(someSchemaString string) {
	fmt.Printf("MD5:\t%x\n", md5.Sum([]byte(someSchemaString)))
	fmt.Printf("SHA1:\t%x\n", sha1.Sum([]byte(someSchemaString)))
	fmt.Printf("SHA256:\t%x\n", sha256.Sum256([]byte(someSchemaString)))
}
Do you have an example of using one of these alternative fingerprint methods with a Kafka serializer/deserializer? I'm currently using the avro-cli generated classes, but it appears the fingerprint method can't be configured (it's hard-coded here).
I would just use the default "CRC-64-AVRO" method. It's described pretty well in the Avro spec, though I've had trouble implementing it in Go. That said, the fingerprinting function used by the generated classes, described here, is actually trivial to implement. This is particularly useful when not using a schema registry.
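It really is a short function. Here is a minimal Go sketch of CRC-64-AVRO, ported from the pseudocode in the Avro specification; the seed constant comes from the spec, while names like fingerprint64 and fpTable are mine. The input should be the schema's Parsing Canonical Form, encoded as UTF-8.

package main

import "fmt"

// empty is the CRC-64-AVRO seed value given in the Avro specification.
const empty = uint64(0xc15d213aa4d7a795)

// fpTable is the precomputed lookup table described in the spec.
var fpTable [256]uint64

func init() {
	for i := range fpTable {
		fp := uint64(i)
		for j := 0; j < 8; j++ {
			fp = (fp >> 1) ^ (empty & -(fp & 1))
		}
		fpTable[i] = fp
	}
}

// fingerprint64 returns the 64-bit CRC-64-AVRO ("Rabin") fingerprint of buf.
func fingerprint64(buf []byte) uint64 {
	fp := empty
	for _, b := range buf {
		fp = (fp >> 8) ^ fpTable[byte(fp)^b]
	}
	return fp
}

func main() {
	// `"null"` is already in Parsing Canonical Form.
	fmt.Printf("%016x\n", fingerprint64([]byte(`"null"`)))
}

The table-driven loop mirrors the spec's pseudocode, which is also what the Java SchemaNormalization class referenced later in this thread implements.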
The fingerprint function referenced in the comment above is just one of many hash functions that can convert a long JSON string into a fixed-size byte array that (hopefully) uniquely identifies the schema.
The difficulty is not so much which hash function is used, but rather the process of converting a non-canonical form of the JSON schema into its canonical form:
https://avro.apache.org/docs/current/spec.html#Transforming+into+Parsing+Canonical+Form
This work has not been done yet for this particular library.
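To make the canonicalization step concrete, here is a made-up example following the rules in the spec. This input schema

{
  "namespace": "com.example",
  "name": "User",
  "type": "record",
  "doc": "A user record",
  "fields": [
    {"name": "id", "type": "long", "doc": "primary key"}
  ]
}

reduces to this Parsing Canonical Form (fullname substituted for name/namespace, doc stripped, attributes reordered, whitespace removed):

{"name":"com.example.User","type":"record","fields":[{"name":"id","type":"long"}]}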
I'm taking a look at adding a feature to output the canonicalized JSON version of the schema. The goal is for the output of this upcoming function or method to serve as input to any desired hashing algorithm to derive a schema ID.
I have merged in some work from a contributor to convert a valid Avro schema to Parsing Canonical Form, and I implemented the CRC-64-AVRO hash suggested by the Avro specification. Sadly, I was unable to find any sample fingerprints produced with that custom hash function to add to the test suite, so that code has not been merged into the mainline branch yet.
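With the canonical-form work merged, any standard hash can be layered on top of it. A rough sketch, assuming the merged work exposes the canonical form via something like codec.CanonicalSchema() — treat both that method name and the import path as guesses to verify against the current goavro v2 API:

package main

import (
	"crypto/sha256"
	"fmt"

	goavro "gopkg.in/linkedin/goavro.v2"
)

func main() {
	codec, err := goavro.NewCodec(`{"type":"record","name":"User","fields":[{"name":"id","type":"long"}]}`)
	if err != nil {
		panic(err)
	}
	// Hash the Parsing Canonical Form rather than the raw schema text, so
	// that semantically identical schemas produce identical fingerprints.
	canonical := codec.CanonicalSchema()
	fmt.Printf("canonical: %s\n", canonical)
	fmt.Printf("SHA256: %x\n", sha256.Sum256([]byte(canonical)))
}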
While looking for the sample fingerprints I did find the Avro Resolution Canonical Form project, which aims to complement Parsing Canonical Form by adding a few necessary rules required to fully disambiguate schemas based also on their default and alias properties.
I suppose the proper course of action is to consider adding the ability to export a schema's Avro Resolution Canonical Form.
Is the solution as simple as rewriting the following Java class in Go? https://github.com/apache/avro/blob/2bbb99602e9e925058ead86fc8ac4e27055b05d6/lang/java/avro/src/main/java/org/apache/avro/SchemaNormalization.java#L35-L45
I made use of that in this project to manually generate the parsing canonical form for the Avro schemas I need.
I could write a Go implementation myself by copying from the Java source, but it'd be ugly.
Oh, and here are a few sample schemas and their fingerprints: https://github.com/apache/avro/blob/17f2d75132021fafeca29edbdcade40df960fdc9/share/test/data/schema-tests.txt
The Avro specification describes how to generate schema fingerprints. Is this functionality supported at all in the goavro package? Can you recommend the best way to generate a schema fingerprint?