kevinko / rabin

Go Rabin Fingerprinting (SSE2)
BSD 3-Clause "New" or "Revised" License
16 stars 6 forks source link

Generating a 64-bit Rabin Fingerprint (as recommended in the Avro * spec) of a byte string #2

Open astrekalova opened 8 years ago

astrekalova commented 8 years ago

I need to generate a Rabin fingerprint for an avro schema so that it can be decoded by a Java application. I used the method RabinFingerprintFixed for generating a fingerprint of a JSON string converted to bytes. However the resulting value is different than the one generated by a method in the avro Java library. Here is the Java code used for generating a fingerprint. Unfortunately I am not so fit in understanding how the algorithm works as to understand if the Go library and the Java library are doing the same thing. Perhaps I am using the wrong method? I would really appreciate your help because I am pretty stuck!

public static long fingerprint64(byte[] data) { long result = EMPTY64; for (byte b: data) result = (result >>> 8) ^ FP64.FP_TABLE[(int)(result ^ b) & 0xff]; return result; } /* An inner class ensures that FP_TABLE initialized only when needed. */ private static class FP64 { private static final long[] FP_TABLE = new long[256]; static { for (int i = 0; i < 256; i++) { long fp = i; for (int j = 0; j < 8; j++) { long mask = -(fp & 1L); fp = (fp >>> 1) ^ (EMPTY64 & mask); } FP_TABLE[i] = fp; } } }

And this is how I generate a fingerprint: file, err := ioutil.ReadFile("ads_raw_bid_stream_request.avsc") var fingerprint = rabin.RabinFingerprintFixed(file)

kevinko commented 8 years ago

Hi,

The typical way would be to use the hash.Hash64 (or RollingHash if you need rolling windows) interface in either rabin.go or rabin64.go. RabinFingerprintFixed performs things the long-way using polynomial division and was just used for comparing results. However, results should be identical. What are you using for EMPTY64 in your java code? Results will depend on the chosen irreducible polynomial.

On Sun, Oct 25, 2015 at 4:24 PM, Alexandra notifications@github.com wrote:

I need to generate a Rabin fingerprint for an avro schema so that it can be decoded by a Java application. I used the method RabinFingerprintFixed for generating a fingerprint of a JSON string converted to bytes. However the resulting value is different than the one generated by a method in the avro Java library. Here is the Java code used for generating a fingerprint. Unfortunately I am not so fit in understanding how the algorithm works as to understand if the Go library and the Java library are doing the same thing. Perhaps I am using the wrong method? I would really appreciate your help because I am pretty stuck!

public static long fingerprint64(byte[] data) { long result = EMPTY64; for (byte b: data) result = (result >>> 8) ^ FP64.FP_TABLE[(int)(result ^ b) & 0xff]; return result; } /* An inner class ensures that FP_TABLE initialized only when needed. */ private static class FP64 { private static final long[] FP_TABLE = new long[256]; static { for (int i = 0; i < 256; i++) { long fp = i; for (int j = 0; j < 8; j++) { long mask = -(fp & 1L); fp = (fp >>> 1) ^ (EMPTY64 & mask); } FP_TABLE[i] = fp; } } }

— Reply to this email directly or view it on GitHub https://github.com/kevinko/rabin/issues/2.

astrekalova commented 8 years ago

final static long EMPTY64 = 0xc15d213aa4d7a795L;

I'm not really sure if the algorithm in java uses the rolling windows or not. As far as I can see, I can not provide an argument for hash.Hash64?

kevinko commented 8 years ago

Ah. That would be why. The polynomial used in the package is 0x59cd8807ac3e4017, which is defined in rabin.go. I probably should parameterize that. I'll try to do that later this week.

If you're just fingerprinting an entire block data, which is what your java sample is doing, you won't need rolling window support. Rolling windows don't affect the computed fingerprint. They just let you determine fingerprints for a sliding window of fixed size across a larger block of data with O(1) cost per update.

On Mon, Oct 26, 2015 at 8:35 AM, Alexandra notifications@github.com wrote:

final static long EMPTY64 = 0xc15d213aa4d7a795L;

I'm not really sure if the algorithm in java uses the rolling windows or not. As far as I can see, I can not provide an argument for hash.Hash64?

— Reply to this email directly or view it on GitHub https://github.com/kevinko/rabin/issues/2#issuecomment-151176867.

astrekalova commented 8 years ago

Ok now I understand why the function produces a different result. So will it be possible for you to provide a new function for generating a fingerprint for a given polynomial and a []byte? Or is it enough to substitute the used polynomial in RabinFingerprintFixed to get the same result as in the java program?

kevinko commented 8 years ago

I'll try to make it possible for you to specify your own polynomial. If you'd like to experiment in the meantime, you can try modifying the constant in rabin.go. Note that it's still possible that the Java version might use a slightly different algorithm since the avro page says something about doing something out of spec with the example. I'll look at it more carefully later when I have some free time.

On Tuesday, October 27, 2015, Alexandra notifications@github.com wrote:

Ok now I understand why the function produces a different result. So will it be possible for you to provide a new function for generating a fingerprint for a given polynomial and a []byte? Or is it enough to substitute the used polynomial in RabinFingerprintFixed to get the same result as in the java program?

— Reply to this email directly or view it on GitHub https://github.com/kevinko/rabin/issues/2#issuecomment-151562445.