taoensso / nippy

The fastest serialization library for Clojure
https://www.taoensso.com/nippy
Eclipse Public License 1.0
1.04k stars 60 forks source link

checksum change for serialised output between 3.2.0 and 3.3.0 #163

Closed lowecg closed 5 months ago

lowecg commented 5 months ago

My application relies on nippy serialised output to compute a checksum for change detection.

The output format has changed between 3.2.0 and 3.3.0, altering checksums, even for elementary data.

;; commons-codec/commons-codec   {:mvn/version "1.16.0"}

(defn md5-hash ^String [^bytes ba]
  (org.apache.commons.codec.digest.DigestUtils/md5Hex ba))

(deftest nippy-checksum-test
  (testing "nippy checksum"
    ;; determine if changes to nippy will affect the checksum
    (is (= (->> {"Simple" "Map"}
                (nippy/fast-freeze)
                (md5-hash))
           "52c7cda365abac00d0ca5e8312c83d1e")))) ;; based on nippy 3.2.0

With nippy 3.3.0

Assertion failed:
nippy checksum

Expected :52c7cda365abac00d0ca5e8312c83d1e
Actual   :dfa93717e4e6a4603c6f2aa877474d55

I appreciate that you need to evolve the format from time to time. It would be good to add a nippy test to detect such changes, and at least serialised output changes could be reported in the release notes to understand the upgrade implications better. I suppose I'm asking if you could consider output format changes a breaking change.

I'm torn between reaping the benefits of improvements you're bringing to Nippy and the stability of the output format. Yes, please, to better speed and efficiency! However, in my case, checksum recalcs are expensive, so I'd have to cast my vote towards stability on the output format between minor release versions.

The checksum between 3.3.0 -> 3.4.0-beta1 seems stable. Are there any plans that would affect that? For example, could I move to 3.3.0 now and take the re-calc hit, knowing that when 3.4.0 is released, the output format will be stable?

ptaoussanis commented 5 months ago

@lowecg Hi there, thanks for pinging about this - and for the clear info! 👍

There were 2 changes in v3.3 likely to affect your case: here and here.

Nippy hasn't generally promised stability of its output format, but probably has been pretty stable in practice - and should continue to be.

The above changes were a bit of an exception since I was looking at a particular use case where even such micro-optimizations meant a difference of gigabytes.

Apologies for the trouble!

I've just updated the v3.3.0 changelog to include a warning:

Due to micro-optimizations of some elementary types, Nippy v3.3 may produce different serialized output to earlier versions of Nippy. Most users won't care about this, but you could be affected if you depend on specific serialized byte values (for example by comparing serialized output between different versions of Nippy).

--

It would be good to add a nippy test to detect such changes, and at least serialised output changes could be reported in the release notes to understand the upgrade implications better.

Absolutely agreed, this is a great idea 👍

PR very welcome if you feel like adding such a test, otherwise I'll take care of it myself when preparing the next v3.4 pre-release.

I believe a test should be straightforward - we can just capture + store a hash for each value in here: https://github.com/taoensso/nippy/blob/79f0212431d3e387ff39a48a6549318ae11167f2/src/taoensso/nippy.clj#L2025

We can then monitor the stored hashes to detect any unexpected changes. I'll try minimize such changes in future, and will mention them in release notes when they do occur (batched, and for good reasons).

The checksum between 3.3.0 -> 3.4.0-beta1 seems stable. Are there any plans that would affect that?

No other changes are specifically planned, but if you're not in a hurry - I'd recommend maybe holding off till the next pre-release (ETA ~Mar). Now that this topic's on my radar, I'll want to properly review it when preparing the next pre-release. That'd be the ideal time to start including the above-mentioned test.

lowecg commented 5 months ago

Thank you, Peter. That's all very reasonable.

I'll create a PR for the test - my first thoughts were also to take a checksum based on the stress test. That'll serve nicely as a checksum canary.

ptaoussanis commented 5 months ago

👍 It's possible that you may need to dissoc a few more types from nippy/stress-data-comparable. Most of the elementary types should be binary stable, but I'm not sure offhand about about types using Java's Serializable, etc.

ptaoussanis commented 5 months ago

Tracking of serialized output will be included in v3.4.0-beta2, thanks again for bringing this to my attention! 🙏

lowecg commented 5 months ago

Thank you for your feedback and support on this, Peter. Very much appreciated!