hoophq / sequence

Immutable, scalable, and easy to use ledger service.
Apache License 2.0
493 stars 31 forks source link

Fix unpredictable md5 sum #14

Closed andriosrobert closed 3 years ago

andriosrobert commented 4 years ago
didiercrunch commented 4 years ago

I doubt that will work in the general case.

Computing cryptographic hash code of general JSON object is super hard; it is a job for a cryptographer. The task is hard because you need to make sure of two things

  1. each different JSON object must produce a different hash code. In a none cryptographic implementation, usually {"a": "b"} and {"b": "a"} produces the same output; that would be something bad for cryptographic hash.
  2. each representation of the same JSON object must produce the same outcome. {"a": "b"} and {"a": "b"} must produce the same output. This is why you cannot rely on a "regular" JSON library like clojure.core.json.. they are not made for that.

I believe you have two options. I would vote strongly for the second option.

  1. find a library that does cryptographic hashing of json objects.
  2. hash your data as typed tuple. If you want to hash the map {:foo 900 :bar "sss" :toto 34.33} you should hash tuple [900, "sss", 34.33] knowing the types are [Long, utf-8 String, Double]. Your hashing library supports all these types.
andriosrobert commented 4 years ago

Thanks for the detailed explanation!

Not sure if you checked it, but I completely removed the JSON serialization from the process. I'm sticking to the EDN format using a data structure that ensures the keys will be always ordered.

didiercrunch commented 4 years ago

I have looked into your change but it will not change the underlying issue. You use serialization as canonical representation of data structure; that will result into bugs in the general case.

The bellow example still shows the problem..

(let [m {:foo 300 :toto 600}
      mm {:foo m}
      t (assoc (assoc {} :toto 600) :foo 300)
      tt {:foo t}]
  (println mm (map->md5 mm)) ;; {:foo {:foo 300, :toto 600}} d8a50edd5aeb65be51deb87eefc48f93

  (println tt (map->md5 tt))) ;; {:foo {:foo 300, :toto 600}} 52a197cb51d65d93a047df1b7fa994aa

Actually, the bellow code worked with the json implementation but not the prn-str.

user=> (map->md5 {:foo '(1)})
"d53b5f3fd7e4d2d7d6ce5795f308ef7e"
user=> (map->md5 {:foo [1]})
"4dafdc7833269033d683da9aeff61ed9"

I still believe you should only hash things data structure you know exactly the format.

lnmunhoz commented 4 years ago

Any updates on this? @andriosr @Jumanjii

andriosrobert commented 3 years ago

Hey @didiercrunch - taking the idea of the tuple and trying to make it easier for others to make this check, I decided to merge the fields in a specific order using a String.

Any thoughts?

andriosrobert commented 3 years ago

Merging this as the test cases are covered and we are using a single primitive type (String), which removes the problem of organizing a data structure in different ways when hashing