replikativ / hasch

Cross-platform (JVM and JS atm.) edn data structure hashing for Clojure.
Eclipse Public License 1.0
111 stars 13 forks source link

Hashing Emojis in strings is not consistent #11

Open Erdromian opened 4 years ago

Erdromian commented 4 years ago

Running into an issue where strings don't hash the same cross-platform when emojis are present. I assume this extends to more Unicode as well.

(is (= "1a934e63-1467-50af-b644-a35343cb16b9" (str (hasch/uuid "👍 👍 👍")))) ; in Clojure (is (= "28fe1faf-09a0-5def-a1e6-78326e03882b" (str (hasch/uuid "👍 👍 👍")))) ; in ClojureScript

whilo commented 4 years ago

Ouch. Thanks for reporting this. We have only tested some unicode symbols here: https://github.com/replikativ/hasch/blob/master/test/hasch/test.cljc#L35 It seems this was not sufficient. Do you have some experience with unicode?

Erdromian commented 4 years ago

Unfortunately not. An internal user of mine just started using emoticons. I use hasch to validate things cross-platform, so I was considering disabling this specific validation for now.

But since you are actively maintaining this, I wouldn't mind looking into unicode and making a pull request in the next few days if you want.

whilo commented 4 years ago

That would be very good. Let me know when you get stuck and I will try to help. This bug is bad.

Erdromian commented 4 years ago

What is your setup for running and testing this project? So far, I have been getting stuck with the error Unable to resolve var: cemerick.piggieback/wrap-cljs-repl in this context on startup.

In the meantime, I started taking snippits of the code to test against.

Getting the byte values, things are consistent between the two so far. However, the 'utf8' step in (defn- ^bytes str->utf8 [x] (-> x str (.getBytes "UTF-8"))) vs (defn- str->utf8 [x] (-> x str utf8)) Is Returning differently cross-platform.

(defn bytes-to-int ([bytes] (bytes-to-int bytes 0)) ([bytes offset] (reduce + 0 (map (fn [i] (let [shift (* (- 4 1 i) 8)] (bit-shift-left (bit-and (nth bytes (+ i offset)) 0x000000FF) shift))) (range 0 4)))))

#?(:cljs (is (= -308232723 (bytes-to-int (utf8 emoji)))) :clj (is (= 4036989325 (bytes-to-int (.getBytes emoji "UTF-8")))))

Maybe my bytes-to-int function is off, I am just using it for seeing the results ATM. I will try some different ways to visualize the bytes to narrow it down. Then maybe I will look into the .getBytes Java implementation

Erdromian commented 4 years ago

Alright... I think the problem is that cljs utf8 coerces numbers to positive automatically, whereas (.getBytes emoji "UTF-8") has negative numbers in it. As soon as I get the project to start, I am going to add some more unicode tests, including emojis, and play with some changes.

I think either the .cljs version of 'utf8' needs to go into a signed int array instead of unsigned, or we need to coerce the results of (.getBytes s "UTF-8") to positive numbers using something like #(if (neg? %) (+ % 256) %) to make the results consistent.

whilo commented 4 years ago

Hey, sorry for taking so long, I had to work through some other issues first. I have updated the develop branch to properly compile and run tests with up to date dependencies, either with lein test for Clojure or lein cljsbuild test. The latter needs to be also run by the former...

Could you open a separate branch against develop and update me on your experiments? I think this issue is related to https://lambdaisland.com/blog/2017-06-12-clojure-gotchas-surrogate-pairs. Unfortunately I did not expect this inconsistency when I wrote the test for Unicode, I should have been more careful.

whilo commented 4 years ago

@Erdromian Did that fix your issue with the piggieback middleware?

Erdromian commented 4 years ago

No, I basically just stopped using it in my project. As for the fix, I ran out of time and had to move onto my main work for now. I did make a branch and added some more tests to it. However, I couldn't figure out how to make things consistent.

whilo commented 4 years ago

Ok, no problem. Thanks for reporting back. How did you make things consistent in your project? Did you manage to make JavaScript's and Java's UTF8 representation match?

mikekap commented 4 years ago

Just ran into this - I think the JS algorithm isn't entirely correct and doesn't handle UTF-16 surrogate pairs in JS. I was able to get around this by using the google closure fn that seems to do the right thing: goog.crypt.stringToUtf8ByteArray.

whilo commented 4 years ago

This is great news! Thanks so much for coming back here. I can incorporate it as soon as possible, but if you feel like opening a PR and become a contributor then I am also happy to review that.