
srijs / rusha

High-performance pure-javascript SHA1 implementation suitable for large binary data, reaching up to half the native speed.

https://npmjs.org/rusha

MIT License

277 stars, 32 forks

Unicode issue #10

Closed msmuenchen closed 8 years ago

msmuenchen commented 10 years ago

Hi,

I'm having problems using rusha to compare the hash of a string in JavaScript with the hash of the same string computed in PHP.

In JavaScript, I use

var sha=new Rusha(); sha.digest("\u00e4")
"7e5c0f7aba32cf3e22fd30c4513a21e6d1c3aeff"

and in PHP (once using a literal ä, once a json-decode'd ä to rule out a bug in PHP or my file encoding)

$c1="ä";
$c2=json_decode("\"\\u00e4\"");
echo "1: -$c1- 2: -$c2-\n";
echo json_encode($c1)."\n";
echo json_encode($c2)."\n";

echo sha1($c1)."\n";
echo sha1($c2)."\n";

which gives me the output

1: -ä- 2: -ä-
"\u00e4"
"\u00e4"
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f
961fa22f61a56e19f3f5f8867901ac8cf5e6d11f

Why are the SHA1 hashes different? After all, using the \u00e4 notation should result in the same byte sequence in both a PHP string and a JavaScript string, right?

msmuenchen commented 10 years ago

Found the reason: JS strings are stored as UTF-16 code units, while PHP treats the string as a multi-byte UTF-8 sequence. The fix is easy with the library at http://www.onicos.com/staff/iz/amuse/javascript/expert/utf.txt; I described the usage in http://stackoverflow.com/questions/19835609/differing-sha1-hashes-for-identical-values-on-the-server-and-the-client/21341088#21341088, where someone had a similar issue.

Might it be worth incorporating this conversion into the digest function?
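
The mismatch described above can be made visible by listing the values each side would feed into SHA-1. This is a minimal sketch, assuming a runtime with a global TextEncoder (current browsers and recent Node.js); the toHex helper is a hypothetical name introduced only for illustration and is not part of rusha.

```javascript
var s = "\u00e4"; // "ä"

// One value per UTF-16 code unit: what a byte-per-character API reads.
var charCodes = [];
for (var i = 0; i < s.length; i++) {
  charCodes.push(s.charCodeAt(i));
}

// The UTF-8 encoding of the same string: what PHP's sha1() sees
// when the string is stored as UTF-8.
var utf8Bytes = Array.from(new TextEncoder().encode(s));

// Hypothetical helper: format byte values as two-digit hex.
function toHex(bytes) {
  return bytes.map(function (b) {
    return ("0" + b.toString(16)).slice(-2);
  }).join(" ");
}

console.log(toHex(charCodes)); // e4
console.log(toHex(utf8Bytes)); // c3 a4
```

One input is a single value and the other is two bytes, so the two SHA-1 digests cannot match.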

srijs commented 10 years ago

Yes, it might be worth adding an encoding parameter to the digest method, which would be evaluated in the conversion function.

Would you like to make the change and submit a PR?

msmuenchen commented 10 years ago

I'm not that deep into JS, can you please do it?

srijs commented 10 years ago

I'm a bit short on time at the moment, but I'll see if I can get around to it sometime next week.

Anyway, thanks for pointing that out!

stuartpb commented 10 years ago

I'll submit a patch that runs unescape(encodeURIComponent(str)) on the string before interpreting it (this converts the string to its equivalent UTF-8 character codes).
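
For reference, here is a small sketch of what that expression does. The function name toUtf8BinaryString is hypothetical; only the unescape(encodeURIComponent(str)) expression itself comes from this thread.

```javascript
// Hypothetical helper name; wraps the expression proposed above.
function toUtf8BinaryString(str) {
  // encodeURIComponent percent-encodes the UTF-8 bytes of str ("ä" -> "%C3%A4");
  // unescape then turns each %XX into one character whose char code is XX.
  return unescape(encodeURIComponent(str));
}

var bin = toUtf8BinaryString("\u00e4");
console.log(bin.length);                     // 2
console.log(bin.charCodeAt(0).toString(16)); // c3
console.log(bin.charCodeAt(1).toString(16)); // a4
```

Two caveats: unescape is a deprecated legacy function, although it is still widely available, and encodeURIComponent throws a URIError if the string contains a lone surrogate.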

stuartpb commented 10 years ago

Where exactly would I insert that? https://github.com/srijs/rusha/blob/master/rusha.js#L164 looks like a good candidate.

srijs commented 10 years ago

Hi.

Please modify rusha.sweet.js. A good candidate would be the rawDigest method. It could take an optional options parameter, where you can opt-in to the unescape(encodeURIComponent(str)) conversion.
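
A rough sketch of what such an opt-in could look like from the caller's side. Everything here is hypothetical: digestWithOptions and the utf8 flag are invented for illustration and are not Rusha's API, and recordingHasher is a stand-in object that reports the char codes it receives instead of hashing them.

```javascript
// Hypothetical wrapper, NOT part of Rusha. `hasher` is any object with a
// digest(binaryString) method; `options.utf8` is an invented opt-in flag.
function digestWithOptions(hasher, str, options) {
  var input = str;
  if (options && options.utf8) {
    // Opt-in: re-encode the string so each char code is one UTF-8 byte.
    input = unescape(encodeURIComponent(str));
  }
  return hasher.digest(input);
}

// Stand-in hasher: returns the char codes it was given, as hex.
var recordingHasher = {
  digest: function (binaryString) {
    var codes = [];
    for (var i = 0; i < binaryString.length; i++) {
      codes.push(binaryString.charCodeAt(i).toString(16));
    }
    return codes.join(" ");
  }
};

console.log(digestWithOptions(recordingHasher, "\u00e4"));                 // e4
console.log(digestWithOptions(recordingHasher, "\u00e4", { utf8: true })); // c3 a4
```

Keeping the conversion opt-in leaves existing callers that already pass binary strings unaffected.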

sergeevabc commented 10 years ago

var r = new Rusha(); alert(r.digest("любовь"));

af48c12732ffdbd4299b792c2b6da6f77a0898d7 expected (works with jsSHA, CryptoJS, JSHash)
09c65cdd36ba4e6d767cde9acc71dfa75380655c rusha :(

Could you be so kind as to fix the UTF-8 issue at last?

szydan commented 8 years ago

@sergeevabc in case you still need it - from the documentation (readme) "Create a hex digest from a binary String. A binary string is expected to only contain characters whose charCode < 256"

So the library will not work on arbitrary strings. The workaround I found for your case is to first convert your string to an array of UTF-8 bytes and then pass that array to rusha. See the code below:

function toUTF8Array(str) {
    var utf8 = [];
    for (var i=0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >>18),
                      0x80 | ((charcode>>12) & 0x3f),
                      0x80 | ((charcode>>6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}

var r = new Rusha();
var s = "любовь";
var a = toUTF8Array(s);
console.log(r.digest(a)); // gives the correct SHA-1: af48c12732ffdbd4299b792c2b6da6f77a0898d7
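
In runtimes that provide a global TextEncoder (an assumption; it postdates much of this thread), the same byte array can probably be produced without a hand-written encoder. A minimal sketch, where utf8ByteArray is a hypothetical helper name:

```javascript
// Hypothetical helper: UTF-8 bytes of a string as a plain array,
// the same shape that toUTF8Array above returns.
function utf8ByteArray(str) {
  return Array.from(new TextEncoder().encode(str));
}

var bytes = utf8ByteArray("любовь");
console.log(bytes.length); // 12 (six Cyrillic letters, two bytes each)

// A character outside the BMP (a surrogate pair in JS) becomes four bytes.
var astral = utf8ByteArray("\ud83d\ude00"); // U+1F600
console.log(astral.length); // 4
```

One difference in behavior: TextEncoder replaces a lone surrogate with U+FFFD, whereas the hand-written function always consumes the following code unit as if it were the second half of a pair.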

sergeevabc commented 8 years ago

Thanks for your input, @szydan. At that time I chose Fast SHA256.

srijs commented 8 years ago

Closing this as wontfix -- Rusha is not meant to be used directly on encoded strings with code-points above 255. If you want to hash strings like these, please be sure to convert them into the desired binary encoding beforehand.
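
A caller can check this precondition before hashing. The sketch below uses a hypothetical helper, isBinaryString, which is not part of Rusha; it only tests the "charCode < 256" rule quoted from the readme above.

```javascript
// Hypothetical guard: true if every char code fits in one byte.
function isBinaryString(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 255) return false;
  }
  return true;
}

console.log(isBinaryString("любовь")); // false: must be converted first
console.log(isBinaryString(unescape(encodeURIComponent("любовь")))); // true
console.log(isBinaryString("\u00e4")); // true, see note below
```

Note that "\u00e4" passes the check even though it has not been converted to UTF-8, so the guard catches out-of-range input but cannot tell which byte encoding the caller intended.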