jsxgraph / jsxgraph

JSXGraph is a cross-browser library for interactive geometry, function plotting, charting, and data visualization in a web browser.
http://jsxgraph.org

JXG.Util.utf8Decode does not handle characters outside of Unicode BMP #50

Closed jussimalinen closed 11 years ago

jussimalinen commented 11 years ago

If I try to decode characters outside the Basic Multilingual Plane, the decoding function fails.

For example: JXG.Util.utf8Decode('\xf0\x9f\x99\x8a')

This should be one unicode character, but it decodes to two.

There could be many ways to fix this, but what we have done is to work around it with this very interesting solution: http://ecmanaut.blogspot.ca/2006/07/encoding-decoding-utf8-in-javascript.html

so: decodeURIComponent(escape('\xf0\x9f\x99\x8a'))

It is a hack that relies on a deprecated function, but it works for Unicode characters in the astral planes, and since both functions are implemented in native code, it is very fast.
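
For illustration, a minimal sketch of that workaround (the wrapper name utf8DecodeViaURI is hypothetical, not part of JSXGraph):

```javascript
// escape() turns each byte of the byte string into a %XX sequence, and
// decodeURIComponent() then parses that sequence as percent-encoded UTF-8.
// escape() is deprecated, but both functions run as native code.
function utf8DecodeViaURI(bytes) {
  return decodeURIComponent(escape(bytes));
}

// '\xf0\x9f\x99\x8a' is the UTF-8 encoding of U+1F64A, which lies outside
// the Basic Multilingual Plane; it decodes to a single character (stored as
// a surrogate pair, so .length is 2 in UTF-16 terms).
console.log(utf8DecodeViaURI('\xf0\x9f\x99\x8a'));
```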

migerh commented 11 years ago

Thanks for pointing this out. It's implemented and used (if available) now.

jussimalinen commented 11 years ago

Thanks! A word of warning though: I ran some tests on different browsers, and for all the modern browsers I tried, the decodeURIComponent(escape()) method is much faster, but on IE 8 it is much, much slower.

If you don't care too much about IE 8 (or older), there is no problem.

Here are the time measurements in milliseconds for decoding a block of Chinese characters. "New" is the decodeURIComponent(escape()) method and "old" is the JXG.Util.utf8Decode function.

OSX:
  Safari 6.0.5    new: 16    old: 78
  Firefox 23.0.1  new: 1     old: 6
  Chrome          new: 9     old: 33

Windows:
  Firefox 20      new: 8     old: 5
  IE 8            new: 6512  old: 265
  IE 9            new: 4     old: 100
  IE 10           new: 3     old: 46

migerh commented 11 years ago

Wow, thanks a lot. I'll reopen this until I can verify and maybe fix this.

Utils.UTF8.encode/decode are only used for file imports, and only once per file, so speed is not that important, but 6 seconds is too much.

pekkaklarck commented 11 years ago

Would this be a good place for browser sniffing? The old approach could be used with IE < 9 and the new one otherwise. Based on the following SO question, detecting IE by version is pretty easy. I especially like the approach in the second answer.

http://stackoverflow.com/questions/8310432/sniffing-ie8-only
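
A sketch of what that dispatch could look like (hypothetical names: ieVersion stands in for JSXGraph's detected IE version, and oldDecode/newDecode for the two implementations discussed above):

```javascript
// Pick a decoder implementation based on the detected IE major version.
// IE < 9 is dramatically slower with the escape()-based method, so it
// keeps the old routine; everything else gets the new one.
function pickDecoder(ieVersion, oldDecode, newDecode) {
  return (ieVersion && ieVersion < 9) ? oldDecode : newDecode;
}

// Usage sketch: non-IE browsers report no version and get the new decoder.
var decode = pickDecoder(undefined, function () { return 'old'; },
                                    function () { return 'new'; });
console.log(decode());
```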

pekkaklarck commented 11 years ago

@migerh Notice that these utils are also used outside JSXGraph. For example, in Robot Framework project http://robotframework.org we use base64 and UTF-8 decode and unzip utilities in our logs and reports since RF 2.6: https://code.google.com/p/robotframework/wiki/ReleaseNotes26#Acknowledgements

migerh commented 11 years ago

@pekkaklarck JSXGraph already determines the IE version and stores the major version number in JXG.ieVersion. This and all the other browser sniffing/feature detection code can be found in src/utils/env.js. I totally forgot about JSXCompressor, you're absolutely right. We should get it right then ;)

I wrote a quick benchmark which can be found here. The results are:

Arch Linux

Chrome 29.0.1547.76
  Decode#Old x 515,147 ops/sec ±0.61% (94 runs sampled)
  Decode#New x 1,423,676 ops/sec ±2.15% (88 runs sampled)

Firefox 24.0
  Decode#Old x 817,134 ops/sec ±9.59% (80 runs sampled)
  Decode#New x 1,241,154 ops/sec ±14.60% (68 runs sampled)

Opera 12.15 (Build 1745)
  Decode#Old x 363,345 ops/sec ±1.51% (52 runs sampled)
  Decode#New x 399,889 ops/sec ±0.42% (63 runs sampled)

Windows XP

IE 8.0.6001.18702
  Decode#Old x 20,584 ops/sec ±1.07% (13 runs sampled)
  Decode#New x 268,504 ops/sec ±0.74% (14 runs sampled)

Windows 7

IE 9.0.8112.16421
  Decode#Old x 313,861 ops/sec ±4.62% (33 runs sampled)
  Decode#New x 571,325 ops/sec ±3.19% (37 runs sampled)

Decode#New was reported as the fastest method every time.

@jussimalinen Can you try this test with your browser(s)?

jussimalinen commented 11 years ago

I tried this quickly and got the same results as you did. The new method was fastest also on IE8.

I will try again, modifying your test script to use several thousand Chinese characters (like my test), and see if the results change... I hope to have time for that later tonight. Right now I'm tied up all week, all day, in workshops (and related evening activities... :P)

jussimalinen commented 11 years ago

Okay, I tried creating a longer output in Chinese. It seems that on IE8 the new method gets much slower as the size of the text increases.

I modified your test script for my tests (I just count the milliseconds for a single decoding, because with your benchmark approach IE would decide that the script is unresponsive and stop executing it for longer texts).

Here is my test script https://gist.github.com/jussimalinen/6705787

With this test I get 400 ms on IE8 for the new method and 20 ms for the old method... For other browsers, both methods are very fast, but which one wins changes from browser to browser (and also varies when I change the size of the output).

So the bottom line seems to be that if the output is very big, at some point the new decodeURIComponent(escape()) method becomes unbearably slow on IE8, but for that to happen it must be several thousand characters?

migerh commented 11 years ago

Using your input string generator in my benchmark page, Decode#Old is roughly two times faster than Decode#New in every browser I tested (see my post above for the list of browsers, minus the IEs). The test in IE8 shows that Old is about 20 times faster than New for your string.

Now we have function Old, which is faster than function New for very long strings but fails to properly decode certain characters, and function New, which probably decodes all UTF8 characters properly but gets slower on all browsers the longer the input string gets. Also, function New throws an exception if it encounters characters that are not allowed in a URI, which at least notifies the user when decoding that string doesn't make sense.

I was curious when the speed would start to degrade, so I started by reducing the number of repetitions of shortStr in your benchmark by 50%. It quickly came down to 2 repetitions and Old was still faster. So I started from the other end, using a single character (the one from your original post) and repeating it. Here are the results, showing the fastest function per repetition count, all tested in Chrome 29 only:

1: New
2: New
3: Old
4: New (almost equal to Old)
5: New (almost equal)
6-10: Old (starting almost equal, with the gap growing quickly)

I think the best course of action would be to fix our old function and re-evaluate the fixed version against the new decoding routine. Maybe there's even a small library out there we could use. Finding something is hard because every other URL seems to point to that blog post you mentioned in your first post.

jussimalinen commented 11 years ago

Yeah, I agree, it would seem to be best to fix the old function.

I think the speed degradation might be related to the fact that escape() explodes the size of the string when it is full of characters that actually get escaped. For very large strings where every character has to be escaped, the size increase is considerable. When I run the example with ASCII letters only (which escape() leaves untouched), the new method always seems to be much faster on modern browsers (I tried Chrome and Firefox).
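
A quick sketch of that size explosion (assuming, as suggested above, that the intermediate string is the cause):

```javascript
// escape() leaves ASCII letters untouched but expands every byte outside
// its safe set into a three-character %XX sequence, so a string of pure
// non-ASCII UTF-8 bytes triples in length before decodeURIComponent()
// ever sees it.
var ascii = 'hello';
var utf8Bytes = '\xe4\xb8\xad';        // the three UTF-8 bytes of U+4E2D
console.log(escape(ascii).length);     // 5  (unchanged)
console.log(escape(utf8Bytes).length); // 9  (3 bytes -> 9 characters)
```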

pekkaklarck commented 11 years ago

I agree that fixing the old function is likely the best solution.

I found an interesting and apparently pretty efficient UTF-8 decoding algorithm implemented in C at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.

I was able to port this algorithm to Python and implementing it in JavaScript ought to be rather easy too. You can see my Python version at https://gist.github.com/pekkaklarck/6711623

What do you think?

pekkaklarck commented 11 years ago

Here's a JavaScript implementation of the same UTF-8 algorithm mentioned above: https://gist.github.com/pekkaklarck/6712255

I tested with Jussi's simple performance test and got similar results as with the old algorithm.

Note that because String.fromCharCode does not support characters outside the BMP, the implementation uses String.fromCodePoint from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode. fromCodePoint could obviously be implemented as a stand-alone helper method too.

migerh commented 11 years ago

I translated that C99 code to javascript, too. Results look good: your initial test character is properly decoded, as is the test string in your benchmark.

Here's the code inside my benchmark: https://gist.github.com/migerh/6712140

You can use it to test it against different characters and browsers.

pekkaklarck commented 11 years ago

@migerh Great that the benchmark results are promising!

Unfortunately the original issue of not handling chars outside the BMP remains if String.fromCharCode is used with codepoints > 0xFFFF. In my gist I used a custom String.fromCodePoint function, but it seems to slow down the algorithm. With fromCharCode I get slightly better results than with the old algorithm, but with fromCodePoint it takes twice that time.

Using fromCodePoint only when cp > 0xFFFF and using fromCharCode otherwise gets the performance very close to the old algo and still fixes the original issue. Instead of adding custom String.fromCodePoint, it might be a better idea to handle codepoints > 0xFFFF in the decode algorithm itself.
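
A sketch of that last idea (the helper name codePointToString is hypothetical): values above 0xFFFF are split into a UTF-16 surrogate pair before String.fromCharCode is called, so no custom fromCodePoint is needed.

```javascript
// Convert a single Unicode codepoint to a JavaScript string, handling
// astral-plane values by emitting a surrogate pair directly.
function codePointToString(cp) {
  if (cp > 0xffff) {
    // Split the astral codepoint into a high and a low surrogate.
    cp -= 0x10000;
    return String.fromCharCode(0xd800 + (cp >> 10), 0xdc00 + (cp & 0x3ff));
  }
  return String.fromCharCode(cp);
}

console.log(codePointToString(0x1f64a)); // the character from the original report
console.log(codePointToString(0x61));    // 'a'
```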

migerh commented 11 years ago

Unfortunately the original issue of not handling chars outside the BMP remains if String.fromCharCode is used with codepoints > 0xFFFF.

I implemented fromCodePoint directly into the decode() function ;) See https://gist.github.com/migerh/6712140#file-benchmark-utf8-html-L70

I tested this with your example character '\xf0\x9f\x99\x8a' - newerDecode() yields the same result as newDecode().

//edit: If I deactivate the codep > 0xffff branch, the algorithm gets significantly faster, just like yours does if you use fromCharCode instead of fromCodePoint for all codepoint values. But as you already stated, this would reintroduce the original problem with characters outside the BMP.
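
For readers without access to the gists, here is a condensed sketch of such a port of Björn Höhrmann's DFA decoder, with the surrogate-pair emission inlined as discussed (a sketch under the assumption that the tables are transcribed correctly from his page; the function name utf8Decode is illustrative, not the gist code verbatim):

```javascript
// Map each byte 0x00-0xFF to its character class (built programmatically
// instead of listing all 256 entries).
var UTF8_CLASS = (function () {
  var t = [], i;
  for (i = 0; i < 0x100; i++) t[i] = 8;          // default: invalid byte
  for (i = 0x00; i < 0x80; i++) t[i] = 0;        // ASCII
  for (i = 0x80; i < 0x90; i++) t[i] = 1;        // continuation bytes
  for (i = 0x90; i < 0xa0; i++) t[i] = 9;
  for (i = 0xa0; i < 0xc0; i++) t[i] = 7;
  for (i = 0xc2; i < 0xe0; i++) t[i] = 2;        // 2-byte leads
  for (i = 0xe1; i < 0xf0; i++) t[i] = 3;        // 3-byte leads
  t[0xe0] = 10; t[0xed] = 4;                     // 3-byte special cases
  t[0xf0] = 11; t[0xf1] = t[0xf2] = t[0xf3] = 6; // 4-byte leads
  t[0xf4] = 5;
  return t;
})();

// Transition table: state + class -> next state (0 = accept, 12 = reject).
var UTF8_STATE = [
   0, 12, 24, 36, 60, 96, 84, 12, 12, 12, 48, 72,
  12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
  12,  0, 12, 12, 12, 12, 12,  0, 12,  0, 12, 12,
  12, 24, 12, 12, 12, 12, 12, 24, 12, 24, 12, 12,
  12, 12, 12, 12, 12, 12, 12, 24, 12, 12, 12, 12,
  12, 24, 12, 12, 12, 12, 12, 12, 12, 24, 12, 12,
  12, 12, 12, 12, 12, 12, 12, 36, 12, 36, 12, 12,
  12, 36, 12, 12, 12, 12, 12, 36, 12, 36, 12, 12,
  12, 36, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12
];

function utf8Decode(bytes) {
  var state = 0, codep = 0, chars = [], i, b, type;
  for (i = 0; i < bytes.length; i++) {
    b = bytes.charCodeAt(i);
    type = UTF8_CLASS[b];
    // In state 0 the class masks off the lead byte's payload bits;
    // otherwise shift in 6 more bits from the continuation byte.
    codep = state !== 0 ? (b & 0x3f) | (codep << 6) : (0xff >> type) & b;
    state = UTF8_STATE[state + type];
    if (state === 0) {
      if (codep > 0xffff) {
        // Outside the BMP: emit a UTF-16 surrogate pair inline.
        chars.push(0xd800 + ((codep - 0x10000) >> 10), 0xdc00 + (codep & 0x3ff));
      } else {
        chars.push(codep);
      }
    } else if (state === 12) {
      throw new Error('Invalid UTF-8 byte sequence at index ' + i);
    }
  }
  return String.fromCharCode.apply(null, chars);
}

console.log(utf8Decode('\xf0\x9f\x99\x8a')); // the U+1F64A character from the original report
```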

pekkaklarck commented 11 years ago

@migerh True, I had missed that if (codep > 0xffff).

With your benchmark gist I now get the following results w/ Chrome on Linux:

Decode#Old x 88,315 ops/sec ±0.70% (96 runs sampled)
Decode#New x 42,856 ops/sec ±0.50% (96 runs sampled)
Decode#Newer x 73,237 ops/sec ±0.58% (95 runs sampled)

Pretty close to the original. Should test with other inputs too, though. Unfortunately I don't have time for that myself just now.

migerh commented 11 years ago

Should test with other inputs too, though. Unfortunately I don't have time for that myself just now.

Take your time. I will implement the translated code from Björn Höhrmann into JSXGraph for our next release, which we intend to publish this weekend. If you find more bugs or performance issues we'll gladly fix them and release a new version of JSXCompressor (which is mostly independent from JSXGraph itself).

I'll close this bug for now, if you encounter any issues please comment and I'll reopen it.

pekkaklarck commented 11 years ago

Great that you are going to add this new algorithm to JSXGraph! The current performance penalty compared to the original is so small that at least I'm willing to live with it.

pekkaklarck commented 11 years ago

@migerh Initializing codep to zero inside the loop in revision 576efbe seems wrong to me.

migerh commented 11 years ago

@pekkaklarck Thanks to your remark I got an idea to further improve the performance of UTF8.decode(): 88c9b44. This changeset basically doubles the performance of the new decode(), at least in Chrome and Firefox. It shouldn't break anything, but please feel free to fire all the characters on it you can find.

pekkaklarck commented 11 years ago

@migerh Great enhancement! I tested with Firefox and Chrome on Linux and IE8/XP and IE10/Win7. With all of them, this last change improved the performance considerably. The first two are now even faster than the original algorithm, and the latter two are close enough.

Now that we have both fixed the original problem and enhanced the performance with modern browsers, I think it's time to leave this issue. The only change I would consider is renaming the string variable to charcodes. That would match its current content and would also avoid using string and String in the same function.

migerh commented 11 years ago

Yeah, string and byte were unfortunate choices and char was never used. I removed char and refactored string to chars and byte to charCode.

Thanks a lot to you both @pekkaklarck and @jussimalinen for your help in finding and fixing this issue.

pekkaklarck commented 11 years ago

Can I download a fixed version of jsxcompressor.min.js somewhere? The download available at http://jsxgraph.uni-bayreuth.de/wp/download/ seems to contain the old version and you apparently don't have the minimized version in version control. I obviously can also generate the file myself if there are instructions available.

migerh commented 11 years ago

Based on the timestamp and the size, the jsxcompressor.js in the jsxcompressor.zip download is the current minimized version, based on JSXGraph 0.98. If you want to be sure, download this version Alfred built just after the release of JSXGraph 0.98. The jsxcompressor.min.js in the zip file seems to be an outdated version; I don't know how it got in there, because there is no jsxcompressor.min.js in our repository.

pekkaklarck commented 11 years ago

The zip file at http://jsxgraph.uni-bayreuth.de/distrib/jsxcompressor.zip contains both jsxcompressor.min.js and jsxcompressor.js. I would prefer to use the former because it has the copyright notice, but it seems to be from April 2012. The latter is new, but for some reason calling JXG.Util.utf8Decode fails with a TypeError. This file also has a very strange-looking window.JXG=r("../build/compressor.deps.js") at the end.

pekkaklarck commented 11 years ago

The error I got is this:

    TypeError: Object #<Object> has no method 'utf8Decode'

migerh commented 11 years ago

With JSXGraph 0.97 there were some major changes to the code that also affected JSXCompressor. The most notable of these changes is the conversion to AMD modules which explains the window.JXG = r(...) line at the end. We use AMD to resolve dependencies and build the different JSXGraph based projects.

The other change was moving the encoding functions into a separate namespace JXG.Util.UTF8, i.e. instead of JXG.Util.utf8Decode() you'll have to call JXG.Util.UTF8.decode() now. We could add a wrapper so you can still use JXG.Util.utf8Decode() or you could update your code to use JXG.Util.UTF8.decode().

I'm currently working on fixing our Makefile and uploading a proper jsxcompressor.zip.

pekkaklarck commented 11 years ago

Thanks, JXG.Util.UTF8.decode() works great! We currently use JXG like below. Do you see other cases where we are using deprecated or otherwise unoptimal API?

    function extract(text) {
        var decoded = JXG.Util.Base64.decodeAsArray(text);
        var extracted = (new JXG.Util.Unzip(decoded)).unzip()[0][0];
        return JXG.Util.UTF8.decode(extracted);
    }

What changes will the new "proper" jsxcompressor.zip contain? Should I wait for it until I commit the updated jsxcompressor.js to our version control?

migerh commented 11 years ago

Do you see other cases where we are using deprecated or otherwise unoptimal API?

Looks fine.

What changes will the new "proper" jsxcompressor.zip contain? Should I wait for it until I commit the updated jsxcompressor.js to our version control?

I removed jsxgraphcore.js and jsxcompressor.js, updated jsxcompressor.min.js and added jsxgraphcore.min.js. Technically, jsxcompressor.min.js is a snapshot of the current master branch but I checked and the only changeset that involves JSXCompressor is 5691a9f06 and this is completely irrelevant after the optimization step.

I also added a new file COPYING which holds the copyright notice which is added to jsxcompressor.min.js after the optimization.

jsxcompressor.zip was just updated.

pekkaklarck commented 11 years ago

I noticed you now also have jsxcompressor.min.js at https://github.com/jsxgraph/jsxgraph/blob/master/JSXCompressor/jsxcompressor.min.js. Apparently I could use that instead of the version in the zip? Anyway, good that there now are copyrights (although the year is 2008-2012).

Looking at the original code at https://github.com/jsxgraph/jsxgraph/blob/master/src/utils/encoding.js, I noticed that the encode method still uses the unescape(encodeURIComponent(string)) trick. I would assume it's as slow as its decode counterpart and thus should perhaps be removed. The old UTF-8 encode algorithm doesn't seem to handle characters outside the BMP, though. This isn't important to us since we only need decode, but fixing that, or at least adding a note that only the BMP is supported, might be a good idea.

mkorpela commented 11 years ago

Hi,

Our team has encountered a problem with the new JXG.Util.UTF8.decode: a "RangeError: Maximum call stack size exceeded" occurs with large enough strings. The actual limit is browser specific, and the exception occurs inside String.fromCharCode.apply(null, chars);

Example code to test this (with Safari 6.0.5):

    x = [];
    while (x.push(97) < 67000) {}
    String.fromCharCode.apply(null, x);

The same limit in Chrome is somewhere below 150000.

With Firefox the limit is a much larger number (500000). There the message is "RangeError: array passed to ..."

migerh commented 11 years ago

Thanks for reporting and fixing this issue. I just merged it. We will release an update of JSXCompressor soon.
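
For reference, the usual shape of such a fix, sketched here (the helper name is illustrative and the chunk size is taken from the discussion below; see the merged patch for the actual code):

```javascript
// Converting char codes to a string in bounded slices keeps the argument
// list of String.fromCharCode.apply() below engine-specific limits.
function charCodesToString(chars) {
  var CHUNK = 10000, parts = [], i;
  for (i = 0; i < chars.length; i += CHUNK) {
    parts.push(String.fromCharCode.apply(null, chars.slice(i, i + CHUNK)));
  }
  return parts.join('');
}

// The failing case from above now works regardless of browser limits:
var x = [];
while (x.push(97) < 67000) {}
console.log(charCodesToString(x).length); // 67000
```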

mkorpela commented 11 years ago

Just a small performance bug there :D .. if (j % 10000) is almost always true.
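
To illustrate the bug with a standalone sketch (not the actual patch code): j % 10000 is truthy for every j that is not an exact multiple of 10000, so a guard like if (j % 10000) fires on almost every iteration; the intended check needs === 0.

```javascript
var almostAlways = 0, intended = 0;
for (var j = 1; j <= 30000; j++) {
  if (j % 10000) almostAlways++;     // truthy for 29997 of 30000 values
  if (j % 10000 === 0) intended++;   // true only for j = 10000, 20000, 30000
}
console.log(almostAlways, intended); // 29997 3
```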

migerh commented 11 years ago

Oh, right... fixed it.

mkorpela commented 11 years ago

@migerh Thanks for the quick response!