Closed jussimalinen closed 11 years ago
Thanks for pointing this out. It's implemented and used (if available) now.
Thanks! A word of warning though: I ran some tests on different browsers, and for all the modern browsers I tried, the decodeURIComponent(escape()) method is much faster, but on IE 8 it is much, much slower.
If you don't care too much about IE 8 (or older), there is no problem.
Here are the time measurements in milliseconds for decoding a block of Chinese characters. "New" is the decodeURIComponent(escape()) method and "old" is the JXG.Util.utf8Decode function.
                  new    old
OS X
  Safari 6.0.5     16     78
  Firefox 23.0.1    1      6
  Chrome            9     33
Windows
  Firefox 20        8      5
  IE 8           6512    265
  IE 9              4    100
  IE 10             3     46
Wow, thanks a lot. I'll reopen this until I can verify and maybe fix this.
Utils.UTF8.encode/decode are only used for file imports, and only once per file, so the speed is not that important, but 6 seconds is too much.
Would this be a good place for browser sniffing? The old approach could be used with IE < 9 and the new one otherwise. Based on the following SO question, detecting IE by version is pretty easy. I especially like the approach in the second answer.
http://stackoverflow.com/questions/8310432/sniffing-ie8-only
@migerh Notice that these utils are also used outside JSXGraph. For example, in the Robot Framework project http://robotframework.org we have used the base64, UTF-8 decode, and unzip utilities in our logs and reports since RF 2.6: https://code.google.com/p/robotframework/wiki/ReleaseNotes26#Acknowledgements
@pekkaklarck JSXGraph already determines the IE version and stores the major version number in JXG.ieVersion. This and all the other browser sniffing/feature detection code can be found in src/utils/env.js. I totally forgot about JSXCompressor, you're absolutely right. We should get it right then ;)
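For illustration, such a dispatch could look like the following hypothetical sketch. It assumes JXG.ieVersion and the two decoders discussed above; it is not actual JSXGraph code:

function decodeUTF8(utftext) {
    // JXG.ieVersion holds IE's major version and is undefined elsewhere.
    // Old IEs get the hand-rolled decoder, everything else the fast
    // native decodeURIComponent(escape()) trick.
    if (JXG.ieVersion && JXG.ieVersion < 9) {
        return JXG.Util.utf8Decode(utftext);
    }
    return decodeURIComponent(escape(utftext));
}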
I wrote a quick benchmark which can be found here. The results are:
Arch Linux

Chrome 29.0.1547.76
  Decode#Old x 515,147 ops/sec ±0.61% (94 runs sampled)
  Decode#New x 1,423,676 ops/sec ±2.15% (88 runs sampled)

Firefox 24.0
  Decode#Old x 817,134 ops/sec ±9.59% (80 runs sampled)
  Decode#New x 1,241,154 ops/sec ±14.60% (68 runs sampled)

Opera 12.15 (Build 1745)
  Decode#Old x 363,345 ops/sec ±1.51% (52 runs sampled)
  Decode#New x 399,889 ops/sec ±0.42% (63 runs sampled)

Windows XP

IE 8.0.6001.18702
  Decode#Old x 20,584 ops/sec ±1.07% (13 runs sampled)
  Decode#New x 268,504 ops/sec ±0.74% (14 runs sampled)

Windows 7

IE 9.0.8112.16421
  Decode#Old x 313,861 ops/sec ±4.62% (33 runs sampled)
  Decode#New x 571,325 ops/sec ±3.19% (37 runs sampled)
Decode#New was reported as the fastest method every time.
@jussimalinen Can you try this test with your browser(s)?
I tried this quickly and got the same results as you did. The new method was the fastest also on IE8.
I will try again, modifying your test script to use several thousand Chinese characters (like in my test), and see if the results change... I hope to have time to do that later tonight. Right now I'm tied up all week, all day in workshops (and related evening activities... :P)
Okay, I tried creating a longer output in Chinese. It seems that on IE8 the new method gets much slower as the size of the text increases.
I modified your test script for my tests. (I just count the milliseconds for one decoding, because IE would decide that the script is unresponsive and stop executing when using your benchmark approach with longer texts.)
Here is my test script https://gist.github.com/jussimalinen/6705787
With this test I get 400 ms for the new method and 20 ms for the old method on IE8... For other browsers, both methods are very fast, but which one wins changes from browser to browser (and also varies when I change the size of the output).
So the bottom line seems to be that if the output is very big, at some point this new decodeURIComponent(escape()) method becomes unbearably slow on IE8, but for that to happen the text must be several thousand characters long?
Using your input string generator in my benchmark page, Decode#Old is roughly two times faster than Decode#New in every browser I tested (see the post above for the browsers, apart from the IEs). The test in IE8 shows that Old is about 20 times faster than New for your string.
Now we have function Old, which is faster than function New for very long strings but fails to properly decode certain characters, and function New, which probably decodes all UTF-8 characters properly but gets slower in all browsers as the input string grows. Also, function New throws an exception if it encounters characters that are not allowed in a URI, which at least notifies the user when decoding the string doesn't make sense.
I was curious when the speed would start to degrade, so I began by reducing the number of repetitions of shortStr in your benchmark by 50%. It quickly came down to 2 repetitions and Old was still faster. So I started from the other end, using a single character (the one from your original post) and repeating it. Here are the results with the faster function, all tested in Chrome 29 only:
1 repetition: New
2: New
3: Old
4: New (almost equal to Old)
5: New (almost equal)
6-10: Old (starting almost equal, with the gap growing quickly)
I think the best course of action would be to fix our old function and re-evaluate the fixed version against the new decoding routine. Maybe there's even a small library out there we could use. Finding something is hard because every other URL seems to point to that blog post you mentioned in your first post.
Yeah, I agree, it would seem to be best to fix the old function.
I think the slowdown might be related to the fact that escape explodes the size of the string when it is full of characters that actually need escaping. For very large strings where every character has to be escaped, the size increase is considerable. When I run the example with plain ASCII letters (where escape does nothing), the new method always seems to be much faster in modern browsers (I tried Chrome and Firefox).
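To illustrate the blow-up (the three byte values below are the UTF-8 encoding of U+4E2D, 中):

var oneChar = '\xe4\xb8\xad';                      // UTF-8 bytes of U+4E2D
// Every byte >= 0x80 escapes to three characters ('%XX'), so a fully
// multi-byte input triples in size before decodeURIComponent even runs:
console.log(escape(oneChar));                      // "%E4%B8%AD" (3 chars -> 9)
console.log(decodeURIComponent(escape(oneChar)));  // "中"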
I agree that fixing the old function is likely the best solution.
I found an interesting and apparently pretty efficient UTF-8 decoding algorithm implemented in C at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/.
I was able to port this algorithm to Python and implementing it in JavaScript ought to be rather easy too. You can see my Python version at https://gist.github.com/pekkaklarck/6711623
What do you think?
Here's a JavaScript implementation of the same UTF-8 algorithm mentioned above: https://gist.github.com/pekkaklarck/6712255
I tested with Jussi's simple performance test and got similar results as with the old algorithm.
Note that because String.fromCharCode does not handle characters outside the BMP, the implementation uses String.fromCodePoint, based on https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCharCode. fromCodePoint could obviously be implemented as a stand-alone helper method too.
I translated that C99 code to JavaScript, too. The results look good: your initial test character is properly decoded, as is the test string in your benchmark.
Here's the code inside my benchmark: https://gist.github.com/migerh/6712140
You can use it to test it against different characters and browsers.
@migerh Great that the benchmark results are promising!
Unfortunately the original issue of not handling chars outside the BMP remains if String.fromCharCode is used with codepoints > 0xFFFF. In my gist I used a custom String.fromCodePoint function, but it seems to slow down the algorithm. With fromCharCode I get slightly better results than with the old algorithm, but with fromCodePoint it takes twice that time.
Using fromCodePoint only when cp > 0xFFFF and fromCharCode otherwise gets the performance very close to the old algorithm and still fixes the original issue. Instead of adding a custom String.fromCodePoint, it might be a better idea to handle codepoints > 0xFFFF in the decode algorithm itself, as sketched below.
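A minimal sketch of that idea, assuming the decode loop collects UTF-16 code units in an array (the helper name pushCodePoint is made up for illustration):

function pushCodePoint(chars, codep) {
    if (codep > 0xffff) {
        // Split supplementary code points into a UTF-16 surrogate pair
        // so String.fromCharCode can handle them later.
        codep -= 0x10000;
        chars.push(0xd800 + (codep >> 10),    // high surrogate
                   0xdc00 + (codep & 0x3ff)); // low surrogate
    } else {
        chars.push(codep);
    }
}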
> Unfortunately the original issue of not handling chars outside the BMP remains if String.fromCharCode is used with codepoints > 0xFFFF.
I implemented fromCodePoint directly into the decode() function ;) See https://gist.github.com/migerh/6712140#file-benchmark-utf8-html-L70
I tested this with your example character '\xf0\x9f\x99\x8a':
- newerDecode() yields the same result as newDecode()
//edit: If I deactivate the codep > 0xffff branch, the algorithm gets significantly faster, just like yours does when using fromCharCode instead of fromCodePoint for all codepoint values. But as you already stated, this would reintroduce the original problem with characters outside the BMP.
@migerh True, I had missed that if (codep > 0xffff).
With your benchmark gist I now get the following results with Chrome on Linux:

Decode#Old x 88,315 ops/sec ±0.70% (96 runs sampled)
Decode#New x 42,856 ops/sec ±0.50% (96 runs sampled)
Decode#Newer x 73,237 ops/sec ±0.58% (95 runs sampled)
Pretty close to the original. Should test with other inputs too, though. Unfortunately I don't have time for that myself just now.
> Should test with other inputs too, though. Unfortunately I don't have time for that myself just now.
Take your time. I will implement the translated code from Björn Höhrmann into JSXGraph for our next release, which we intend to publish this weekend. If you find more bugs or performance issues we'll gladly fix them and release a new version of JSXCompressor (which is mostly independent of JSXGraph itself).
I'll close this bug for now, if you encounter any issues please comment and I'll reopen it.
Great that you are going to add this new algorithm to JSXGraph! The current performance penalty compared to the original is so small that at least I'm willing to live with it.
@migerh Initializing codep to zero inside the loop in revision 576efbe seems wrong to me.
@pekkaklarck Thanks to your remark I got an idea to further improve the performance of UTF8.decode(): 88c9b44. This changeset basically doubles the performance of the new decode(), at least in Chrome and Firefox. It shouldn't break anything, but please feel free to fire all the characters at it you can find.
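The changeset itself isn't quoted in this thread, but a later comment below shows that the new decode() builds its result with String.fromCharCode.apply(null, chars). In sketch form, the idea is to collect UTF-16 code units in an array and convert them in one call rather than concatenating a string character by character:

function codeUnitsToString(chars) {
    // One native call over the whole array of UTF-16 code units instead
    // of thousands of single-character string concatenations.
    return String.fromCharCode.apply(null, chars);
}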
@migerh Great enhancement! I tested with Firefox and Chrome on Linux, and IE8/XP and IE10/Win7. With all of them, this last change improved the performance considerably. The first two are now even faster than the original algorithm, and the latter two are close enough.
Now that we have both fixed the original problem and enhanced the performance with modern browsers, I think it's time to leave this issue. The only change I would consider is renaming the string variable to charcodes. That would match its current content and would also avoid using string and String in the same function.
Yeah, string and byte were unfortunate choices and char was never used. I removed char and refactored string to chars and byte to charCode.
Thanks a lot to you both @pekkaklarck and @jussimalinen for your help in finding and fixing this issue.
Can I download a fixed version of jsxcompressor.min.js somewhere? The download available at http://jsxgraph.uni-bayreuth.de/wp/download/ seems to contain the old version and you apparently don't have the minimized version in version control. I obviously can also generate the file myself if there are instructions available.
Based on the timestamp and the size, the jsxcompressor.js in the jsxcompressor.zip download is the current minimized version based on JSXGraph 0.98. If you want to be sure, download this version Alfred built just after the release of JSXGraph 0.98. The jsxcompressor.min.js in the zip file seems to be an outdated version; I don't know how it got in there because there is no jsxcompressor.min.js in our repository.
The zip file at http://jsxgraph.uni-bayreuth.de/distrib/jsxcompressor.zip contains both jsxcompressor.min.js and jsxcompressor.js. I would prefer to use the former because it has the copyright notice, but it seems to be from April 2012. The latter is new, but for some reason using JXG.Util.utf8Decode fails with a TypeError. This file also has a very strange-looking window.JXG=r("../build/compressor.deps.js") at the end.
The error I got is this:
TypeError: Object #<Object> has no method 'utf8Decode'
With JSXGraph 0.97 there were some major changes to the code that also affected JSXCompressor. The most notable of these changes is the conversion to AMD modules, which explains the window.JXG = r(...) line at the end. We use AMD to resolve dependencies and build the different JSXGraph-based projects.
The other change was moving the encoding functions into a separate namespace JXG.Util.UTF8, i.e. instead of JXG.Util.utf8Decode() you'll have to call JXG.Util.UTF8.decode() now. We could add a wrapper (see the sketch below) so you can still use JXG.Util.utf8Decode(), or you could update your code to use JXG.Util.UTF8.decode().
I'm currently working on fixing our Makefile and uploading a proper jsxcompressor.zip.
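Such a wrapper could be as small as the following sketch (a hypothetical shim, not actual JSXGraph code):

JXG.Util.utf8Decode = function (utftext) {
    // Delegate the old entry point to the new namespace.
    return JXG.Util.UTF8.decode(utftext);
};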
Thanks, JXG.Util.UTF8.decode() works great! We currently use JXG as shown below. Do you see other cases where we are using deprecated or otherwise suboptimal API?
function extract(text) {
    // Base64 text -> byte array
    var decoded = JXG.Util.Base64.decodeAsArray(text);
    // Inflate the zip and take the content of the first file
    var extracted = (new JXG.Util.Unzip(decoded)).unzip()[0][0];
    // Raw UTF-8 bytes -> JavaScript string
    return JXG.Util.UTF8.decode(extracted);
}
What changes will the new "proper" jsxcompressor.zip contain? Should I wait for it before committing the updated jsxcompressor.js to our version control?
> Do you see other cases where we are using deprecated or otherwise suboptimal API?
Looks fine.
> What changes will the new "proper" jsxcompressor.zip contain? Should I wait for it before committing the updated jsxcompressor.js to our version control?
I removed jsxgraphcore.js and jsxcompressor.js, updated jsxcompressor.min.js and added jsxgraphcore.min.js. Technically, jsxcompressor.min.js is a snapshot of the current master branch but I checked and the only changeset that involves JSXCompressor is 5691a9f06 and this is completely irrelevant after the optimization step.
I also added a new file, COPYING, which holds the copyright notice that is added to jsxcompressor.min.js after the optimization.
jsxcompressor.zip was just updated.
I noticed you now also have jsxcompressor.min.js at https://github.com/jsxgraph/jsxgraph/blob/master/JSXCompressor/jsxcompressor.min.js. Apparently I could use that instead of the version in the zip? Anyway, good that there now are copyrights (although the year is 2008-2012).
Looking at the original code at https://github.com/jsxgraph/jsxgraph/blob/master/src/utils/encoding.js I noticed that the encode method still uses the unescape(encodeURIComponent(string)) trick. I would assume it's as slow as its decode counterpart and thus perhaps should be removed. The old UTF-8 encode algorithm doesn't seem to handle characters outside the BMP, though. This isn't important to us since we only need decode, but fixing that, or at least adding a note that only the BMP is supported, might be a good idea.
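For reference, a full-range UTF-8 encoder has to emit four-byte sequences for code points above 0xFFFF. A minimal sketch of encoding a single code point follows (the helper name is made up; a caller would first combine UTF-16 surrogate pairs into code points):

function encodeCodePoint(bytes, cp) {
    if (cp < 0x80) {
        bytes.push(cp);                          // 1 byte, ASCII
    } else if (cp < 0x800) {
        bytes.push(0xc0 | (cp >> 6),
                   0x80 | (cp & 0x3f));          // 2 bytes
    } else if (cp < 0x10000) {
        bytes.push(0xe0 | (cp >> 12),
                   0x80 | ((cp >> 6) & 0x3f),
                   0x80 | (cp & 0x3f));          // 3 bytes, rest of the BMP
    } else {
        bytes.push(0xf0 | (cp >> 18),
                   0x80 | ((cp >> 12) & 0x3f),
                   0x80 | ((cp >> 6) & 0x3f),
                   0x80 | (cp & 0x3f));          // 4 bytes, outside the BMP
    }
}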
Hi,
Our team has encountered a problem with the new JXG.Util.UTF8.decode: a "RangeError: Maximum call stack size exceeded" will occur with large enough strings. The actual limit is browser specific, and the exception occurs inside String.fromCharCode.apply(null, chars);

Example code to test this with Safari 6.0.5:

x = [];
while (x.push(97) < 67000) {}
String.fromCharCode.apply(null, x);

The same limit with Chrome is something less than 150,000. With Firefox the limit is a much larger number (500,000), and the error says: RangeError: array passed to ...
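One common way around such argument limits is to flush the collected code units in fixed-size chunks; here is a sketch of the idea (the actual merged fix may differ in detail):

function chunkedFromCharCode(chars) {
    var CHUNK = 10000, result = '', i;
    // apply() passes every array element as a separate argument, so cap
    // the number of arguments per call.
    for (i = 0; i < chars.length; i += CHUNK) {
        result += String.fromCharCode.apply(null, chars.slice(i, i + CHUNK));
    }
    return result;
}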
Thanks for reporting and fixing this issue. I just merged it. We will release an update of JSXCompressor soon.
Just a small performance bug there :D .. if (j % 10000) is almost always true (presumably j % 10000 === 0 was intended).
Oh, right... fixed it.
@migerh Thanks for the quick response!
If I try to decode characters outside the Basic Multilingual Plane, the decoding function fails.
For example: JXG.Util.utf8Decode('\xf0\x9f\x99\x8a')
This should be one Unicode character, but it decodes to two.
There could be many ways to fix this, but what we have done is work around it with this very interesting solution: http://ecmanaut.blogspot.ca/2006/07/encoding-decoding-utf8-in-javascript.html
so: decodeURIComponent(escape('\xf0\x9f\x99\x8a'))
It is a hack with a deprecated function, but it works for Unicode characters in the astral planes, and since the functions are implemented in native code, it is also very fast.
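For the record, the four bytes above encode U+1F64A, so a correct decoder must produce a single code point represented as one UTF-16 surrogate pair:

var decoded = decodeURIComponent(escape('\xf0\x9f\x99\x8a'));
console.log(decoded === '\ud83d\ude4a'); // true: one code point...
console.log(decoded.length);             // ...made of 2 UTF-16 code units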