micnic / uv

Ultrafast UTF-8 data validation
MIT License
79 stars 1 forks source link

This algorithm is ultra slow #1

Closed ghost closed 4 years ago

ghost commented 6 years ago

This kind of "lookup-table-change-state-repeat" algorithm is always the slowest possible algorithm for UTF-8 validation. It looks slim and "optimized" but really only serves to maximize SIMD-incompatibility since every step can take at maximum one character (your loop only handles one byte a time). Also the lookup table is so big and randomly accessed that CPU cache will be utilized horribly bad.

You would know this if you cared to benchmark other solutions.

ghost commented 6 years ago
        // Optimize for common ASCII bytes
        if (state === a && byte < 0x80) {
            index++;
            continue;
        }

This check is good as it skips the state change for most bytes, but you are still only checking 1 single byte a time. You can change this to check 4, 8, 16, 32 or whatever kind of batch length you find useful.

4 byte check can be done by reading an 32-bit integer in one go then checking & 0x80808080. Same goes with SIMD-checks of up to 16 or 32 bytes a time.

But still, this kind of lookup-table solutions is never more efficient than just implementing the logic as regular if-statements. That lookup-table is basically a data-representation of the flow of regular if-statements yet far less efficient but more "cool looking". Cool looking is rarely beneficial for performance though.

jridgewell commented 6 years ago

This kind of "lookup-table-change-state-repeat" algorithm is always the slowest possible algorithm for UTF-8 validation.

Not necessarily. Context, I rewrote V8's UTF-8 parser using one of these fancy DFA algorithms.

since every step can take at maximum one character (your loop only handles one byte a time). 4 byte check can be done by reading an 32-bit integer in one go then checking & 0x80808080. Same goes with SIMD-checks of up to 128 bytes a time.

This is valid. 4 byte lookups and comparing can significantly speed up common cases.

Also the lookup table is so big and randomly accessed that CPU cache will be utilized horribly bad.

Also valid. This table is 2048 ints long (in the best case, I don't think V8 will using a single uint8 for it), and the chances of the entire thing being stored on a CPU cache are very slim.

This'll lead to high pressure on the cache, which isn't good.

You would know this if you cared to benchmark other solutions.

Shitty comments are shitty. He did benchmark other solutions:

Constructive criticism:

  1. Try reimplementing the DFA using Bjoern Hoehrmann's optimized decoder (note, the DFA at the top of the page is not the optimized version. Search for "Notes on performance", the optimized version is just above it)
    • This is a considerably smaller DFA, with a much likelier chance of staying on the CPU cache
  2. Try using Buffer#readInt32BE to read a 32bit int and compare it to 0x80808080.
    • Be careful of boundary alignment, should be done on intervals of 4 bytes only.
    • I managed to eek out an extra 8m ops on the benchmarks doing this alone.
micnic commented 6 years ago

@alexhultman the purpose of this module is to add a better solution for UTF-8 validation than existing solutions in the node.js world and especially for pure JavaScript solutions, and in the benchmarks that I added to this repo it is faster in most cases, it is even faster than a C++ implementation in the case of https://github.com/websockets/utf-8-validate

Sure, there is space for improving and thanks for pointing what exactly can be improved, I'll continue to work on this and get better results.

micnic commented 6 years ago

@jridgewell thank you for the tips, while creating this module I was inspired by the Bjoern Hoehrmann's code, but tried to create a "simpler" solution. I tried some more tweaks, but the performance was only slightly better, I'll go deeper in the Hoehrmann's implementation and will try to get the maximum performance of a pure JavaScript solution.

jridgewell commented 6 years ago

I tried some more tweaks, but the performance was only slightly better

It's likely because the benchmarks don't test for long strings of mixed UTF-8 length. Something like 'test¢™💩' will stress the DFA a bit more, but this doesn't transition into 0xE0, 0xED, nor the 0xF1-F3 lead bytes.

Right now, there's only one mixed case (the random one), but that doesn't guarantee a valid string and you're likely exiting in the first few bytes.

ghost commented 6 years ago

For the sake of being ultra clear here:

The reason why I wrote this issue report was because I immediately recognized this algorithm as being a complete copy of Björn Höhrmann's and since I already have stepped in that pitfall I knew from the start it was slow.

But sure, keep on using this if you feel it is "ultrafast".

jridgewell commented 6 years ago

Validating the Swedish Wikipedia page on UTF-8 1 million times takes 24 seconds with this "ultrafast library" that "even beats C++".

Maybe creating a benchmark PR for this case?

"ultrafast library"

Being hostile doesn't help.

It takes 4 seconds to do the same with the algorithm I use in my projects.

Is your algorithm written in JS? Or C? I'm guessing it's if-statement based? These kinds of details would help, instead of attacking a project.

I already have stepped in that pitfall I knew from the start it was slow

Again, it depends on how optimized the DFA is. If the CPU cache is too small to handle it, it likely will be slower.

When I made the V8 PR, my benchmark data said otherwise. I'll review it to make sure it's not biased to ASCII bytes.

But sure, keep on using this if you feel it is "ultrafast".

Let's try being helpful, and not "helpful".

jridgewell commented 6 years ago

Cool dude.

To analyze his test case: https://sv.wikipedia.org/wiki/Portal:Huvudsida

Bytes ASCII Other Non-ASCII Lead Bytes Continuation Bytes
154865 148031 6834 2866 3968

Testing locally using C versions of if-statement based vs DFA:

If-Statement:      932.273389 MB/s, 0 errors
Hoehrmann:         379.220675 MB/s, 0 errors
Hoehrmann-ASCII:  1123.680631 MB/s, 0 errors

Note:

Did you ever bother to analyze your algorithm before changing it?

lpinca commented 6 years ago

There are a couple of issues with the benchmark in this repo:

  1. It only benchmark very small buffers, specifically 26 bytes long. For such small buffers, crossing the JS/C++ boundary doesn't make sense and in fact the pure JS version of utf-8-validate which can be required with require('utf-8-validate/fallback') is 2/3 time faster than its C++ counterpart.

  2. This line creates a zero filled buffer, not one with random values. If you change that line with something like Buffer.from(Array(1000).fill(0).map(() => Math.floor(Math.random() * 127))), utf-8-validate (native) is ~2x faster than uv.

This is only to say that this module is not faster than a C++ implementation, it depends on the size of the input. Also utf-8-validate is not optimized for the ASCII case. It only handles one byte at time.

micnic commented 6 years ago

This line creates a zero filled buffer, not one with random values

@lpinca oops, I somehow missed this, but in the early local benchmarks it was a truly random buffer, in any case I updated the bench.js file to load the main page of Romanian Wikipedia, in my bechmark it weights 111587 bytes, using node 10 I got the following results:

uv valid buffer x 5,591 ops/sec ±1.26% (89 runs sampled) utf-8-validate valid buffer x 4,687 ops/sec ±0.22% (95 runs sampled) isutf8 valid buffer x 4,391 ops/sec ±2.65% (89 runs sampled)

lpinca commented 6 years ago

This is what I get on my machine:

$ node -v
v10.1.0
$ node bench.js 
Loading Romanian Wikipedia main page ...
uv valid buffer x 3,529 ops/sec ±0.64% (86 runs sampled)
utf-8-validate valid buffer x 6,589 ops/sec ±0.68% (88 runs sampled)
isutf8 valid buffer x 2,579 ops/sec ±0.63% (88 runs sampled)
micnic commented 6 years ago

ok, I'll recheck my machine, I'm using nvm to switch between versions 6, 8 and 10, it might be some misconfiguration that gives such different results

lpinca commented 6 years ago

If you are switching between multiple node versions remember to reinstall/recompile utf-8-validate. It will use the fallback if the addon cannot be required.

micnic commented 6 years ago

I updated the benchmark to include utf-8-validate fallback JS code, rewritten the function using the "if statement" approach and improved quite a lot the performance, on my machine now I have the following results using node 10.1.0:

OSX

uv x 8,973 ops/sec ±2.01% (82 runs sampled) utf-8-validate (default, C++) x 9,477 ops/sec ±0.52% (91 runs sampled) utf-8-validate (fallback, JS) x 4,897 ops/sec ±0.52% (90 runs sampled) isutf8 x 4,901 ops/sec ±0.85% (88 runs sampled)

lpinca commented 6 years ago

This is what I get now

[luigi@fedora uv]$ cat /proc/cpuinfo | grep 'model name' | uniq
model name  : Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
[luigi@fedora uv]$ node -v
v10.1.0
[luigi@fedora uv]$ node bench.js 
Loading Romanian Wikipedia main page ...
uv x 6,560 ops/sec ±2.12% (90 runs sampled)
utf-8-validate (default, C++) x 14,538 ops/sec ±0.29% (91 runs sampled)
utf-8-validate (fallback, JS) x 4,014 ops/sec ±0.22% (96 runs sampled)
isutf8 x 3,892 ops/sec ±0.25% (96 runs sampled)

Also tried with https://ja.wikipedia.org/wiki/メインページ

diff --git a/bench.js b/bench.js
index 6a3bc6a..09bdf4b 100644
--- a/bench.js
+++ b/bench.js
@@ -4,14 +4,14 @@ const Benchmark = require('benchmark');

 const https = require('https');

-const uv = require('uv');
+const uv = require('.');
 const isValidUTF8 = require('utf-8-validate');
 const isValidUTF8Fallback = require('utf-8-validate/fallback');
 const isUtf8 = require('isutf8');

-https.get('https://ro.wikipedia.org/wiki/Pagina_principal%C4%83', (response) => {
+https.get('https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8', (response) => {

-       console.log('Loading Romanian Wikipedia main page ...');
+       console.log('Loading Japanese Wikipedia main page ...');

        let content = Buffer.alloc(0);

@@ -44,4 +44,4 @@ https.get('https://ro.wikipedia.org/wiki/Pagina_principal%C4%83', (response) =>
                // run async
                .run({ 'async': true });
        });
-});
\ No newline at end of file
+});
[luigi@fedora uv]$ node bench.js 
Loading Japanese Wikipedia main page ...
uv x 5,952 ops/sec ±2.04% (87 runs sampled)
utf-8-validate (default, C++) x 13,505 ops/sec ±0.33% (93 runs sampled)
utf-8-validate (fallback, JS) x 4,145 ops/sec ±0.42% (94 runs sampled)
isutf8 x 4,048 ops/sec ±0.52% (90 runs sampled)
micnic commented 6 years ago

ok, thank you for the output, will try to tweak it more and try other approaches to squeeze as much performance as possible.

Checked the benchmark on a linux machine, previous tests were using OSX

Ubuntu 16.04

uv x 4,855 ops/sec ±0.81% (86 runs sampled) utf-8-validate (default, C++) x 13,278 ops/sec ±0.91% (92 runs sampled) utf-8-validate (fallback, JS) x 2,948 ops/sec ±2.05% (86 runs sampled) isutf8 x 3,124 ops/sec ±0.72% (93 runs sampled)

lpinca commented 6 years ago

@micnic FYI, there is no check for oob access https://github.com/micnic/uv/blob/e42c94a39f82cd4e0e6795a622852160d838c166/index.js#L71 and that was super slow on older version of Node.js, not sure if it's still valid.

micnic commented 6 years ago

@lpinca yes, in versions 8 and lower it's quite slow, about 4 times slower than in version 10, but if I add the check for buffer boundaries it gets slower in node 10 :smile: will investigate this more.

ghost commented 6 years ago

@jridgewell Doesn't matter how fancy an algorithm you use or not use if you don't utilize SIMD instructions. Utf-8 validation is all about finding the first byte that & 0x80 and this can be done on modern non-1980s space invaders CPUs in batch operations. My entire point is that I do it in 4 seconds, this lib does it in 24 seconds some 6x the time. It really boils down to that, very simple.

Any fancy shmancy algorithm that process 1 byte a time is by default the worst possible solution unless you want to target Amiga computers with floppy disks. Intel AVX-512 CPUs can process 32 bytes in one go, this lib forces at maximum 1 byte.

Had I taken the time to write an AVX-512 implementation (I won't) it would validate that Swedish page 1 million times in some 1-2 seconds.

Again, it really boils down to numbers.

ghost commented 6 years ago

Also I would suggest to re-align the cursor to nearest batch-length after anything hit 0x80:

This means you have 1 fast execution path where the CPU will simply read a block of memory, AND it, repeat and fallback to utf8-checking only when needed.

I haven't tried the re-alignment thing but with 8 byte as batch length I get down 1 second to 3 seconds instead of 4, which is 8x as fast as this lib.

micnic commented 6 years ago

I updated the algorithm to read more effectively by 4 bytes and avoid reading the buffer out of its limits, it shows a big improvement in node 8 and node 10. Thanks everyone for critics and support, it really helped me find the flows of the code that I wrote and get some better results.

jridgewell commented 6 years ago

Utf-8 validation is all about finding the first byte that & 0x80 and this can be done ... in batch operations

The DFA doesn't prevent batch operations, though. Comparing a batch-optimized if-statement (or SIMD) validator to a non-batch-optimized DFA will make the optimized version win, hands down.

bool isValidUtf8(unsigned char *str, size_t length) {
  for (int i = 0; i < (int) length; i++) {
    if (state == ACCEPT) {
      // Check if (uint32 & 0x80808080) == 0
        // If so, += 4 and continue
      // Fall back to (uint8 & 0x80) == 0
        // If so, += 1 and continue
    }

    // Expecting continuation bytes
    // Use DFA
  }
}

I updated the algorithm to read more effectively by 4 bytes

Reading 4 bytes sequentially is not the same as reading a 4/8/16/32 bytes block at the same time. As @alexhultman said, aligning the pointer reading a block at a time will significantly speed up the algorithm.

Unfortunately, node doesn't expose anything more than a 48bit int read (6 byte block), so we'll never be able to touch native C speed. But, we might be able to beat the overhead of calling into C from JS.

lpinca commented 6 years ago

Reading 4 bytes sequentially is not the same as reading a 4/8/16/32 bytes block at the same time.

This.

Anyway the optimization is based on the assumption that input is made mostly by 1 byte characters, like the English language. That is not always true.

ghost commented 6 years ago

Anyway the optimization is based on the assumption that input is made mostly by 1 byte characters, like the English language. That is not always true.

Picking a batch length does not havew to be static - 1 would be ideal for asian languages while 32 would be ideal for pure ascii. 4 or 8 seems to be good for European languages. If you want to you can dynamically swap execution paths at runtime based on how wrong or right your assumption about the language is.

Failing a 32 byte batch 3 times in a row should lower the batch size and the opposite should increase it (or whatever strategy)

lpinca commented 6 years ago

If you want to you can dynamically swap execution paths at runtime based on how wrong or right your assumption about the language is.

Fair enough but I think it's a bit impractical when you are dealing with multiple languages at the same time. Anyway as you said, an heuristic can be used to change the batch length dynamically.

lpinca commented 6 years ago

FWIW I've just stumbled upon this https://github.com/lemire/fastvalidate-utf-8, it looks good.

ghost commented 6 years ago

Yes my point was to have a heuristic to dynamically change the batch length. It would work for 1 string holding many languages too as it only would change based on current "miss/hit ratio" as it goes along the string.

fastvalidate-utf-8 is really slow, I tested it some days ago and it does 8 seconds where what I use does 3 seconds. It seems he messed up quite a bit in his implementation.

ghost commented 6 years ago

I get 22 seconds with 1.0.5 so it is still really really slow

micnic commented 6 years ago

@alexhultman added 1.0.6 with bytes being read as uint8 - uint32 values, it shows some improved performance

ghost commented 6 years ago

I get 21 seconds with 1.0.6 so you still haven't really fixed the problem. It is still really, really slow.

ghost commented 6 years ago

@jridgewell I honestly don't understand where you got the impression Björn's code was fast? I actually get worse performance with his "optimized" 2010 version. The original one performed better in my tests.

Anyways, I made a quick SSE2 SIMD implementation around Björn's code so to limit the amount of bytes passed through his code. This gets it down to 3 seconds which is not better than what I already have. I get 3.5 seconds with his "optimized" version.

What is interesting to know is that the SIMD base is able to find and isolate the UTF-8 escapes in only 0.6 seconds and this is only using SSE2 (16 bytes a time). So the vast majority of the time is lost in Björn's code and is massively subject to speed-ups.

inline bool isValidUtf8(const unsigned char *src, int length) {
    uint32_t codepoint;
    uint32_t state = 0;

    while(length >= 16) {
        int find_first;
        if (find_first = _mm_movemask_epi8(_mm_loadu_si128((__m128i *) src))) {

            int leading = __builtin_clz(find_first);
            int offset = __builtin_ctz(find_first);
            int num = (32 - leading - offset);

            for (int i = 0; i < num; i++) {
                if (UTF8_REJECT == decode(&state, &codepoint, src[offset + i])) {
                    return false;
                }
            }
        }
        length -= 16;
        src += 16;
    }
    while (length) {
        if (UTF8_REJECT == decode(&state, &codepoint, *src)) {
            return false;
        }
        length--;
        src++;
    }
    return true;
}
ghost commented 6 years ago

If I add a special case for "chunk-isolated-2-long-escapes" with a simple if-case it drops 0.4 seconds going way below 3 seconds. Point is: Björn's code and his idea is always horrible for modern CPUs (which was stated in comment 0).

micnic commented 6 years ago

added version 1.0.7 with some improvements, below are some benchmarks made in OSX and node 10.2.1

Loading https://en.wikipedia.org/wiki/Main_Page ... uv x 13,482 ops/sec ±0.30% (91 runs sampled) utf-8-validate (default, C++) x 14,245 ops/sec ±0.37% (91 runs sampled) utf-8-validate (fallback, JS) x 7,252 ops/sec ±0.29% (94 runs sampled) isutf8 x 7,242 ops/sec ±0.33% (90 runs sampled)

Loading https://ro.wikipedia.org/wiki/Pagina_principală ... uv x 7,396 ops/sec ±0.82% (91 runs sampled) utf-8-validate (default, C++) x 7,757 ops/sec ±0.36% (94 runs sampled) utf-8-validate (fallback, JS) x 4,016 ops/sec ±0.88% (93 runs sampled) isutf8 x 4,069 ops/sec ±0.32% (93 runs sampled)

Loading https://ru.wikipedia.org/wiki/Заглавная_страница ... uv x 4,808 ops/sec ±0.47% (92 runs sampled) utf-8-validate (default, C++) x 5,620 ops/sec ±0.70% (93 runs sampled) utf-8-validate (fallback, JS) x 2,904 ops/sec ±0.67% (92 runs sampled) isutf8 x 2,990 ops/sec ±0.48% (92 runs sampled)

Loading https://ar.wikipedia.org/wiki/الصفحة_الرئيسية ... uv x 4,588 ops/sec ±1.54% (84 runs sampled) utf-8-validate (default, C++) x 5,578 ops/sec ±0.93% (91 runs sampled) utf-8-validate (fallback, JS) x 2,931 ops/sec ±0.80% (90 runs sampled) isutf8 x 2,913 ops/sec ±0.90% (89 runs sampled)

Preparing 16MB of random ASCII bytes (best case) uv x 52.36 ops/sec ±5.66% (54 runs sampled) utf-8-validate (default, C++) x 57.89 ops/sec ±1.77% (60 runs sampled) utf-8-validate (fallback, JS) x 29.07 ops/sec ±1.50% (51 runs sampled) isutf8 valid buffer x 29.85 ops/sec ±2.13% (53 runs sampled)

Preparing all valid UTF-8 bytes ~4.17 MB (worst case) uv x 137 ops/sec ±0.79% (76 runs sampled) utf-8-validate (default, C++) x 308 ops/sec ±0.48% (85 runs sampled) utf-8-validate (fallback, JS) x 184 ops/sec ±0.91% (83 runs sampled) isutf8 valid buffer x 145 ops/sec ±0.48% (81 runs sampled)

ghost commented 6 years ago

I get 18 seconds with 1.0.7

girng commented 5 years ago

Thanks everyone for critics and support, it really helped me find the flows of the code that I wrote and get some better results.

"everyone"? You mean @alexhultman? lol

micnic commented 5 years ago

@girng yes, sure, he actually motivated me to improve the algorithm by showing the weak spots

girng commented 5 years ago

@micnic Alex Hultman is a gem. He's a bit wild and hot headed, but smart af. Also, pretty hilarious sometimes lol

micnic commented 5 years ago

Added version 1.0.8, completely rewritten, with improved performance, below are some benchmarks made in Windows 10 and node 12.8.0

Loading https://en.wikipedia.org/wiki/Main_Page ... uv x 28,186 ops/sec ±0.78% (93 runs sampled) utf-8-validate (default, C++) x 20,638 ops/sec ±1.37% (87 runs sampled) utf-8-validate (fallback, JS) x 13,742 ops/sec ±0.69% (97 runs sampled) isutf8 x 13,712 ops/sec ±0.63% (98 runs sampled)

Loading https://ro.wikipedia.org/wiki/Pagina_principală ... uv x 9,135 ops/sec ±1.13% (93 runs sampled) utf-8-validate (default, C++) x 7,835 ops/sec ±0.75% (91 runs sampled) utf-8-validate (fallback, JS) x 5,516 ops/sec ±0.23% (95 runs sampled) isutf8 x 5,408 ops/sec ±0.83% (96 runs sampled)

Loading https://ru.wikipedia.org/wiki/Заглавная_страница ... uv x 10,195 ops/sec ±0.82% (91 runs sampled) utf-8-validate (default, C++) x 10,335 ops/sec ±1.02% (90 runs sampled) utf-8-validate (fallback, JS) x 6,648 ops/sec ±0.84% (93 runs sampled) isutf8 x 6,668 ops/sec ±1.03% (91 runs sampled)

Loading https://ar.wikipedia.org/wiki/الصفحة_الرئيسية ... uv x 7,742 ops/sec ±0.79% (96 runs sampled) utf-8-validate (default, C++) x 8,387 ops/sec ±0.91% (92 runs sampled) utf-8-validate (fallback, JS) x 5,434 ops/sec ±0.61% (95 runs sampled) isutf8 x 5,394 ops/sec ±1.02% (95 runs sampled)

Preparing 256B of random ASCII data uv x 7,479,924 ops/sec ±1.19% (91 runs sampled) utf-8-validate (default, C++) x 4,174,850 ops/sec ±1.10% (93 runs sampled) utf-8-validate (fallback, JS) x 3,698,783 ops/sec ±0.95% (93 runs sampled) isutf8 x 3,650,841 ops/sec ±1.35% (91 runs sampled)

Preparing 1KB of random ASCII data uv x 2,011,509 ops/sec ±0.85% (96 runs sampled) utf-8-validate (default, C++) x 1,228,480 ops/sec ±0.92% (92 runs sampled) utf-8-validate (fallback, JS) x 971,280 ops/sec ±0.55% (97 runs sampled) isutf8 x 974,207 ops/sec ±0.24% (94 runs sampled)

Preparing 64KB of random ASCII data uv x 33,390 ops/sec ±0.67% (93 runs sampled) utf-8-validate (default, C++) x 23,773 ops/sec ±1.34% (86 runs sampled) utf-8-validate (fallback, JS) x 15,602 ops/sec ±1.28% (95 runs sampled) isutf8 x 15,734 ops/sec ±0.82% (96 runs sampled)

Preparing 1MB of random ASCII data uv x 2,104 ops/sec ±0.69% (97 runs sampled) utf-8-validate (default, C++) x 1,786 ops/sec ±1.31% (89 runs sampled) utf-8-validate (fallback, JS) x 984 ops/sec ±0.68% (94 runs sampled) isutf8 x 979 ops/sec ±1.03% (95 runs sampled)

Preparing 4MB of random ASCII bytes uv x 525 ops/sec ±0.79% (93 runs sampled) utf-8-validate (default, C++) x 465 ops/sec ±1.23% (88 runs sampled) utf-8-validate (fallback, JS) x 248 ops/sec ±0.40% (87 runs sampled) isutf8 x 247 ops/sec ±0.64% (87 runs sampled)

Preparing all valid UTF-8 bytes ~4.17 MB uv x 140 ops/sec ±0.06% (79 runs sampled) utf-8-validate (default, C++) x 404 ops/sec ±0.37% (91 runs sampled) utf-8-validate (fallback, JS) x 250 ops/sec ±0.18% (90 runs sampled) isutf8 x 191 ops/sec ±0.33% (87 runs sampled)