More sixel renderer optimizations.

ismail-yilmaz / upp-components

A collection of packages for U++ framework.

Other

36 stars 2 forks source link

More sixel renderer optimizations. #13

Open ismail-yilmaz opened 3 years ago

ismail-yilmaz commented 3 years ago

Hello @jerch,

I don't want to pollute your repos or other active discussion threads with my findings until I can come up with some practical optimization suggestions for your renderer. So I'll put this here.

I've been doing some heavy profiling on my SixelRenderer since last week. Here are some preliminary results:

I identified both some bottlenecks in my code and the performance sensitive critical parts of sixel rendering in general.
I achieved some nice progress by some simple tweaks since last week:
- I have managed to sequeze out a steady 42 MiB/s (2.90 secs) (parsing + rendering) on your sixel-bench animation, which means ~12 MiB/s gain. for me
- There is a small performance penalty associated with displaying the animation due to the gui event handling (latency) of our GUI toolkit, But my SixelViewer app (a small testing and benchmarking app) can display the full sixel-bench video on average 38.00 Mib (3.10 secs.). As you can guess, this does not translate 1:1 to final terminal performance, because other variables also effect the vte rendering performance. Still it can achieve slightly higher performance than MLTerm now.

sixel-renderer-optimized

The above improvements are all due to some really simple tricks -one of which I borrowed from you- and the examination of the produced assembly code. I am confident that we can raise the throughput bar even higher, because there is still room for optimization.

In the following days this week I'd like to share my findings (code changes/alterations) with you, which you might find interesting and, hopefully, useful for your c/wasm decoder.

jerch commented 3 years ago

@ismail-yilmaz Wow sounds promising - and ofc I am really curious about these improvements on asm level and whether they also apply to wasm to some degree or could be ported back. Its pure stack machine design makes some optimizations behave worse (even pointer arithmetic tends to run slower from my findings).

jerch commented 3 years ago

@ismail-yilmaz Minor note on what I've found worth looking into, but did not get down to it yet:

The sixel bit deconstruction with branching is actually much more expensive than always writing something somewhere. With more clever index "routing" like writing empty bits to dummy pixels I can gain ~30% speedup in put_single. This somewhat surprised me, would have thought that the memory access to dummy pixels would outweigh the branching + bit testing by far. But seems the branching creates a heavy load here.
Imho the conditional checks, if the byte at hand is a sixel (eg. in range 63 to 126), can be made much faster with clever bit shifting. Did only profile that one so far, it adds up to quite some runtime, if the sixel data contains only few compression directives.

Both ideas are somewhat branching related, and I think that is promising ground for further optimizations.

Edit: Btw my decoder function kinda invites to use computed gotos, sadly they are not yet natively supported in wasm. If you want to squeeze 10 - 20% more perf, this might be worth a look as well.

jerch commented 3 years ago

@ismail-yilmaz I implemented the "write always something" in put_single by offsetting the canvas pixels and routing empty bits to canvas[0]. Boosted the raw decoding throughput from ~85 MB/s to 95-100 MB/s for my test images (~15% boost). Still not sure, why it shows that much of a speedup, could be a cache side effect - wild theory: I have the data chunk for reading right before that in memory, maybe it gets always loaded into cache due to this "accidental" locality.

ismail-yilmaz commented 3 years ago

@jerch ,

implemented the "write always something"

I apologize for my late reply (Really busy now, and I'll comment on the things you've written, tonight - GMT+3). But I think I finally understand why you get so high results (which is very impressive by the way, congrats!) When I do the same thing, the performance of my decoder falls down to ~25 MiBs (from 42). I now know why (at least, I know what's the main culprit. I'll write the whole thing tonight (GMT + 3) and share some findings, with a suggestion on your decoder's main loop.

Best regards, İsmail

jerch commented 3 years ago

I apologize for my late reply ...

Well wouldn't call 30 min "late" :smile_cat:

Yes I am really interested in your asm findings and whether I can backport some ideas to wasm as well. Wasm itself is pretty reduced in op codes and its stack mechanics, so some things might not work out there.

When I do the same thing, the performance of my decoder falls down to ~25 MiBs (from 42).

Oh, thats weird. Note that those numbers are raw decoder speed, the whole turnover in xterm.js with the sixel-bench test is just at 36 MB/s (gets slowed down by the JS handling pre/posthand). Furthermore my test machine is a i7-2760QM with 2.40GHz and 1333 MHz (0.8 ns) double bank memory. The CPU is kinda old, but I remember seeing big differences in i3/i5 types vs. i7, esp. due their different cache settings. If we want comparable metrics across machines, maybe we should switch to cycles instead.

ismail-yilmaz commented 3 years ago

@jerch

A disclaimer: My experience with, and knowledge of, webassembly is pretty limited at the moment. (Not so with Asm/C/C++ though. )

I will be using the sixel-bench as my main reference here. And the machine I use for testing is old enough Athlon FX 6100 (six core) @3.3 GHz, 8MB, dual channel, @1866 Mhz.

My findings suggest that there are three critical sections that impact the sixel rendering performance. Let's go step by step, shall we?

1. Parameter collection

This was the first interesting part. The fact that higher color palette settings (>= 256) and sixel compression relies on a supplied parameters list makes this section a good candidate for optimization. Throughout my tests and benchmarking I came to the conclusion that for effective cpu cache utilization it would be better to collect the consecutive parameters in a single, separate loop at once (i.e via a dedicated inline function for better instruction and data cache utilization.) To this end I've created three versions of ReadParams() function, each with a slight modification and run them all 100 times on a 10 MB file that contains random and valid sixel parameters ranging from 0 to 999999.

The code of the said three function can be inspected here. (Note that I use Upp::Stream instead of std::stringstream on the actual code, which is faster and its Peek() and Get() methods are trivially inlineable.)

Here are the results (with -O3 optimization):

// TIMING     - readparams_plain    :  2.97 s  - 29.71 ms ( 2.97 s  / 100 ), min: 29.00 ms, max: 32.00 ms, nesting: 0 - 100
// TIMING     - readparams_faster   :  1.94 s  - 19.37 ms ( 1.94 s  / 100 ), min: 19.00 ms, max: 27.00 ms, nesting: 0 - 100
// TIMING     - readparams_unrolled :  1.72 s  - 17.25 ms ( 1.73 s  / 100 ), min: 17.00 ms, max: 18.00 ms, nesting: 0 - 100

// THROUGHPUT - readparams_plain    :  3.40 MiB/s
// THROUGHPUT - readparams_faster   :  5.20 MiB/s
// THROUGHPUT - readparams_unrolled :  5.80 MiB/s

While the difference between faster and unrolled version can be considered negligible, it is still there. And it also shows that conditional branching may not always be the worst path to take. Which brings me to the question: Does this also apply to wasm? Did you already consider this approach for your decoder.cpp? Because you seem to be handling the parameters on the main loop level.

A side note: A code line from decoder.cpp written in two different ways (CLANG-webassembly, -O3)

else if (code > 62 && code != 127) // original version.
        local.get       3
        i32.const       63
        i32.lt_u
        br_if           0
# %bb.32:               
        local.get       3
        i32.const       127
        i32.eq  
        br_if           0

else if (unsigned(code - 63) < 64) // My altered version
        local.get       3
        i32.const       -63
        i32.add 
        local.tee       2
        i32.const       63
        i32.gt_u
        br_if           0

Now, I don't know which one would be faster as I don't have in-depth knowledge of wasm, so please feel free to correct me at every level, but the fact is this code used for put_pixel is on a very perf-sensitive path, and the altered version seems at least shorter and involves less branching (I don't know how many cycles each instruction takes on wasm) so it may have an impact on the performance - albeit small.

To be continued...

jerch commented 3 years ago

@ismail-yilmaz Thx a bunch, I'll check these things out.

About critical sections - I kinda have a hard time to profile it in wasm, I simply dont know any tools, that can achieve it within wasm. So I went the faulty route of deactivating code paths and comparing the runtimes, which ofc carries the risk of creating bigger dead code branches, that get eliminated by the compiler. Still my findings are these in wasm:

only inner state handling (deactivated sixel, cursor, params and color processing): 150 - 200 MB/s
all minus sixel processing (bit decomposition + pixel safe): 95 - 180 MB/s
all minus params and color processing: 140 - 190 MB/s
all: 93 - 165 MB/s

Note that the big spread in the runtimes is a result of one test image with 4096 colors and random pixels. This image has almost no compression, thus very high throughput numbers. Normal images are grouped around the lower number with only small variations (< +20 MB/s).

Interpretation of the numbers:

Sixel processing and pixel save accounts for ~40% of the runtime. This is actually lower than I expected.
Params and color processing accounts for ~7 % of the runtime. No surprise here, it is within my expectations with slightly worse runtime ratio compared to data ratio, thus slower in processing than the sixel path. This could be further optimized, but the gains are unlikely to surpass 5% overall runtime.
Most runtime is spent for the inner state handling with ~50% runtime. WTH, the inner state handling only switches branches based on ps.state. Have to admit, that the branches are quite scattered/nested with that weird jump helper, still my expectations were in the 15 - 30% range, not 50%.

So I gonna try to find a simpler state model, as it promises great savings. My original JS implementation uses a DFA switching on every input byte. While this is fairly equal in JS, it actually performs worse in wasm. The current wasm state model is a direct transfer of the DFA states into switch and if-else conditions with less switching per byte but deeper nested side paths (the jump helper results from preserving the DFA states, but might not be needed everywhere). A first dig into the branching direction shows, that the switch statements might be problematic. They get compiled to br_table instructions in wasm, which dont seem to get the jump table optimization in the final machine code created by the wasm runtime (needs more investigation, did only a quick test with different amount of cases, which revealed a O(n) dependency, bummer).

Edit: Fixed the mixed numbers above.

ismail-yilmaz commented 3 years ago

@jerch ,

I will hopefully write the second and third parts this weekend, and try to clear up some points with the numbers and profiling. But let me say that I found the culprit in my decoder and changing a single thing skyrockets my decoders throughput to 61 Mib/s (on `sixel-bench) But it is costly, and not really ideal for my decoder's use cases.

I gonna try to find a simpler state model, as it promises great savings.

This is my main loop, by the way:

            for(;;) {
                int c = stream.Get();
                switch(c) {
                case 0x21:
                    GetRepeatCount();
                    break;
                case 0x22:
                    GetRasterInfo();
                    break;
                case 0x23:
                    SetPalette();
                    break;
                case 0x24:
                    Return();
                    break;
                case 0x2D:
                    LineFeed();
                    break;
                case -1: // eof or err
                    goto Finalize;
                default:
                    if(check_range(c, 0x3F, 0x7E))
                        PaintSixel(buffer, c - 0x3F);
                    break;
                }
            }

Each function is inlined and the only other two inner loops are in ReadParams() (inlined inside SetPalette, GetRepeatCount, GetRasterInfo) and PaintSixel() Also on sixel bench, ReadParams gets called 17575647 times (total), while the PaintSixel is called 69508135 times (total, and called for both compressed and single sixels), And these numbers do not even account for inner loops of these functions! That's why I think they are very sensitive.

jerch commented 3 years ago

@ismail-yilmaz 61 Mib/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, thats far beyond what I can achieve (xterm.js input processing alone is slower lol).

May I ask you to run the benchmarks of my lib? I have the feeling, that our machines are quite different in the throughput numbers, sadly I have no clue, how to measure wasm execution in terms of CPU cycles, which would be a better number for comparison.

If you are up to test my wasm decoder I can give you the compiled version, thus you wont need to install the emscripten SDK.

ismail-yilmaz commented 3 years ago

@jerch

61 Mib/s for the full deal with sequence parsing, sixel decoding and painting in the terminal? Wow, thats far beyond what I can achieve (xterm.js input processing alone is slower lol).

Unfortunately no. It is sequence parsing + sixel decoding.

If I display the full animation in the sixel viewer, I get -56 Mib/s. (gui-processing penalty)
If I display the full animation in the terminal via pty -> ~31 Mib/s (where mlterm gets 23 Mib/s)
If I dispaly the full animation in the terminal, by direclty feeding to it, -> ~40 Mib/s.

The drop has several reasons. But the most important reason is I have to stick with the gui calls and rules provided by our framework to ensure backend compatibility, as our vte is a widget, not an app. But the raw throughput (when only displaying the image is disabled) for sixel-bench in our vte is around 44 Mib/s. So it is pretty much optimized for its environment.

But this "extremely fast" rendering is not feasible for me because I need to use a static buffer with fixed length to get this. You can see where I'm getting at.. :))

It appears that 42-44 Mib/s range I've recently achieved is the limit for my decoder unless a fundamental change in my assumptions and decoder design.

If you are up to test my wasm decoder I can give you the compiled version, thus you wont need to install the emscripten SDK.

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with yout wasm, that would be very nice. :)

I will run the existing tests and compare the results with mine. (I'll share the results within next week)

I am really interested in optimizing sixel rendering now, and I really like to see xterm.js with top-notch sixel support. Besides, your decoder.cpp already gets a lot of things right. In fact I was planning to build an experimental renderer around it to see how well it fares against mine in the same environment, but I am very busy, so this will take some time...

jerch commented 3 years ago

Ah, I totally forgot to mention that I already compiled it just two days ago, but did not have time to test it, sorry. However, if you have an additional script to test sixel-bench with yout wasm, that would be very nice. :)

I can prepare a script for the sixel-bench test, sure thing.

If you pulled the whole node-sixel repo, you can run its benchmarks with:

cd node-sixel
git checkout faster_decode
# edit wasm/build.sh to point EMSCRIPTEN_PATH to your emsdk_env.sh
# then run
npm install  # will abort, just ignore it
npm run build-wasm
npm install
# benchmark
node_modules/.bin/xterm-benchmark ./lib/index.benchmark.js

Have not tested that yet on a clean checkout, so bear with me if its not working out of the box. Also my newer optimizations are not yet checked in, thus you will see slightly lower numbers.

Edit: Fixed the checkout/build commands.

ismail-yilmaz commented 3 years ago

I can prepare a script for the sixel-bench test, sure thing.

I'd be grateful, thanks!

I will benchmark it on this sunday.

jerch commented 3 years ago

Damn switch statement - with flat switch statement over byte values I get worse numbers (80 - 150 MB/s), but reshaped into sorted if-else cascades I get slightly higher numbers (100 - 180 MB/s). Lol, not a good sign, if the conditional cascade actually performs better, normally the switch statement should outperform that by far with proper jump table optimization for multiple cases, grrrrr. So the neckbreaker for the current complicated state model seems to be the extensive usage of switch.

Will see, if I can get a simpler branching model from the cascading thingy (still misses several edge case transitions).

jerch commented 3 years ago

This is the best compromise I could find with reduced state handling:

inline void maybe_color() {
  if (ps.state == ST_COLOR) {
    if (ps.p_length == 1) {
      ps.color = ps.palette[ps.params[0] % ps.palette_length];
    } else if (ps.p_length == 5) {
      if (ps.params[1] < 3
        && ps.params[1] == 1 ? ps.params[2] <= 360 : ps.params[2] <= 100
        && ps.params[3] <= 100
        && ps.params[4] <= 100) {
        switch (ps.params[1]) {
          case 2:  // RGB
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_rgb(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 1:  // HLS
            ps.palette[ps.params[0] % ps.palette_length] = ps.color = normalize_hls(
              ps.params[2], ps.params[3], ps.params[4]);
            break;
          case 0:  // illegal, only apply color switch
            ps.color = ps.palette[ps.params[0] % ps.palette_length];
        }
      }
    }
  }
}

void decode(int length) {
  if (ps.not_aborted && ps.y_offset < ps.height) {
    for (int i = 0; i < length; ++i) {
      int code = ps.chunk[i] & 0x7F;
      if (62 < code && code < 127) {
        switch (ps.state) {
          case ST_COMPRESSION:
            put(code - 63, ps.color, ps.params[0]);
            ps.cursor += ps.params[0];
            ps.state = ST_DATA;
            break;
          case ST_COLOR:
            maybe_color();
            ps.state = ST_DATA;
          default:
            put_single(code - 63, ps.color);
            ps.cursor++;
        }
      } else if (47 < code && code < 58) {
        params_add_digit(code - 48);
      } else
      switch (code) {
        case 59:
          params_add_param();
          break;
        case 33:
          maybe_color();
          params_reset();
          ps.state = ST_COMPRESSION;
          break;
        case 35:
          maybe_color();
          params_reset();
          ps.state = ST_COLOR;
          break;
        case 36:
          ps.cursor = 0;
          break;
        case 45:
          ps.y_offset += 6;
          ps.offset = ps.y_offset * ps.width + 8;
          ps.cursor = 0;
          break;
        case 34:
          maybe_color();
          params_reset();
          ps.state = ST_ATTR;
          break;
      }
    }
  }
}

While if-cascades are slightly faster in wasm, they kinda make the code alot uglier. Furthermore I dont want to optimize for shortcomings of wasm runtimes (only tested it with nodeJS so far), if they have a hard time to opt for real jump tables, they better get that fixed imho. Throughput numbers are between 92 - 180 MB/s, binary size dropped from ~6KB to ~3.8KB.

maybe_color still needs some cleanup (just copied it over), also the reduced state handling needs extensive testing.

Well it seems I cannot find a significantly faster version in wasm even with reduced state model, and I dont think that the states can be further reduced without introducing faulty handling. I did not yet try your suggested optimizations above.

ismail-yilmaz commented 3 years ago

@jerch

In the meantime, some preliminary benchmarks, including wasm:

(Sixel tests, on my development decoder, the one that gets`~42 Mib/s on sixel-bench)

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 1.40ms   (64.00 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.80ms   (29.00 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.50ms   (21.00 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             4.50ms   (70.00 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           30.00m   (66.00 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             18.00ms  (33.00 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.70ms   (42.00 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  120.00ms (131.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              1.10ms   (9.90 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.70ms   (7.40 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms   (1.10 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            1.10ms   (13.00 MB/s)
SixelRenderer_Clean(michael_clean.six):             1.20ms   (17.00 MB/s)
SixelRenderer_Clean(trontank_clean.six):            1.10ms   (15.00 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    20.00ms  (31.00 MB/s)
SixelRenderer_Clean(time_clean.six):                3.70ms   (42.00 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.95ms   (11.00 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.50ms   (24.00 MB/s)
SixelRenderer_Clean(demo05_clean.six):              1.00ms   (34.00 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms   (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.40ms   (17.00 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.70ms   (19.00 MB/s)
SixelRenderer_Clean(space_clean.six):               0.85ms   (9.70 MB/s)

"Fringe" decoder 61 Mib/ss version. ( static RGBA buffer with fixed size of 1920x1080. Yeah, kinda insane. But this can happen when we make compilers really happy. But making compilers happy can make life miserable for a developer. Hence this decoder is not really usable.

Clean decoding, 20 rounds.
SixelRenderer_Clean(sem_clean.six):                 0.90ms  (99.97 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           0.70ms  (73.60 MB/s)
SixelRenderer_Clean(screen_clean.six):              0.70ms  (72.50 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             2.20ms  (143.40 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           12.00ms (168.40 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             3.50ms  (168.70 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.55ms  (52.91 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  74.00ms (210.30 MB/s)
SixelRenderer_Clean(oriole_clean.six):              0.50ms  (21.73 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.45ms  (11.53 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms  (1.15 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            0.50ms  (28.75 MB/s)
SixelRenderer_Clean(michael_clean.six):             0.55ms  (35.07 MB/s)
SixelRenderer_Clean(trontank_clean.six):            0.50ms  (32.82 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    3.40ms  (189.30 MB/s)
SixelRenderer_Clean(time_clean.six):                1.30ms  (118.20 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.45ms  (22.98 MB/s)
SixelRenderer_Clean(biplane_clean.six):             0.60ms  (59.01 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.65ms  (51.93 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.45ms  (0.07 MB/s)
SixelRenderer_Clean(chess_clean.six):               0.55ms  (41.01 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             0.65ms  (48.84 MB/s)
SixelRenderer_Clean(space_clean.six):               0.45ms  (18.35 MB/s)

And wasm on my base testing machine which I mentioned in my earlier messages:

   Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.11 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.20 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.91 ms
            Case "decodeString" : 20 runs - average runtime: 5.32 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.86 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.84 ms
            Case "decodeString" : 20 runs - average runtime: 2.26 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.89 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.23 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.78 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.61 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.66 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 234.74 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 66.26 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.17 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 58.24 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.36 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.63 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.45 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 56.30 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 124.64 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 124.38 MB/s

and perf on sixel-bench

       Output:
       SixelRenderer(data.sixel):          2.90s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)
       SixelRenderer(data.sixel):          2.80s (42.00 MB/s)

       Performance counter stats for './SixelBench' (4 runs):

      3.214,11 msec task-clock:u              #    0,998 CPUs utilized            ( +-  0,23% )
             0      context-switches:u        #    0,000 K/sec                  
             0      cpu-migrations:u          #    0,000 K/sec                  
         4.940      page-faults:u             #    0,002 M/sec                    ( +-  0,03% )
11.763.759.200      cycles:u                  #    3,660 GHz                      ( +-  0,10% )  (83,12%)
   837.240.854      stalled-cycles-frontend:u #    7,12% frontend cycles idle     ( +-  0,26% )  (83,36%)
 2.512.438.090      stalled-cycles-backend:u  #   21,36% backend cycles idle      ( +-  1,25% )  (33,52%)
14.418.053.773      instructions:u            #    1,23  insn per cycle         
                                              #    0,17  stalled cycles per insn  ( +-  0,53% )  (50,13%)
 3.225.163.249      branches:u                # 1003,440 M/sec                    ( +-  0,17% )  (66,73%)
   112.190.910      branch-misses:u           #    3,48% of all branches          ( +-  0,35% )  (83,26%)

       # Table of individual measurements:
       3,24364 (+0,02208) #
       3,22007 (-0,00148) #
       3,20740 (-0,01416) #
       3,21511 (-0,00644) #

       # Final result:
       3,22156 +- 0,00781 seconds time elapsed  ( +-  0,24% )

I'll continue to explore the optimization strategies, part 2, with some more findings, tomorrow. Have a nice weekend.

ismail-yilmaz commented 3 years ago

@jerch ,

Another treat: A modified version of your wasm decoder (I adapted the logic of mine to yours. Seems slightly faster, but contains less checking.) You can compare it with the above benchmarks. (Warning: the patch code is ugly - in C++ standards, at least..)

Results:

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.20 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.29 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 5.01 ms
            Case "decodeString" : 20 runs - average runtime: 5.43 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.77 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.66 ms
            Case "decodeString" : 20 runs - average runtime: 2.17 ms
          Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.77 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.07 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.76 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.69 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 236.52 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.68 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.34 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.31 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.33 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.11 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.91 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 115.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 134.54 MB/s

Please find attached the modified code.. decoder.zip

ismail-yilmaz commented 3 years ago

@jerch

I've made some corrections to my patch and as a result it seems a bit faster than previous:

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.16 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.22 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.77 ms
            Case "decodeString" : 20 runs - average runtime: 5.59 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.84 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.59 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.83 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.10 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.12 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.65 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 41.38 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.68 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 250.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 62.10 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.38 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 57.28 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.30 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 73.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 11.83 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 54.51 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 139.39 MB/s

The code can be easily modified and unrolled, which might let you gain even faster speeds, and test the assembly behavior for your final decoder.

decoder_corrected.zip

jerch commented 3 years ago

@ismail-yilmaz Hmm weird, with your last code I get slightly worse runtimes than with my lastest optimizations (still have to cleanup and push that). Btw in line 187 you do this:

        for(int n = 0; n < 6; ++n) {
            if(c & (1 << n))
                ps.canvas[p + ps.jump_offsets[n]] = ps.color;
        }

Replaced with:

        for(int n = 0; n < 6; ++n) {
                ps.canvas[((c >> n) & 1) * (p + ps.jump_offsets[n])] = ps.color;
        }

I get much higher throughput. It is again the "write always something" trick, can you try if this gives you higher numbers as well? (Because you said above that this is worse for you, so I am not sure, if it misuses some CPU cache settings).

Btw my numbers are constantly higher than yours, so your 42-ish number is more like 65 for my machine. Maybe CPU and bus speed differences can explain most of that, not sure if there are bigger differences in cache loading times between AMD and Intel, I also think that the different pipeline lengths will equal out in the end (well I have not really an informed opinion in that field anymore, lost interest in CPU differences around Pentium 4, which was known to have insane long pipelines).

ismail-yilmaz commented 3 years ago

@jerch ,

Sure (not much difference here, interesting...):

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.07 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.19 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.62 ms
            Case "decodeString" : 20 runs - average runtime: 5.38 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.74 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.58 ms
            Case "decodeString" : 20 runs - average runtime: 2.04 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.98 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.20 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.96 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.88 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.08 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.51 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.83 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 243.80 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 63.77 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 10.55 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 56.14 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 4.35 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 72.80 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 12.24 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 52.60 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 111.89 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 138.63 MB/s

(Because you said above that this is worse for you, so I am not sure, if it misuses some CPU cache settings).

Well, for one, it is because you are using a static integer array with fixed size. Compilers usually vectorize the hell out of them (on x86 arch). I allocate the memory on heap (resizeable) and use aligned RGBA structure. integer operations can use registers. Setting a rgba buffer (even their length be the same) is usually done using memcpy. That's why the performance of my renderer takes hit if I "always write" .

Here's a synthetic benchmark stressing the difference between RGBA/integer and static vs dynamic allocation:

OP         - 500 rounds of random color fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - Buffer<RGBA>  : 56.00 MiB/s
THROUGHPUT - Buffer<int>   : 58.00 MiB/s
TIMING     - Buffer<int>   : 17.06 s  - 34.12 ms (17.06 s  / 500 ), min: 33.00 ms, max: 42.00 ms, nesting: 0 - 500
TIMING     - Buffer<RGBA>  : 17.69 s  - 35.38 ms (17.69 s  / 500 ), min: 34.00 ms, max: 49.00 ms, nesting: 0 - 500

OP         - 100000 rounds of random fill.
BUFFERSIZE - 1920 * 1080,
THROUGHPUT - RGBA   buffer[]  : 5700.00 MiB/s
THROUGHPUT - uint32 buffer[]  : 6200.00 MiB/s
TIMING     - RGBA   buffer[]  :  7.72 ms - 77.20 ns (99.00 ms / 100000 ),  min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000
TIMING     - uint32 buffer[]  :  9.72 ms - 97.20 ns (101.00 ms / 100000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 100000

if there are bigger differences in cache loading times between AMD and Intel

Possibly, because it appears that Athlon FX 6100 is somewhere in between i5 and i7, closer to i5.

ismail-yilmaz commented 3 years ago

@jerch Oops, I've made a mistake. Here are the real results with always write:

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.13 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.23 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.96 ms
            Case "decodeString" : 20 runs - average runtime: 5.56 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.87 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.51 ms
            Case "decodeString" : 20 runs - average runtime: 2.08 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 28.15 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.95 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.44 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.76 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.72 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.73 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.40 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 235.48 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 65.95 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.88 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 66.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.80 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 83.18 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 9.37 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 68.81 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 107.65 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 144.03 MB/s

jerch commented 3 years ago

@ismail-yilmaz Pushed my latest optimizations, which is actually the fastest with all proper state transitions. Thats the one where I get >90 MB/s for all tests of the benchmark, and ~170MB/s for the 12bit noise image (which should not be taken too serious, as it is "very opinionated" in its data). The reduced state model is also contained as decode_ (so you flip and test as well).

Edit: Ah I see, your last numbers are more like mine only showing a small gap now.

jerch commented 3 years ago

Yes the static nature is much easier to deal with, but thats all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Ofc with the pixels on the heap you can not get the pseudo cache locality from chunk bytes, as it could be allocated elsewhere (and the chunk bytes prolly live elsewhere as well).

ismail-yilmaz commented 3 years ago

Yes the static nature is much easier to deal with, but thats all I need for the wasm instance. The level to get multiple instances here is the wasm container itself, thus one would just spawn multiple wasm decoders if needed.

Yeah, that's, unfortunately, not really affordable for me That's why I'd better explore other options...

jerch commented 3 years ago

I did not like the static memory in the first place, because it creates tons of frictions for the higher level JS integration. But emscripten currently does not allow "memory.grow" for direct wasm builds (its simply not yet implemented lol), thus removing the allocators and going fully static was the easiest way for me to overcome the shortcomings. And since it shows better performance I am not really mad about it and will reshape the JS integration to suit that model.

Edit: But cant you do something alike - like allocating a bigger area once and do the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...

ismail-yilmaz commented 3 years ago

@jerch

I'll update the node-sixel shortly and report the results later today.

Edit: But cant you do something alike - like allocating a bigger area once and do the memory semantics on your own within that area? I remember doing that for a game AI once, where I misused a bigger portion of stack memory with alloca resulting in a real big performance jump during the negamax descent. Ofc the stack is not suitable here, but maybe some own byte sink on the heap...

Of course. I already did some "cool/clumsy" tricks that "just work" and failed miserably at others, but they were not feasible in general, because the real problem is that our vte is a widget and it can, and should continue to, work on a variety of devices and backends, ranging from liimited hobby hardware and rasberry Pi to hi-perf desktops, on SDL/linuxfb and even in web browsers. That's another reason why I find your work on wasm and opinions important and very productive for me. You see, I am also the co-author and maintainer of our HTML5 backend which allows apps that use our framework to run inside web-browsers (canvas + websockets). It is called Turtle. As its name suggests, it uses javascript :)) Thus I am also exploring the possibilities that webassembly can offer us. After all our HTML backend boils down to image and input handling which I believe can be satisfyingly optimized using webassembly.

jerch commented 3 years ago

Oh I see, well thats interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path. But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/spidermonkey and WebKit/JSScriptCore - both are a bit slower than V8 from what Ive seen so far).

ismail-yilmaz commented 3 years ago

@jerch

Oh I see, well thats interesting. What JS engine do you use for that? If you want to go that route we prolly should share more general ideas about wasm and JS interactions.

I did some more testing before getting down to a wasm sixel decoder implementation, mainly around nasty tasks in xterm.js like base64 and utf8 decoding. Wasm beats all of my JS implementations by far (at least 2 times faster), and those are already among the fastest in JS land (all tested on V8). Furthermore - calling into wasm creates almost no overhead, if done right. Currently I think that wasm can be used to replace performance critical JS sections, even if called over and over on a hot path. But note that I have only limited experience with other JS/wasm engines (did only surface tests with Firefox/spidermonkey and WebKit/JSScriptCore - both are a bit slower than V8 from what Ive seen so far).

Well, turtle is pretty lightweight as it does not try to recreate a gui with JS. What it does is simply redirect the app window's output (and an associated background) to a web browser via a simple base64 encoded binary protocol. It gets decoded by javascript and displayed and key + mouse input are tracked. This all happens in a simple loop so it can work on all major browsers. So all it does is some image processing and blitting partial or complete images from a server application to a client web browser capable of canvas and web sockets. That's why I think it can be easily optimized using webassembly (I may be dead wrong of course.) To give you a clear idea of what I am talking about, here is an example. (Note that this is two years ago and neither TerminalCtrl nor turtle are as slow as this now.)

But then again, I don't really want to pollute further our sixel and wasm optimization discussion wih other stuff. Later when I implement this I'd love to discuss this and learn more about webassembly from your experience with it.

ismail-yilmaz commented 3 years ago

Ok, I downloaded the latest changes and here is the comparison:

decode() - always write

  Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.09 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.19 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.79 ms
            Case "decodeString" : 20 runs - average runtime: 5.44 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.84 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.49 ms
            Case "decodeString" : 20 runs - average runtime: 1.98 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.80 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.63 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.64 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 41.38 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.61 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.64 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 243.20 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 63.87 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.69 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 68.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.83 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 82.61 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 9.43 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 68.33 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 121.10 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 128.01 MB/s

decode_() - always write:

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.24 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.40 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 5.14 ms
            Case "decodeString" : 20 runs - average runtime: 5.55 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 4.23 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.35 ms
            Case "decodeString" : 20 runs - average runtime: 1.97 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 28.42 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 19.35 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 32.53 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.80 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 40.54 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 19.33 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 33.34 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 229.72 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 67.67 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.70 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 68.13 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.85 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 82.20 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 8.85 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 72.84 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 136.91 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 113.34 MB/s

my patch (always write)

 Context "./lib/index.benchmark.js"
      Context "testimage"
         Context "pixel transfer"
            Case "toPixelData - with fillColor" : 20 runs - average runtime: 2.14 ms
            Case "toPixelData - without fillColor" : 20 runs - average runtime: 1.17 ms
         Context "decode (DefaultDecoder)"
            Case "decode" : 20 runs - average runtime: 4.90 ms
            Case "decodeString" : 20 runs - average runtime: 5.50 ms
            Case "decode + pixel transfer" : 20 runs - average runtime: 3.86 ms
         Context "decode (WasmDecoder)"
            Case "decode" : 20 runs - average runtime: 1.17 ms
            Case "decodeString" : 20 runs - average runtime: 1.69 ms
         Context "encode"
            Case "sixelEncode" : 20 runs - average runtime: 27.93 ms
      Context "decode - testfiles (DefaultDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 18.97 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 33.33 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 7.48 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 42.28 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 18.59 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 34.67 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 220.23 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 70.46 MB/s
      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 8.41 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 70.54 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.64 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 87.14 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 8.95 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 71.97 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 101.49 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 152.77 MB/s

jerch commented 3 years ago

kk you got me, gonna try your patches with the "always write" patch merged than. Seems to be the fastest for you, if I am not mistaken? Still wondering why I see much more of an advantage for "always write" than you do though...

ismail-yilmaz commented 3 years ago

@jerch

kk you got me, gonna try your patches with the "always write" patch merged than. Seems to be the fastest for you, if I am not mistaken? Still wondering why I see much more of an advantage for "always write" than you do though...

Don't bother, I'd just synced the repo, compiled the decoder and ran it (while you changed the canvas start address, and I missed to update my code.) So it issomewhat false alarm: I checked & fixed the canvas address, Now my patch's performance is somewhere in between your decoders', and is slightly faster than the decode (original).

ismail-yilmaz commented 3 years ago

The major and possibly the only problem with the so-called "write-always" method -which works exceptionally well on static-fixed-sized OR small canvases-is that it does not seem to scale well to larger images thar require dynamic-variable-sized buffers. Writing to buffer[0] becomes much more expensive than branching if the said address is not in the cache line or any level of cpu caches. I think this can be worked-around by introducing alternative PaintSixel methods for larger and smaller buffers + calculating/checking the absolute cursor position by powers of cacheline length. This makes things a little more complicated though.

jerch commented 3 years ago

Hmm sounds reasonable. Would a local dummy pixel help here? Something like sliceA, sliceB, ... - where every slice has its own dummy area at the beginning?

ismail-yilmaz commented 3 years ago

@jerch

Thinking again, my initial reasoning might be faulty and I don't want to mislead youç. Now, I haven't tested this yet, but currenty we're moving along the y axis when drawing a strip of single sixel. So when the buffer(0,[0) is hit, it will probably be in the hot cache for the next round, as we are already in a loop (Probably the reason why always-write is faster in the first place.). OTOH, writing on the beginning of the current line (buffer(0, y +n)), for example, would probably be useful had we been moving along the x axis. Needs some testing and more thinking....

jerch commented 3 years ago

Idk if we should optimize for cache specialties in the end, but it sounds like it is worth to be tried.

Idk at which occasions you have to resize the canvas by what extend (and where the cut points are), from sixel writing perspective imho the smallest contiguous mem area would be a sixel line, thus 6 * width pixels. Such a sixel line could hold at [0] (plus some alignment) dummy pixels, thus we would get better cache locality at least for the first real pixels (for very long lines the cache offset might be too big again, but the static mem has the same issue). Last but not least lines with offsets like that need an additional final compose step to get the full canvas, thus the whole things needs to be profiled, whether the cache benefit outweighs the compose workload. Tbh I doubt that, still it sounds worth trying, as contiguous copying is quite cheap. (Maybe you dont even have to compose it prehand, but just push line pointers forward to compose it during drawing.)

Well, just a wild idea.

Edit: In the original JS decoder I dont write to the final image canvas positions directly, but to sixel bands into x direction like that [a1 ... a6, b1 ... b6]. While this is very fast during streaming, the final compose step is really ugly and expensive (and one reason for my reduced level 2 wasm decoder).

jerch commented 3 years ago

You prolly already know that - here are some interesting tests about cache effects on runtime: http://igoro.com/archive/gallery-of-processor-cache-effects/

jerch commented 3 years ago

This brings me to another idea - an attempt to create a cache friendly encoder + decoder with "local chunks":

Encoder:

slice a sixel line into 16 * 6 pixel chunks (--> 6 cache lines), or maybe bigger cache line multiples
do all pixel color addressing within one chunk (thus all color registers for that chunk with LFs and sixel offset data)
repeat until sixel line is done

Decoder:

alloc local canvas chunk (maybe even on stack)
apply all color directives and LFs
memcpy 6 * chunklength to canvas position
increase canvas position
repeat with all chunks

This model introduces additional LF roundtrips with sixel offset data, but they are rather cheap. The memcpy action will be expensive, since it is the only action into non-cached mem, but thats always a contiguous copy of chunklength, thus not that bad.

Big downside - the decoder would only benefit, if the images are encoded that way. Data size will slightly increase due additional LF and sixel offset data. Whether thats actually faster remains uncertain until actually tried. Big plus - such a model can easily be run concurrently (could even be done on the GPU).

(If I am bored I gonna try such a version, but more for academic purpose...)

ismail-yilmaz commented 3 years ago

You prolly already know that - here are some interesting tests about cache effects on runtime: http://igoro.com/archive/gallery-of-processor-cache-effects/

Thanks, a nice summary indeed. The example #7 is an interesting situation. IMO, however, a generic, cache-friendly decoder would be enough in practice.

This brings me to another idea - an attempt to create a cache friendly encoder + decoder with "local chunks":

This. I really like the idea. In fact I had attempted to write a similar MT decoder -not really as cache-friendly as your suggested enc/dec pair though- some months ago. How it operated was as follows:

Split the sixel lines beforehand and calculate the total number of canvas lines in pixels.
Create a vector of RGBA structures/strips (initial lengrh: 1024) from the given number of sixel lines.
Feed the vector of line buffers into the sixel painter.
Merge the strips into a final image buffer.

It was slighty faster (~2-3 Mib/s) in single threaded mode and a bit more faster in multithreaded mode. But I just discarded the idea, for, at the time, it did not seem to me as a real improvement. This was due to the initial version of my sixel renderers design. It was meant to "just work" and not optimized at all. (was giving ~20 Mib/s at most.) But now, with all the speed gain I achieved, thanks to our conversations and some of your ideas, I may try this again as a side project to see the results.

jerch commented 3 years ago

Woops - I mixed LF and CR above in the model description. Ofc I meant CR to get the next color into the chunk.

ismail-yilmaz commented 3 years ago

Ok, that makes better sense now.

jerch commented 3 years ago

And btw - the memcpy action could be marked as non-cacheable, as the cache friendly sixel format would guarantee to be final on a pixel (not needed anymore for decoding). Would avoid nonsense cache refreshs accidentally dropping other still needed parts.

ismail-yilmaz commented 3 years ago

I've just identified and fixed a bottleneck in my sequence parser. It is both amazing. and embarrassing:

VTInStream(data.sixel):                  0.58s (200.00 MiB/s)
SixelRenderer(data.sixel):               2.10s (56.00 MiB/s) // Formerly ~42 MiB...

lol.

Sixel viewer now displays the animation with ~48-49 (oscillating) MiB/s througput.

I was using a string to collect the DCS, as it was convenient. All i did was to change it to a vector with a bit of (<= 64 k) reserved memory.

jerch commented 3 years ago

@ismail-yilmaz Did a quick impl of the cache friendly sixel format encoder as described above. Well the result is not that promising. With a chunklength of 256 * 6 pixels it changes as follows:

data size increased by 30% (!)
wasm decoder takes 7% longer now (cache optimized decoder is not yet done)

The 30% size increase is surprisingly high (might still be buggy) and results from the CR resets with offset handling at the chunk borders, which is needed to still form valid sixel data. While the decoder runs faster in terms of throughput (100 MB/s vs. 122 MB/s), we have no gain yet. Not sure, if a cache optimized decoder can make up for more than the additional 7% runtime introduced by the bigger data size. Atm the penalty on normal decoders alone makes this approach subpar.

Btw I made the chunklength configurable, so its easier to test at different cache utilizations (width of 256 pixels is at 96 cache lines).

jerch commented 3 years ago

Some interim results of a first cache optimized decoder impl:

initial code changes with local canvas chunk: 10 times slower :sweat_smile:
+simd: 21% slower
+_int128 for memcpy: 17% slower

Currently a big performance dent comes from a simple modulo calc I have to do for every pixel addressing, but that will be gone with a proper full rewrite. Furthermore there is still a logic twist hidden somewhere, that forces me to do a load from canvas back to the local chunk, which should not be needed and creates a rather big penalty. With fixing both I expect that decoder to be much faster than the current fastest one.

Remarkable about the findings above is the fact, that the memcpy action normally is really expensive, but gets hidden in the other CPU load with simd + int128. Not sure yet, why int128 gave another 4% advantage, thought memcpy would get vectorized anyway with simd enabled. Maybe its a shortcoming of the wasm runtime, needs some investigation.

jerch commented 3 years ago

A possible explanation, why "write always" is that much faster (did some intel arch reading, but still far from understanding the details):

The sixel write actions in put_single to 6 different mem cells are non-dependent, thus can be reordered by the CPU. (Btw thats not quite true for the [0] case, maybe more room for optimization, not sure if the reordering will recognize, that always the same value gets written). With that in mind a modern x86 can process multiple micro-ops in parallel at pipeline level, thus "parallelizes" those writes to some degree. With the conditional checks in place (not "write always") that parallalism gets wastes with speculative branching with a bad retired instruction ratio in the end.

I dont know, how smart the CPU is at the index calc done, whether it will shorten k * (a + b + *c) for k=0 to zero early without bothering loading the right side. If I get it right, there is some kind of race determining k and (a + b + *c) in parallel, so I think it will at least try to load *c from mem, even if k=0 comes in early. But thats all speculative for me atm (could do some micro benchmarks to find out, but meh). Nonetheless there might be still some perf hidden here to save a few cycles, as it is the hottest of the hot paths.

jerch commented 3 years ago

Some more minor optimizations at the pixel index calc give me now these results with my default decoder:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 6.15 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 97.74 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.20 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 99.42 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 6.23 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 103.96 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 87.73 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 176.77 MB/s

Changes:

unrolled bits (saves one shift per bit)
reintroduced branching in index calc for put_single to skip sum for k=0, this is slightly faster than multiply by zero
ps.width * offset is faster in put_single than loading precalc'ed value from jump_offsets, for put it is the opposite
for --> while in put
overall gain ~5%

Code

// in put_single
    int p = ps.cursor + ps.offset;
    ps.canvas[code & 1  ? p                : 0] = color;
    ps.canvas[code & 2  ? p + ps.width     : 0] = color;
    ps.canvas[code & 4  ? p + ps.width * 2 : 0] = color;
    ps.canvas[code & 8  ? p + ps.width * 3 : 0] = color;
    ps.canvas[code & 16 ? p + ps.width * 4 : 0] = color;
    ps.canvas[code & 32 ? p + ps.width * 5 : 0] = color;

// in put
    int p = ps.cursor + ps.offset;
    int pp;
    int r;
    if (code & 1)  { pp = p + ps.jump_offsets[0]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 2)  { pp = p + ps.jump_offsets[1]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 4)  { pp = p + ps.jump_offsets[2]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 8)  { pp = p + ps.jump_offsets[3]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 16) { pp = p + ps.jump_offsets[4]; r = n; while (r--) ps.canvas[pp++] = color; }
    if (code & 32) { pp = p + ps.jump_offsets[5]; r = n; while (r--) ps.canvas[pp++] = color; }

The put code looks like there is some gain possible with pointer arithemtics, but thats actually not true for wasm. Imho a native build would benefit from doing it this way (have not checked it):

    int *p = ps.canvas + ps.cursor + ps.offset;
    int *pp;
    int r;
    if (code & 1)  { pp = p + ps.jump_offsets[0]; r = n; while (r--) *pp++ = color; }
    ...

jerch commented 3 years ago

@ismail-yilmaz

Guess I am done with optimization for now. With the latest changes I get 37-39 MB/s in xterm.js synchronous (no webworker involved). The decoder dropped to ~40% of total runtime in the browser, thus I am already in a region where only significant improvements will show further impact. And since the final solution will be web worker driven, it might get over 40 MB/s, depending on the impact of the mainthread to worker copy overhead.

I am pretty happy with these results, the wasm decoder is now 2.5 - 3 times faster than the original JS implementation.

ismail-yilmaz commented 3 years ago

@jerch

Sorry I couldn't reply earlier - very busy week. A bunch to digest here. I'll read them through tonight.

ismail-yilmaz commented 3 years ago

First try: (single paint)

        *(buffer + ((c & 1)  ? coords[0] + cursor.x : 0)) = ink;
        *(buffer + ((c & 2)  ? coords[1] + cursor.x : 0)) = ink;
        *(buffer + ((c & 4)  ? coords[2] + cursor.x : 0)) = ink;
        *(buffer + ((c & 8)  ? coords[3] + cursor.x : 0)) = ink;
        *(buffer + ((c & 16) ? coords[4] + cursor.x : 0)) = ink;
        *(buffer + ((c & 32) ? coords[5] + cursor.x : 0)) = ink;

result:

SixelRenderer(data.sixel):          2.40s (49.00 MB/s) // drops down from 56 Mib/s  (on sixel-bench) :/

Webassembly seems to defy some, if not all, of my expectations.

Edit: code fixed.

ismail-yilmaz commented 3 years ago

By the way, precalculating the Y offsets works fine here. It allows a little but real gain. So I am going to borrow it from you, and use it. (with ack, ofc.)

jerch commented 3 years ago

Yeah wasm produces some surprising results, esp. for index vs. pointer arithmetics. Normally I would expect no differences anymore for those for a native build, while 20ys ago the pointer stuff would almost always win. In wasm the index addressing is almost always faster. This might be the compiler not yet optimizing for the fastest version under wasm yet, or wasm itself with its stack limitations (cannot pull values from registers, has instead to push/pop arguments before using any instruction).