More sixel renderer optimizations.

ismail-yilmaz / upp-components

A collection of packages for U++ framework.

Other

35 stars 2 forks source link

More sixel renderer optimizations. #13

Open ismail-yilmaz opened 3 years ago

ismail-yilmaz commented 3 years ago

Hello @jerch,

I don't want to pollute your repos or other active discussion threads with my findings until I can come up with some practical optimization suggestions for your renderer. So I'll put this here.

I've been doing some heavy profiling on my SixelRenderer since last week. Here are some preliminary results:

I identified both some bottlenecks in my code and the performance sensitive critical parts of sixel rendering in general.
I achieved some nice progress by some simple tweaks since last week:
- I have managed to sequeze out a steady 42 MiB/s (2.90 secs) (parsing + rendering) on your sixel-bench animation, which means ~12 MiB/s gain. for me
- There is a small performance penalty associated with displaying the animation due to the gui event handling (latency) of our GUI toolkit, But my SixelViewer app (a small testing and benchmarking app) can display the full sixel-bench video on average 38.00 Mib (3.10 secs.). As you can guess, this does not translate 1:1 to final terminal performance, because other variables also effect the vte rendering performance. Still it can achieve slightly higher performance than MLTerm now.

sixel-renderer-optimized

The above improvements are all due to some really simple tricks -one of which I borrowed from you- and the examination of the produced assembly code. I am confident that we can raise the throughput bar even higher, because there is still room for optimization.

In the following days this week I'd like to share my findings (code changes/alterations) with you, which you might find interesting and, hopefully, useful for your c/wasm decoder.

jerch commented 3 years ago

Btw I think the heavy state handling / chewing on bytes penalty (> 50% since last changes) could be somewhat lifted with explicit SIMD usage. Have not investigated into that direction and would not expect too much of a benefit. It really depends on how good the data can be batched. At least for sequential sixel bytes in a line and color register definitions (param parsing) I see a chance here (beside that is sixel very highly fragmented, thus simd with too broad register loads might even degrade performance). Might be worth a shot for you. While I have currently simd in wasm enabled, I have to diable it again due to missing support in safari and older nodejs versions (works only in nodejs 16+).

ismail-yilmaz commented 3 years ago

Btw I think the heavy state handling / chewing on bytes penalty (> 50% since last changes) could be somewhat lifted with explicit SIMD usage. Have not investigated into that direction and would not expect too much of a benefit. It really depends on how good the data can be batched. At least for sequential sixel bytes in a line and color register definitions (param parsing) I see a chance here (beside that is sixel very highly fragmented, thus simd with too broad register loads might even degrade performance). Might be worth a shot for you. While I have currently simd in wasm enabled, I have to diable it again due to missing support in safari and older nodejs versions (works only in nodejs 16+).

Yeah that can help. In fact I have tried SIMD for at least compressed paint. We have a low level function called memcpy_t which uses very fast SIMD-based 8/16/32/64/128 mem-copy functions depending on the count of elements to be copied. Whole code was as follows:

for(int n = 0; n < 6; ++n) {
     if(c & (1 << n)) {
       RGBA *p = buffer + coords[n] + cursor.x;
        memcpy_t(p, &ink, repeat); 
     }
}

And it did give me some meaningful performance (~9 gain on average across the tests that use heavy compression). But for some reason, it leaves some artifacts on the images (the usual suspect is mem alignment/padding, which I did not bother checking at the time, as it a was test out of curiosity).. It needs further investigation. so I deferrred it as a TODO for later. Currently I am reconsidering a complete refactor of the lexical component of my sequence parser, which turned out to be a major culprit for the performance degradation (as I mentioned earlier).

jerch commented 3 years ago

Yeah thats another ground for simd, certainly generated graphics/plots could benefit from that, since they usually have a high amount of compressed sixels with high repeats.

Do you mean with ~9 gain like 9 times faster? And yes, the artefacts are prolly from mis-aligments. You can fix that with some clever switch jumping (similar to duff's device). :smile_cat:

ismail-yilmaz commented 3 years ago

Do you mean with ~9 gain like 9 times faster?

Lol. I just wish! I forgot to add % sign. :D

ismail-yilmaz commented 3 years ago

Benchmarks with and without SIMD-based memset32 in compressed sixel paint (no artefacts):

20 rounds, clean, with SIMD
SixelRenderer_Clean(sem_clean.six):                 1.50ms (60.40 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.60ms (32.56 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.30ms (22.42 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             3.80ms (83.91 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           21.00ms (92.19 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             10.00ms (57.41 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.71ms (41.17 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  100.00ms (150.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              0.99ms (11.00 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.68ms (7.63 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.44ms (1.16 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            0.99ms (14.59 MB/s)
SixelRenderer_Clean(michael_clean.six):             0.92ms (21.00 MB/s)
SixelRenderer_Clean(trontank_clean.six):            0.98ms (16.75 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    11.00ms (57.34 MB/s)
SixelRenderer_Clean(time_clean.six):                3.30ms (47.17 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.95ms (10.85 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.30ms (27.99 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.87ms (39.00 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.43ms (0.08 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.20ms (18.85 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.60ms (20.08 MB/s)
SixelRenderer_Clean(space_clean.six):               0.84ms (9.89 MB/s)

SixelRenderer(data.sixel):                          2.10s (58.00 MB/s)

20 rounds, clean, without SIMD
SixelRenderer_Clean(sem_clean.six):                 1.70ms (53.21 MB/s)
SixelRenderer_Clean(boticelli_clean.six):           1.60ms (32.14 MB/s)
SixelRenderer_Clean(screen_clean.six):              2.60ms (19.57 MB/s)
SixelRenderer_Clean(test2_clean.sixel):             4.00ms (78.57 MB/s)
SixelRenderer_Clean(sampsa1_clean.sixel):           24.00ms (82.70 MB/s)
SixelRenderer_Clean(test1_clean.sixel):             11.00ms (55.11 MB/s)
SixelRenderer_Clean(rose16_clean.six):              0.77ms (37.90 MB/s)
SixelRenderer_Clean(fullhd_12bit_noise_clean.six):  110.00ms (137.40 MB/s)
SixelRenderer_Clean(oriole_clean.six):              1.30ms (8.38 MB/s)
SixelRenderer_Clean(cat2_clean.six):                0.71ms (7.32 MB/s)
SixelRenderer_Clean(zx81_clean.six):                0.45ms (1.15 MB/s)
SixelRenderer_Clean(leonardo_clean.six):            1.30ms (11.30 MB/s)
SixelRenderer_Clean(michael_clean.six):             1.10ms (17.41 MB/s)
SixelRenderer_Clean(trontank_clean.six):            1.00ms (15.64 MB/s)
SixelRenderer_Clean(sampsa_reencoded_clean.six):    11.00ms (56.55 MB/s)
SixelRenderer_Clean(time_clean.six):                3.40ms (45.69 MB/s)
SixelRenderer_Clean(gnuplot_clean.six):             0.96ms (10.76 MB/s)
SixelRenderer_Clean(biplane_clean.six):             1.30ms (26.52 MB/s)
SixelRenderer_Clean(demo05_clean.six):              0.91ms (37.27 MB/s)
SixelRenderer_Clean(testhlong_clean.six):           0.43ms (0.08 MB/s)
SixelRenderer_Clean(chess_clean.six):               1.30ms (16.94 MB/s)
SixelRenderer_Clean(oriole2_clean.six):             1.90ms (16.87 MB/s)
SixelRenderer_Clean(space_clean.six):               0.87ms (9.47 MB/s)

SixelRenderer(data.sixel):                          2.10s (57.00 MB/s)

jerch commented 3 years ago

I would keep that, it is not overwhelming but a nice boost for contiguous color handling.

But I think that can be driven a step further - by slicing the data at CR ($) and register select (#nnn) basically only compression directives and sixel bytes are left. This can be used to approach the pixel writing always in contiguous order by loading the sixels into simd, and do the bit shift there, schematically:

on register select: create colorMask as [color, color, color, color] (4x32)
repeat for local sixels:
- load 4 sixels into 4x32 simd
- repeat 6 times:
- create 4x32 bitMask from from lsb components --> [0xFFFFFF, 0, ...]
- write: *p = *p | (colorMask & bitMask); p += width/4;
- rshift sixel vector components

Well something like that. It would reduce the instructions per pixel by quite some amount (well the needed back loading from *p is nasty and might actually perform worse, idk). Also note that the compression could be ballooned here, as it is just a special of the general handling (still might be faster on a separate code path, as the masking per bit could be done as prepare step in colorMask directly).

Edit: Some runtime estimates of that - I'd expect that to be ~2 times faster, with a big range of uncertainty introduced by alignment needs (you somehow have to get prolly misalgined sixels in there fast) and the fact, that this will do more dummy writes, where the linear handling might have skipped it. With __m256 you could even do 8 pixels at once, not sure if you want to test "bigger is better" route, wasm-simd only supports 128 as far as I am aware.

jerch commented 3 years ago

@ismail-yilmaz Played around with the simd idea above and got something working. It is still has bugs in color transfers, but the performance is already quite good for a first hacky impl:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 5.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 103.60 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.51 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 91.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 5.71 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 115.01 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 92.40 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 167.82 MB/s

Here the code for the non repeated paints:

void simd_paint() {
  if (!reg_agg) return;

  __m128i colorMask = _mm_set_epi32(ps.color, ps.color, ps.color, ps.color);
  __m128i ones = _mm_set_epi32(1, 1, 1, 1);
  __m128i sixels = _mm_load_si128((__m128i *) reg);

  int p = l_cur + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colorMask, bitmask);

    __m128i old = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i final = _mm_or_si128(old, updated);

    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }

  _mm_store_si128((__m128i *) reg, _mm_setzero_si128());
  reg_agg = 0;
}

reg is an int[4] array, where sixels get loaded to from the main chunk loop as in cursor % 4. Whenever the write is below or above that 4 pixel area, simd_paint gets called, also for color changes and CR/LF. The single byte loading into reg and the pixel range handling is still buggy, and now the biggest runtime penalty (while the paint calls runtime dropped by almost 50%).

ismail-yilmaz commented 3 years ago

@ismail-yilmaz Played around with the simd idea above and got something working. It is still has bugs in color transfers, but the performance is already quite good for a first hacky impl:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 5.83 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 103.60 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 3.51 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 91.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 5.71 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 115.01 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 92.40 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 167.82 MB/s

Here the code for the non repeated paints:

void simd_paint() {
  if (!reg_agg) return;

  __m128i colorMask = _mm_set_epi32(ps.color, ps.color, ps.color, ps.color);
  __m128i ones = _mm_set_epi32(1, 1, 1, 1);
  __m128i sixels = _mm_load_si128((__m128i *) reg);

  int p = l_cur + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colorMask, bitmask);

    __m128i old = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i final = _mm_or_si128(old, updated);

    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }

  _mm_store_si128((__m128i *) reg, _mm_setzero_si128());
  reg_agg = 0;
}

@jerch Hmm, a clever trick indeed. I get the picture now. I'll try this tomorrow. (or later today, whatever).

jerch commented 3 years ago

In that variant is still room for further optimization - eg. the loading of old from canvas can be made less often from far memory, if pulled into cache on the 4-pixel range change (when l_cur changes). However, as written above the whole paint dropped from ~40% runtime of my other decoder to ~25%, while the byte chewing on the state loop increased by almost the same amount (is now at 70%). Imho the only way to further save runtime is to do the input slicing with simd as well. But thats not that easy and since sixel data is that highly fragmented, I am not sure if there is anything to gain (batches will often trigger into slow path subchunk handling).

Edit: Btw - tried the batched sixel painted above also with an advance of 8 pixels - not with avx2 (since I cannot use it), but with 2 interim reg sixel registers calling the paint twice. Runs worse, guess it triggers to many nonsense canvas updates. With avx2 it would not be that bad (since you can do it in one go), def. worth a shot for you (but will only work with newer machines like haswell+).

ismail-yilmaz commented 3 years ago

With avx2 it would not be that bad (since you can do it in one go), def. worth a shot for you (but will only work with newer machines like haswell+).

The thing is, we officially have a nice API for up to 128 bit SIMD in U++, which simplifies things drastically (operator overloading, bit manipulation, loading storing, broadcasting, etc.). Not to mention it encapsulates ARM NEON too. This will allow me to keep the compatibility with different backends, which I've mentioned earlier. So It think it will be sufficient for me If I can get a meaningful gain with the approach you've suggested last night. The only problem I have to solve with that approach now is to efficiently read 4 sixel bytes in a row without changing the existing loop much. Besides I am almost satisfied with the results I already have (last night, another round of refactoring -wiping out my mortal laziness from the codebase- just boosted the sequence parser to 320 Mib/s on sixel-bench and that gave me a steady 62 Mib/s, which is a Mib higher than the test code in which I used static, fixed size buffer. (I wonder what'll I get with that static bufffer. Well, I'll test that too...) Seems this time I am pretty close to the limit of hardware...

jerch commented 3 years ago

Hmm thats interesting, found a big difference in native vs wasm build:

// faster in native (gcc)
__m128i sixels = _mm_load_si128((__m128i *) reg);
// faster in wasm (clang)
__m128i sixels = _mm_set_epi32(reg[3], reg[2], reg[1], reg[0]);

Seems wasm does not like pointers at all? Sadly idk how to peek into the final asm created by the wasm runtime...

ismail-yilmaz commented 3 years ago

Hmm thats interesting, found a big difference in native vs wasm build:
// faster in native
__m128i sixels = _mm_load_si128((__m128i *) reg);
// faster in wasm
__m128i sixels = _mm_set_epi32(reg[3], reg[2], reg[1], reg[0]);
Seems wasm does not like pointers at all? Sadly idk how to peek into the final asm created by the wasm runtime...

The official page writes:

_mm_load_si128	🟡 wasm_v128_load. VM must guess type. Unaligned load on x86 CPUs.

🟡 there is some information missing (e.g. type or alignment information) for a Wasm VM to be guaranteed to be able to reconstruct the intended x86 SSE opcode. This might cause a penalty depending on the target CPU hardware family, especially on older CPU generations. `

It kinda reads "Good luck!" to me. :)

jerch commented 3 years ago

Oh thx for finding this one, yeah the unaligned load is alot slower (had to get rid of it with __attribute__((aligned(16))) all over the place, which boosted native, but not wasm). No I know why...

jerch commented 3 years ago

@ismail-yilmaz Making baby steps towards SIMD parsing. Got the easy part working - slicing data at single state changing bytes (CR, LF, color/attribs/compression introducer). This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

The real challenge now is not to waste too much cycles on actual parsing of the subchunk data. :smile_cat: Basically I see there two options:

go with minimal copying - will run alot into unaligned SIMD instructions
move SIMD digestable bytes into aligned memory before processing them

Normally I am a big fan of less data moving, but the penalty of unaligned SIMD access seems rather high, so an additional copy step might help, if many SIMD instructions will follow. Also not sure if my SIMD-fu is good enough to get number/color parsing done with it, might just drop to the old code here (which is no biggy, as those are low on the runtime meter). Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Regarding parsing numbers, this might help: http://0x80.pl/articles/simd-parsing-int-sequences.html

ismail-yilmaz commented 3 years ago

This high level lexing runs at 2GB/s native, and 700 MB/s in wasm (heavy usage of _mm_load_si128, no clue how to avoid that). It is ~8 times faster in native than the byte-by-byte loop, and still ~4 times faster under wasm.

Those numbers are impressive and sounds very promising!

Still the sixel data bytes will need special care, and better get moved over to SIMD paint in aligned batches.

Yeah this is what I have in my mind too. For the next round, my strategy will be to focus on batching single sixels with color information (map them), without modifying the original loop much, then flush them at once using SIMD where count % 4 == 0 . I'd like to see the difference for better or worse.

jerch commented 3 years ago

Did some thinking/modelling of parsing with SIMD. My idea is currently the following:

input          ['#n!nnsss#nnss#nns#nss...']
load as epi8   ['#n!nnsss#nnss#nn']
slice singles  color:['n'] compression:['nnsss'] color:['nnss'] color:['nn']

These are "typed" fragments containing:

color /color definition (+ sixel bytes)
compression (+ sixel bytes)
pure sixel bytes (as continuation of previous vector)

The first fragment might be continuing from the previous vector, thus needs some sort of state carry. In the same sense the last fragment might not be finished yet. Not sure yet how efficiently deal with that.

All other fragments are "final", thus have proper outer state boundaries and can be processed atomic. The main problem here I currently have is to efficiently identify, whether a fragment has follow-up sixel bytes or not. In SIMD that could be done as follows:

__m128i data = _mm_loadu_si128(fragment);
__m128i lesser_63 = _mm_cmplt_epi8(data, 63);
__m128i lesser_127 = _mm_cmplt_epi8(data, 127);
__m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);

While this works, it creates a rather big penalty, and is not yet helpful in the later sixel processing.

And there is another problem with that fixed fragment idea - the biggest atomic directive the sixel format can have, is a non-disturbed color definition like:

#nnnn;2;xxx;xxx;xxx

This can only be processed in 128 SIMD in one go, if we limit the register numbering to one digit, bummer. Means at least for this one it is not possible without looping again. Furthermore sixel is a very relaxed format, thus any data can be disturbed by NOOP chars. Means we either need a pre-cleanup step, or have to do some sanitizing on the fly. Both is rather expensive. Also not sure yet how to deal with that.

jerch commented 3 years ago

Modified my top-level parser abit, which makes it slower (at 605 MB/s native), but now also precalculates the sixel byte offsets:

void decode(int length) {
  __m128i LF = _mm_set1_epi8(45);
  __m128i CR = _mm_set1_epi8(36);
  __m128i COLOR = _mm_set1_epi8(35);
  __m128i COMP = _mm_set1_epi8(33);

  int l = length / 16 * 16;
  int rem = length - l;
  for (int i = 0; i < l; i += 16) {
    __m128i part = _mm_load_si128((__m128i *) &ps.chunk[i]); // this is actually very slow in wasm!
    int start = i;
    int end = i;

    // test for singles: CR, LF, COMPRESSION and COLOR introducer
    __m128i testLF = _mm_cmpeq_epi8(part, LF);
    __m128i testCR = _mm_cmpeq_epi8(part, CR);
    __m128i testCRLF = _mm_or_si128(testLF, testCR);

    __m128i testCOLOR = _mm_cmpeq_epi8(part, COLOR);
    __m128i testCOMP = _mm_cmpeq_epi8(part, COMP);
    __m128i testCOLORCOMP = _mm_or_si128(testCOLOR, testCOMP);

    __m128i testSINGLE = _mm_or_si128(testCRLF, testCOLORCOMP);
    int hasSINGLE = _mm_movemask_epi8(testSINGLE);

    // identify sixels
    __m128i lesser_63 = _mm_cmplt_epi8(part, _mm_set1_epi8(63));
    __m128i lesser_127 = _mm_cmplt_epi8(part, _mm_set1_epi8(127));
    __m128i sixels_marked = _mm_andnot_si128(lesser_63, lesser_127);
    int sixelPos = _mm_movemask_epi8(sixels_marked);

    while (hasSINGLE) {
      // get LSB: __builtin_ctz (better with _tzcnt_u32, but not supported in wasm)
      int adv = __builtin_ctz(hasSINGLE);
      end = i + adv;
      if (end - start) parse_fragment(start, end, i + (sixelPos ? __builtin_ctz(sixelPos) : 16));
      handle_single(i + adv, ps.chunk[i + adv]);
      start = end + 1;
      hasSINGLE &= ~(1 << adv);
      sixelPos &= ~((1 << adv) - 1);
    }
    end = i + 16;
    if (end - start) parse_fragment(start, end, (sixelPos ? __builtin_ctz(sixelPos) : 16) + i);
  }

  // TODO: rem handling...
}

handle_single does the top level state switching for LF, CR, compression and color. This is pretty expensive due to compression handled here as well (throughput dropped from 2 GB/s to 800 MB/s by that), might need to handle that separately, not sure yet. (There is one thing that would support this idea - simd_paint does not need to flush output at compression borders, if the 4-pixel area is not yet full. Currently I cannot spot that from the fragment borders alone leading to superfluous paint calls.)

Now parse_fragment gets its own start and end offsets, and the start index of the next sixel data (which might be in fragment range or above). With this precalc'ed sixel byte offsets I hope I can do faster aligned sixel-pixel writes. No clue yet if the first part of a fragment (color/compression numbers) should be done in SIMD as well, prolly yes.

Edit: Fixed lousy do-while loop in code.

jerch commented 3 years ago

@ismail-yilmaz The simd_paint code above is incomplete, it needs an additional ~bitmask operation on old before mixing, otherwise the OR'ing will blend with a previously set color (eg. from fillColor).

jerch commented 3 years ago

More findings about sixel parsing - my basic mixed sixel painter looks like this now:

inline void simd_paint(__m128i sixels, int cur) {
  __m128i colors = _mm_set1_epi32(ps.color);
  __m128i ones = _mm_set1_epi32(1);

  int p = cur * 4 + ps.offset;

  for (int i = 0; i < 6; ++i) {
    __m128i singles = _mm_and_si128(sixels, ones);
    __m128i bitmask = _mm_cmpeq_epi32(ones, singles);
    __m128i updated = _mm_and_si128(colors, bitmask);

    __m128i prev = _mm_load_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]]);
    __m128i keep = _mm_andnot_si128(bitmask, prev);

    __m128i final = _mm_or_si128(keep, updated);
    _mm_store_si128((__m128i *) &ps.canvas[p + ps.jump_offsets[i]], final);

    sixels = _mm_srai_epi32(sixels, 1);
  }
}

It is meant to be fed with 4 consecutive sixels (minus 63) and the current 4-pixel aligned cursor position (128bit progression). The color blending with previous color is fixed with a & ~bitmask.

On caller level I found 2 ways being suitable to digest the mixed sixels:

additional "register array" holding vector components This was my first naive approach. It does an additional copy step, which is subpar, but turned out to be very flexible. It can literally be hooked on top of any byte-by-byte parser impl as follows:
```
// defined somewhere (initialized on decode entry)
int reg[4] __attribute__((aligned(16)));
int cursor4;
int has_data;
...
// on sixel write
reg[cursor & 3] = sixel;
has_data |= sixel;
...
// on cursor position change
if (cursor4 != cursor / 4) {
if (has_data) {
  simd_paint(_mm_setr_epi32(reg[0], reg[1], reg[2], reg[3]), cursor4);
  // or with _mm_load_si128((__m128i *) reg) (faster on native)
  has_data = 0;
  reg[0] = 0; reg[1] = 0; reg[2] = 0; reg[3] = 0;
}
cursor4 = cursor / 4;
}
```
This scattered sixel collecting by only writing when cursor4 changes reduces the paint calls by far. On the other hand the array resetting is rather expensive (can be optimized with int64 or int128 nulling). has_data is a needed optimization due to sixel's way to move the cursor around with empty sixels, without it nullish dummy paint calls would be triggered over and over.
faster local processing in sixel state Next I tried to speedup the sixel processing for multiple sixels. The fastest isolated version I've found, is a direct load into SIMD from aligned memory:
```
__m128i zero = _mm_set1_epi8(0);
...
__m128i chunk = _mm_load_si128((__m128i *) chunk_p);
__m128i spead1 = _mm_unpacklo_epi8(chunk, zero);
__m128i sixels = _mm_unpacklo_epi16(spead1, zero);
...
simd_paint(sixels, cursor4);
```
With shifting one can access the other bytes in chunk likewise. But this fastest solution only holds its promise, if the chunk data is properly aligned and you already have ruled/filtered out other bytes. Unless taking expensive pre-steps, thats not given for "natural" chunk data, in fact we have sixel bytes scattered all over the place behind color and compression directives at various positions. I started to investigate top level SIMD parsing with fragment slicing as described above, but thats still work in progress (currently 30% slower with painting applied). A better interim solution for the existing parsers might be, to benefit from the sixel byte state, thus we already know that the current byte describes a sixel, and the next one has a high chance to be a sixel byte as well, without taking chunk alignment into account. Here the fastest version I've found is a shift-or one:
```
// pretest: check of code fits sixel range
sixels = _mm_srli_si128(sixels, 4);
if (code) {
sixels = _mm_or_si128(sixels, _mm_set_epi32(code, 0, 0, 0));
has_data |= code;
}
```
This can be used for any sixels, but needs shift corrections at the beginning and end of the sixel bytes to move sixels in half done 4-pixel ranges into the right spot (we still have to respect the pixel aligment).

The second variant is in general ~20% faster (prolly due to omitting the extra memory roundtrip of the register array), but degrades for images with highly scattered sixel bytes. The reason is obvious - if you always do the shift variant but only one sixel byte at a time comes in, you have to do the shift correction and dummy paints over and over. Imho a combination of both will help here in local sixel byte state:

Do the first unaligned sixels with array register (automatically deals with the "one sixel a time" issue).
Once you hit the cursor mod 4 border, switch to shift variant eagerly, hoping more sixels to read.
When leaving, either do a shift correction + paint or save sixels back to the register. Also measured this one - for cursor % 4 == 1 the register variant is faster, for 2 and 3 the instant paint is faster.

About paint performance: Applying the register array variant to my fastest byte-by-byte parser gives ~20% boost (110 - 120 MB/s) and the shift variant ~30% for most images (120 - 130 MB/s). The paint runtime dropped from ~40% down to <20% total runtime, thus more than halved (did not count the instructions nor cycles for simd_paint vs. put_single, guess SIMD is >2 times lower). I did not yet try an optimized combination of both, but dont expect much higher values with this, just more sustainable numbers across different image data. There still might be perf gems hidden in the sixel to pixel paint path (esp. lowering the amount of dummy paints), currently I hope to settle around stable 130 MB/s on my machine with the byte-by-byte parser loop (all with wasm, native is ~15% faster already).

Things not yet covered:

SIMD paint for compressed sixels - not done yet (expected: 5-10% boost from your numbers)
SIMD number parsing for compression and colors - is that worth to be considered?
SIMD top level slicing - promising interim results as described above, but way to go to cover all edge cases in a speedy manner (expect this one to be >>200 MB/s once all is done)

Cheers :smile_cat:

Edit: Oh sweet, just found out that wasm directly supports _mm_insert_epi32, although it is sse4.1. This makes both variants above obsolete, we can simply carry around a __m128i variable to stack things up and paint on cursor change. Will further speedup things. - Thats actually not better, the dev guide even warns about these intrinsics showing poor performance...

ismail-yilmaz commented 3 years ago

@jerch Just to be clear: I can't comment or give feedback much these days because of my job right now.

I am reading your findings and really appreciate your hard work. Thus I think we should later gather these findings for a recommendation guide. Also maybe, just maybe, we can later use these findings to create a reference SIMD optimized encoder and decoder.

(Btw. Last night I have started to implement the parser in SIMD but now it has to wait for the next week until I can be free to finish & test it and give you some feedback. )

jerch commented 3 years ago

Ah no worries, no need to feel pushed or anything, to me this is mostly optimization for fun. If we get somewhere down to a ref decoder/encoder - awesome, but if not - I dont care much either. Furthermore SIMD introduces machine dependencies, not even sure how that wasm thingy will do on an ARM (whether they abstracted/shimmed things away).

jerch commented 3 years ago

One more idea to further lower half filled paint calls - by stacking the colors in a ringbuffer[4], we could lower paint calls to real 4-pixel progression (including CRLF jumps). Not sure, if this will show a real benefit (or even runs worse because of the ringbuffer overhead), as far as I know most encoder libs do a CR reset on a color change not mixing colors from current line cursor position onwards. Well. would need some investigation on typical sixel encoding schemes...

jerch commented 3 years ago

Some callgrind profiling data (shortened):

file: test1_clean.sixel
sixels: 380283
bytes:  619290

--------------------------------------------------------------------------------
        Ir        Dr        Dw  I1mr D1mr   D1mw ILmr  DLmr   DLmw 
--------------------------------------------------------------------------------

inline put_single:
32,406,294 3,793,752 2,740,265 1,988 29,0 151,80 1,88 7,867 68,950  PROGRAM TOTALS
28,843,839 3,216,772 2,262,066   85 4,806 78,876   85     3     16  ???:decode

inline simd_paint:
31,131,809 4,417,437 2,439,052 1,980 113,0 90,21 1,87 7,868 68,950  PROGRAM TOTALS
27,622,662 3,802,861 1,922,498   79 89,072 16,28   79     4     15  ???:decode

noinline put_single:
33,145,259 4,627,920 3,670,545 1,97 29,08 151,48 1,87 7,867 68,950  PROGRAM TOTALS
15,212,388 2,151,142 1,292,548   69 4,836 15,803   69     3     16  ???:decode.part
14,370,416 1,899,798 1,899,798    4     0 62,724    3     .      .  ???:put_single(int, int)

noinline simd_paint (with prepare_paint helper):
32,095,566 4,907,242 2,565,930 1,97 113,59 90,17 1,874 7,86 68,950  PROGRAM TOTALS
16,726,671 1,835,401 1,093,683   64  6,272 15,24   64     3     15  ???:decode
11,261,790 2,377,489   875,917    6 83,365     0    6     1      .  ???:prepare_paint(int, int)
   597,958    79,776    79,776    4      0   954    4     .      .  ???:put_single(int, int)

noinline simd_paint (optimization - avoid dummy paints):
29,182,602 4,320,719 2,380,693 1,97 102,42 92,29 1,87 7,868 68,949  PROGRAM TOTALS
15,910,154 1,856,080 1,020,775   22  5,775    36   22     3     15  ???:decode
 8,271,960 1,711,440   665,560    6 72,695     0    6     .      .  ???:prepare_paint(int, int)
   798,301    58,846    98,027   39      6 15,94   39     .      .  ???:put(int, int, int)
   597,958    79,776    79,776    4      0  2,83    4     .      .  ???:put_single(int, int)

Instructions/sixel for single paints dropped from ~38 to ~22. Now one SIMD paint calculates to ~88 instructions, which is pretty close to my rough estimate from code (84 instructions), means we have literally no dummy paints anymore. 22 sounds still pretty high per sixel byte, well I currently see no way how to make that any faster.
Instructions/byte for non-sixel bytes is really high with ~90, guess we found the slowpoke :smile_cat:

jerch commented 3 years ago

Started to wonder, how about a simple thing like the RGB color conversion, thus went on and tried SIMD versions of it:

int normalize_rgb(int r, int g, int b) {
  return 0xFF000000 | ((b * 255 + 99) / 100) << 16 | ((g * 255 + 99) / 100) << 8 | ((r * 255 + 99) / 100);
}

int normalize_rgb_simd_int(int r, int g, int b) {
  // algo: ((x * 255 + 99) * 0xA3D7 + 0x8000) >> 22
  __m128i reg = _mm_set_epi32(r, g, b, 100);
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(255));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(99));
  reg = _mm_mullo_epi32(reg, _mm_set1_epi32(0xA3D7));
  reg = _mm_add_epi32(reg, _mm_set1_epi32(0x8000));
  reg = _mm_srli_epi32(reg, 22);
  __m128i result = _mm_shuffle_epi8(reg, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

int normalize_rgb_simd_float(int r, int g, int b) {
  __m128 reg = _mm_set_ps(r, g, b, 100);
  reg = _mm_mul_ps(reg, _mm_set1_ps(2.55f));
  reg = _mm_round_ps(reg, _MM_FROUND_TO_NEAREST_INT);
  __m128i result = _mm_cvtps_epi32(reg);
  result = _mm_shuffle_epi8(result, _mm_set_epi8(
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x80, 0x80, 0x80, 0x80,
    0x00, 0x04, 0x08, 0x0C
  ));
  return _mm_cvtsi128_si32(result);
}

and here is the asm for those (gcc 11) with cycle count in front (taken from intel optimization guide):

  normalize_rgb(int, int, int):
1         mov     r8d, edx
1         mov     edx, esi
1         sal     edx, 8
1         sub     edx, esi
1         add     edx, 99
1         movsx   rax, edx
1         sar     edx, 31
3         imul    rax, rax, 1374389535
1         sar     rax, 37
1         sub     eax, edx
1         mov     edx, edi
1         sal     edx, 8
1         sal     eax, 8
1         sub     edx, edi
1         add     edx, 99
1         movsx   rcx, edx
1         sar     edx, 31
3         imul    rcx, rcx, 1374389535
1         sar     rcx, 37
1         sub     ecx, edx
1         or      eax, ecx
1         mov     ecx, r8d
1         sal     ecx, 8
1         sub     ecx, r8d
1         add     ecx, 99
1         movsx   rdx, ecx
1         sar     ecx, 31
3         imul    rdx, rdx, 1374389535
1         sar     rdx, 37
1         sub     edx, ecx
1         sal     edx, 16
1         or      eax, edx
1         or      eax, -16777216
39        ret

 normalize_rgb_simd_int(int, int, int):
1         mov     eax, 100
1         movd    xmm0, esi
1         movd    xmm1, eax
2         pinsrd  xmm0, edi, 1
2         pinsrd  xmm1, edx, 1
1         punpcklqdq      xmm1, xmm0
1         movdqa  xmm0, xmm1
1         pslld   xmm0, 8
1         psubd   xmm0, xmm1
1         paddd   xmm0, XMMWORD PTR .LC0[rip]
10        pmulld  xmm0, XMMWORD PTR .LC1[rip]
1         paddd   xmm0, XMMWORD PTR .LC2[rip]
1         psrld   xmm0, 22
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
26        ret

 normalize_rgb_simd_float(int, int, int):
1         pxor    xmm2, xmm2
1         pxor    xmm1, xmm1
1         pxor    xmm3, xmm3
1         movss   xmm0, DWORD PTR .LC4[rip]
5         cvtsi2ss        xmm2, edx
5         cvtsi2ss        xmm1, esi
5         cvtsi2ss        xmm3, edi
1         unpcklps        xmm0, xmm2
1         unpcklps        xmm1, xmm3
1         movlhps xmm0, xmm1
5         mulps   xmm0, XMMWORD PTR .LC5[rip]
6         roundps xmm0, xmm0, 0
3         cvtps2dq        xmm0, xmm0
1         pshufb  xmm0, XMMWORD PTR .LC3[rip]
1         movd    eax, xmm0
38        ret

Performance: the normal version wins, simd_int is 2% slower, simd_float is 5% slower. This is somewhat weird when only looking at the cycle counts, but makes sense when taking parallel µ-ops into account. This will change, if the channel values would not be provided from normal registers but are already in SIMD (the alignment and preparation steps are quite expensive currently). Comparing simd_int to simd_float the int variant for the "inner algo" is one cycle faster (15 vs, 16 cycles), even with the nasty pmulld.

Edit: the float version drops down to 20 cycles, if already fed with floats (and currently wins the race in the decoder). Still think the int variant will be the winner in the end, as the values are not meant to be floats. I really wonder, if I could get rid of that pmulld somehow...

jerch commented 3 years ago

@ismail-yilmaz I kinda give up with the high level SIMD tokenization. I tried 3 different approaches now (with the last being the fastest, but still 25% slower than my fastest byte-byte loop). It turns out that the high fragmentation of sixel data needs many state changes for one 16 byte slice (one register load), which creates lots of single byte extractions (which are cumbersome with SIMD) and alignment corrections on number and sixel bytes (the only ones that are consecutive to some degree). But again those consecutive bytes cannot be processed directly in SIMD, as they need special preparations:

sixel bytes must be padded to 32bit and aligned to the 4-pixel progression
number bytes must be padded to 16bit to use _mm_madd_epi16, number of digits have different code paths The number conversion with madd is indeed much faster for +3-digit numbers (3 times faster), but falls short for 1-digit numbers (4 times slower). The additional need for branching over the number of digits introduces another nonsense penalty, esp in wasm (that digit switch made the wasm build 30% slower). Ofc chances are high that I overlooked here something, yet sixel data often contain 1- and 2-digit numbers, thus I think there is not much to be gained here.

Nonetheless here is my last and fastest top level SIMD attempt, in case you want to play with it or have better idea, how to approach the fragment data:

typedef union Vec128 {
  __m128i  vector;
  long long i64[2];
  int i32[4];
  short i16[8];
  char byte[16];
} Vec128;

inline void handle_unclear(char code) {
  switch (code) {  // <---- thats the speed showstopper...
    case '!':
      ps.state = ST_COMPRESSION;
      break;
    case '#':
      ps.state = ST_COLOR;
      break;
    case '$':
      ps.cursor = 0;
      break;
    case '-':
      ps.y_offset += 6;
      ps.offset = ps.y_offset * ps.width + 16;
      ps.cursor = 0;
      break;
    case ';':
      break;
  }
}

static int sixels_or_numbers_counter = 0;
inline void handle_sixels_or_numbers(Vec128 reg, int start, int end, int number_bits, int sixel_bits) {
  // to not get removed by optimzation
  sixels_or_numbers_counter++;

  // follow-up code removed for less cluttering
}

// like BLSR, but not available in wasm
inline int clear_bits_below(int x) {
  return x & (x - 1);
}

inline __m128i mask_sixels(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(65));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-64));
}

inline __m128i mask_numbers(const __m128i input) {
  const __m128i tmp = _mm_add_epi8(input, _mm_set1_epi8(80));
  return _mm_cmplt_epi8(tmp, _mm_set1_epi8(-118));
}

void decode____(int length) {
  Vec128 reg;

  int l = length / 16 * 16;
  int rem = length - l;
  if (rem) {
    for (int k = l + rem; k < l + 16; ++k) ps.chunk[k] = 0;
    l += 16;
  }
  for (int i = 0; i < l; i += 16) {
    reg.vector = _mm_lddqu_si128((__m128i *) &ps.chunk[i]);

    // strip high bit
    reg.vector = _mm_and_si128(reg.vector, _mm_set1_epi8(0x7F));

    // identify sixel & numbers
    int sixels = _mm_movemask_epi8(mask_sixels(reg.vector));
    int numbers = _mm_movemask_epi8(mask_numbers(reg.vector));

    // identify unclear bytes
    int unclear = 0xFFFF & ~(sixels | numbers);

    int pos = 0;
    while (unclear) {
      int adv = __builtin_ctz(unclear);
      if (pos < adv) {
        handle_sixels_or_numbers(reg, pos, adv, numbers, sixels);
      }
      handle_unclear(reg.byte[adv]);
      pos = adv + 1;
      unclear = clear_bits_below(unclear);
    }
    if (pos < 16) {
      handle_sixels_or_numbers(reg, pos, 16, numbers, sixels);
    }
  }
}

This top level SIMD loop needs alot less instructions at the 16-bytes slice than my first version, and groups slices into unclear and sixels_or_numbers. The __builtin_ctz is problematic for machines without native TZCNT (< haswell I think), typically it gets triggered around 2 - 6 times for every 16-byte from my data analysis. Everything else is a piece of cake in that main loop (throughput ~3 GB/s). The real perf showstopper is the switch statement in handle_unclear (throughput drops down to 700 MB/s). At first I wondered if the byte extraction in reg.byte[adv] is the problem, but replacing the switch with a simple add on a static variable restore most of the speed. It seems that the CPU doesnt like branching here. Maybe you have a good idea how to prevent that perf decrease, I didnt so far, the code paths are way to different to think about a branch-less version.

When applying the real data parsing (color/compression/sixels) down the code paths, things get ugly due to extra work needed for padding and alignment (not shown above). So far the only ground that shows real speed improvement with SIMD is the sixel-to-pixel path.

I gonna stop those top level SIMD attempts for now until you have a fundamental better idea, how to approach the data. In the meantime I gonna try to micro optimize my byte-by-byte loop (repeated SIMD painting still missing). The last trick I have in mind is to reduce the dummy paints* to almost 0, at least across compression-sixels-compression progression (not really useful across color changes as they almost always contain cursor movements with CR). Looking at the real asm output, simd_paint is only 72 instructions, means that my calculated 88 instructions still contains 20% dummy calls. Getting rid of those should give another 3-5% speedup. Thats not much, but would give me 130-160 MB/s for most images I have for testing here, which is still good compared to 92-95 MB/s without SIMD. After that change I am out of ideas. :smile_cat:

[*] You might have wondered in my last posts about "dummy paints". I call any paint a "dummy paint", if the 4-pixel range is underfull, thus sixels are missing, which create additional "nonsense" paints with simd_paint. Example:

$!5abc!3def creates the following:

repeated paint a 5 times --> 2 4-pixel aligned paints: [a, a, a, a], [a, 0, 0, 0]
paint bc --> 1 4-pixel aligned paint: [0, b, c, 0]
repeated paint d 3 times --> 2 4-pixel aligned paints: [0, 0, 0, d], [d, d, 0, 0]
paint ef --> 1 4-pixel aligned paint: [0, 0, e, f]

In total simd_paint gets called 6 times, but only writes [a, a, a, a], [a, b, c, d], [d, d, e, f] in the end (3 calls would do), thus we have 3 dummy paints.

jerch commented 3 years ago

Did some more local changes to my byte-to-byte loop (mostly introducing inner loops with less conditions):

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.85 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 127.80 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.75 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 116.20 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.41 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 147.54 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 84.26 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 184.07 MB/s

This gets me to 39-40 MB/s synchronous in xterm.js. I think I gonna stop with optimizations here, as the numbers in xterm.js start to look funny (one run of sixel-bench):

(The utf8 conversion does not more than a uint8 --> uint32 padding expansion (xterm.js works with utf32 internally) and the sequence parser already has a fast inner loop with only 3 conditions to check for DCS data. I hate JS for being so slow on such simple tasks. Would not have happened with wasm.)

ismail-yilmaz commented 3 years ago

@jerch ,

This gets me to 39-40 MB/s synchronous in xterm.js.

That sounds like a very good number for a JS sixel decoder (o so I assume, because the only one I know in the wild is yours.)

I'll take your findings and try moving it forward on a test setup , starting this monday as I will have a window (we'll have a national holiday season this whole week.) to work on something fun. Will post my findings on x86_64. So. stay tuned. :)

jerch commented 3 years ago

Latest numbers:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.68 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 134.27 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.58 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 124.60 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.79 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 142.69 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 77.71 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 199.62 MB/s

While I missed my >130 MB/s goal with "test2_clean.sixel" (not even sure why), I got a nice boost on the fullHD noise image, which is now at 12fps. Thats still very low, on the other hand the image is really hard to chew on with its 4096 colors and the single pixel addressing (cursor kinda moving from 0 to target position for tons of 1bit-sixel). I have no test data for normal fullHD stuff, but I think it is capable to stay >30 fps for less degenerated data.

jerch commented 3 years ago

@ismail-yilmaz Not sure how good you know callgrind, I have an interesting case, which I dont quite understand, maybe you have an idea or can explain, whats going on.

Playing around with the byte-by-byte loop I found this loop schematics to be the fastest:

int decode(int length) {
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          if (++i >= length) break;  // <--------
          code = ps.chunk[i] & 0x7F;
        }
        break;
    }
  }
}

It basically sub-loops at all positions, where multiple occurrences are possible skipping the outer loop and switch statement. Ofc now I need those inner break conditions on length (marked with <-----).

Now looking into callgrind I see many instructions being burnt on those length-break checks. Thus I went further and restructured the break conditions as outer fall-through:

int decode(int length) {
  ps.chunk[length] = 0xFF;  // new fall-through break condition in data
  for (int i = 0; i < length; ++i) {
    int code = ps.chunk[i] & 0x7F;
    switch (ps.state) {
      case ST_DATA:
        while (62 < code && code < 127) {
          ...  // process sixel bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
      case ST_COMPRESSION:
        while (47 < code && code < 58) {
          ...  // process number bytes
          code = ps.chunk[++i] & 0x7F;
        }
        break;
    }
  }  // 0x7F in code will reach here as NOOP breaking the for loop eventually on next i < length check
}

This works great and reduces counted instructions for test1_clean.sixel by almost 4%! But runtime did not change at all for native, and in wasm it is even ~2% slower. Any clue whats going on? Is the instruction counter flawed in callgrind? Or is this some sort of a pipeline effect, where those excess instructions are hidden as uop not altering whole pipeline?

ismail-yilmaz commented 3 years ago

@jerch

Or is this some sort of a pipeline effect, where those excess instructions are hidden as uop not altering whole pipeline?

I haven't checked it yet ( I will), but it is likely a pipeline effect :(on x86, at least):

Remember this? That is exactly how I process parameters (and also optimized my sequence parser) on x86_64. In almost all scenarios the while loop works faster than the for loop for such operations that it is likely to loop more than once.

E.g two style of reading parameter chunks (Size: 95.37 MB):

For loop:

int GetNum1(const String& s)
{
    int n = 0;
    for(const int& c : s) {
        if(dword(c - '0') > 9)
            break;
         n = n * 10 + (c - '0');
    }

    return n;
}

While loop

int GetNum2(const String& s)
{
    const char *p = ~s;
    int c = 0, n = 0;
    while(*p && dword((c = *p++) - '0') < 10)
        n = n * 10 + (c - '0');

    return n;
}

Timing (20 rounds, O3):

TIMING GetNum2 (while):  2.75 s  - 137.50 ms ( 2.75 s  / 20 ), min: 135.00 ms, max: 141.00 ms, nesting: 0 - 20
TIMING GetNum1 (for)    :  2.94 s  - 147.20 ms ( 2.94 s  / 20 ), min: 144.00 ms, max: 151.00 ms, nesting: 0 - 20

ismail-yilmaz commented 3 years ago

The problem with this approach, in my experience at least, is it performs worse with single sixel paints (branch misses tends to be high and winding/unwinding the stack for the loop at that "hot" point has a greater cost than its potential gains.)

jerch commented 3 years ago

The problem with this approach, in my experience at least, is it performs worse with single sixel paints (branch misses tends to be high and winding/unwinding the stack for the loop at that point has a greater cost than its gain.)

Yes thats true, but the image triggers that case only 157 times. Overall the sub loops are much faster in wasm (not so much in native). Ofc the stack unwinding back from the inner switch case might be much more expensive here (br_table moves quick some stuff on the stack in wasm). Well I dont have something like callgrind for wasm, thus can only profile things indirectly.

Nonetheless will try to reshape the outer thing into a while loop as well, it is abit annoying that I cannot use pointer stuff that easy in wasm, not clue why I get such a bad penalty for using those.

jerch commented 3 years ago

@ismail-yilmaz Note there is an issue with your while loop above - your loop condition tests for *p not being zero. Imho you cannot assume null termination here. Although NULL in terminal data is a NOOP for all states I know, it does not mark end of data by any means. And the parser from vt100.net would pass NULL along, thus it has to be handled by the subparser (sixel decoder here).

(Some background on this: NULL was used by some very early terminals as "time padding bytes" (give the device some time to keep up), thus it is theoretically possible to see cascades of NULLs in old data.)

ismail-yilmaz commented 3 years ago

@jerch ,

Ah, that's just a synthetic test loop to display the difference, the actual parser is based on Upp::Stream and uses Stream::Get() and Stream::Peek(), which return -1 on eof or stream error (meaning, 0 is considered a valid byte.

Besides, that loop can be used to parse the parameters in my use-cases, because our parser, firstcollects and then splits the parsed valid parameter chunks into a vector of Strings. Strings are null terminated. So at that point it is guaranteed to have no '\0' in the string. except for the null terminator.

jerch commented 3 years ago

Another optimization idea I had while looking at callgrind metrics - currently simd_paint puts colors for bits into final canvas target cells, which creates a rather high cache miss rate for the load/store ops (~7%, reason is the massive cell offset created by the the 6 pixel progression in y-direction and loading/blending with previous colors). This could be avoided by doing a local construction with contiguous writes:

input sixel: [a, b, c, d]
local canvas line writes: [[a1, b1, c1, d1], ..., [a6, b6, c6, d6]]

and a final copy step on LF:

memcpy: *[a1-d1] --> canvas_offset_1
...
memcpy: *[a6-d6] --> canvas_offset_6

This has several advantages in terms of cache and SIMD utilisation:

load/blend/write dance during construction has much better cache locality due to contiguous mem access
writes are guaranteed to be 4 * 32bit aligned (good for simd load/store)
final copy step is still cache-unfriendly, but can be done in steps with high cache utilization (e.g. in 16 pixel progression)
final copy step is a "write-through", no loading/blending needed anymore

Ofc this has a major drawback - the additional copy step itself, which might just be more expensive than the savings from better cache utilization achieved. Remains uncertain until actually tried.

Another similar idea would be to spread sixel bits into SIMD registers (horizontal spreading), instead of the current shift-or looping (vertical spreading):

input sixel: [a, b, c, d]
local canvas line writes: [[a1, a2, a3, a4], [a5, a6, b1, b2], ..., [d3, d4, d5, d6]]

Well I dont expect that to be any faster for 3 reasons:

bit spreading is quite cumbersome with SIMD (needs several shuffle ops)
6 bits of sixel dont align well with 4 * 32bit in registers - needs branching/looping over [a1-a4], [a5-a6, b1-b2], [b3-b6] writes with expensive construction step for the interleaved 2. case
final copy step gets really awkward and prolly really expensive (has to extract/shuffle bit positions into contiguous order)

Not sure if I will try any of these ideas, they need quite some canvas construction refactoring, before they would work at all...

jerch commented 3 years ago

Some more microoptimizations here and there:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 20 runs - average runtime: 4.16 ms
         Case "test1_clean.sixel" : 20 runs - average throughput: 149.45 MB/s
         Case "test2_clean.sixel" : 20 runs - average runtime: 2.42 ms
         Case "test2_clean.sixel" : 20 runs - average throughput: 132.83 MB/s
         Case "sampsa_reencoded_clean.six" : 20 runs - average runtime: 4.05 ms
         Case "sampsa_reencoded_clean.six" : 20 runs - average throughput: 161.45 MB/s
         Case "FullHD 12bit noise" : 20 runs - average runtime: 67.41 ms
         Case "FullHD 12bit noise" : 20 runs - average throughput: 230.24 MB/s

Also tested against 5 fullHD screenshots, they all run at 40-50 FPS, and the degraded noise example is now at 15 FPS.

Furthermore I partially tested the first idea of my last post - well, it is not worth the trouble. A cache optimized load/store in paint_simd gives only 3% boost, but gets more than eaten by the final canvas construction/copy, which suffers the same cache problem (well less often, but still taking >5%).

Will push the code once I cleaned it up abit.

jerch commented 3 years ago

Extended the benchmark script with more image relevant metrics to get some more explanation of bottlenecks:

      Context "decode - testfiles (WasmDecoder)"
         Case "test1_clean.sixel" : 50 runs - average runtime: 3.68 ms
         Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
          --> image throughput { FPS: '271.50', PPS: '250.21 M', pixelWrite: '1.00 GB/s' }
         Case "test2_clean.sixel" : 50 runs - average runtime: 2.15 ms
         Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
          --> image throughput { FPS: '465.29', PPS: '104.46 M', pixelWrite: '417.86 MB/s' }
         Case "sampsa_reencoded_clean.six" : 50 runs - average runtime: 3.60 ms
         Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
          --> image throughput { FPS: '277.93', PPS: '305.00 M', pixelWrite: '1.22 GB/s' }
         Case "FullHD 12bit noise" : 50 runs - average runtime: 64.36 ms
         Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
          --> image throughput { FPS: '15.54', PPS: '32.22 M', pixelWrite: '128.87 MB/s' }
         Case "640x480 9bit tiles" : 50 runs - average runtime: 0.67 ms
         Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
          --> image throughput { FPS: '1494.92', PPS: '459.24 M', pixelWrite: '1.84 GB/s' }
         Case "FullHD 1" : 50 runs - average runtime: 20.36 ms
         Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
          --> image throughput { FPS: '49.11', PPS: '101.84 M', pixelWrite: '407.36 MB/s' }
         Case "FullHD 2" : 50 runs - average runtime: 22.69 ms
         Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s
          --> image throughput { FPS: '44.07', PPS: '91.38 M', pixelWrite: '365.50 MB/s' }

FPS is frame per second (here full images decoded per second), PPS is pixels per second and pixelWrite the effective pixel write memory throughput.

To make sense of these numbers, we need to know some facts about the images itself:

real world images:
- test1: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder node-sixel
- test2: palette 256, RGB encoded, no transparency, dimensions not in 4^n, encoder node-sixel
- sampsa: palette 256, RGB encoded (monochrome), no transparency, dimensions in 4^n, encoder node-sixel
- FullHD 1 & 2: palette 256, RGB encoded, no transparency, dimensions in 4^n, encoder img2sixel
constructed images:
- worst case - FullHD 12bit noise: palette 4096, RGB encoded, no transparency, dimensions in 4^n, degraded with tons of CR repositioning and single sixel bits
- best case - 640x480 9bit tiles: palette 512, RGB encoded, no transparency, dimensions in 4^n, compressed sixels with high bit density (10x10 color tiles)

Interpretation:

Real world images, that did not see special care, seem to settle around ~100M PPS or 400 MB/s effective pixel write throughput (that is "test2", "FullHD 1", "FullHD 2"). Still sounds rather lowish to me.
"test1" runs much better than the other real images, maybe some sort of over-optimization (it is my standard profiling image).
No clue yet, why "sampsa" runs that much better (only difference is it being monochrome, but the RGB converter does not special case that).
"test2" dimensions are not 4^n aligned, which leads to unaligned SIMD access. While I suspected this to be a major bottleneck, it only accounts for -5% speed on "test2" (tested with an aligned version of that image).
The images show all the same cache behavior (4-5% missrate), the results are not affected by different cache behavior for different image sizes (tested up to 1920x1080). This was further confirmed by tests with pixel writes against a fixed zero offset - the throughput only changes in +- 3%. I expect a big perf dent for very "wide" images, where 6 * (RGBA8888) image_width will overflow L3 cache size / invalidate often (not tested). But since most CPUs have at least 2MB cache these days, this will not happen for normally shaped images. (Ofc there is another cache drain step at L1/L2 border, that will be triggered much earlier somewhere around 600 - 1300px width, did not yet test for that). To make it short - the cache unfriendly sixel format is not a problem on L2/L3 vs. DRAM level.
pixelWrite highly varies across the images, while the input throughputs are pretty close. This indicates, that the input loop and the preparation is the main limiting factor, not the pixel writes (we already knew that). It also reveals, that, while "worst case" has the highest absolute input throughput, it really sucks in pixel writes. Not sure yet, if the degraded CR + empty sixel jumps can be further optimized. This image also has the worst sixel bit density with mainly 1 bit per sixel byte. Also the 4-pixel simd write cannot help here much, as a color change is very likely for every sixel, which leads to more "dummy paints" with excess 4-pixel load/store cycles - theoretically up to 4 times more often for single color single pixel addressing (have not analysed the data, at least the PPS numbers suggest degradation around 3 times).

Summary: These new numbers do not help as much as I had hoped at the decoder alone, they mostly confirm what we already knew - the input consumption being slow, especially for low sixel bit density. But they might come handy for an encoder to try different strategies and maybe speed up the decoder indirectly by outputting more decoder-friendly data.

jerch commented 3 years ago

@ismail-yilmaz Some updates from my side. Have not yet pushed the code as the changes and integration still takes time, because of the level 1 vs. level 2 handling and the question, whether to truncate at raster dimensions.

While trying to find a working API to cover all these details I made and interesting observation, that might also help you to get better cache utilization:

My wasm implementation started as level2 truncating only, as it promised way greater optimizations. In the end we are only interested in the final pixel array, thus it seemed obvious to do that as one big blob of memory to avoid copy costs (even static to save pointer indirections). Now while trying to solve the level1 / non-truncating issue I started playing around with a sixel-line decoder (6-pixels in height), where after finishing the line the pixels need to be copied over before the next line can be processed. This involves one additional copy action for the whole pixel data in the end. To my surprise this is in wasm slightly faster than the big blob approach (~ 1-3%, I think in native as well, but did only a few checks).

Imho this is a pretty big cache side effect overcompensating an additional copy step (prolly due to the big blob going into far memory regions for big images with bad cache locality again). Still have to understand the details, also I am not settled yet, whether to keep the level 1/2 distinction, or go with just one general purpose decoder in the end (that would be way easier to maintain).

If you want to try something similar, I did the following:

set line length to 32768 pixels at max (should be some time before we get to 64k)
spread the 6 pixel lines of a sixel at those offsets (needs 12 * 65536 memory pages in wasm, ~ 0.75 MB)
whenever a sixel line is finished (or at end of image), copy the written pixels elsewhere
flush line memory with fill_color
repeat with next sixel line

On the first sight it is not really obvious, why this might lead to better cache utilization, as the memory usage is quite spread across those 0.75 MB. Imho the following happens: really every memory interaction (sixel-to-pixel write, off-copy, flush) touches these 6 memory areas prolly keeping it hot in the cache. Only for very wide images thrashing will occur, when the cursor gets too far to the right. Just a theory atm, if it is true there should be a significant throughput drop at some line width depending on the cache size.

ismail-yilmaz commented 3 years ago

Hello @jerch ,

Thanks for the update! I am reading your posts, passively. As you prolly noted, I've taken a short break, after a long and very exhausting season. I'll be back, and try to implement + continue testing + reporting back as usual starting from Aug 29 and on.

jerch commented 3 years ago

Trying to get a hold of the cache effects - with a very simple autogenerated image across various widths I see the following behavior in my benchmark:

      Context "Cache"
         Case "16" : 100 runs - average throughput: 49.26 MB/s
         Case "64" : 100 runs - average throughput: 95.27 MB/s
         Case "128" : 100 runs - average throughput: 116.08 MB/s
         Case "256" : 100 runs - average throughput: 111.16 MB/s
         Case "512" : 100 runs - average throughput: 106.59 MB/s
         Case "1024" : 100 runs - average throughput: 106.55 MB/s
         Case "4096" : 100 runs - average throughput: 93.51 MB/s
         Case "16384" : 100 runs - average throughput: 90.70 MB/s

It seems the throughput peaks somewhere around 128px width and has a bigger drop between 1024 and 4096. Doing these areas more in detail reveals, that the throughput peaks between 120-180px with 125 MB/s max, and around 1200-1400px the throughput drops from ~107 to ~95 MB/s out of a sudden.

The drop around 1300px is pretty close to my L1 cache size of 32kB, which strongly suggests that cache behavior changed here. I have no good explanation for the lower peak yet, it simply might be a summation effect of overall buffer/cache utilization. Furthermore my tests are partially screwed by GC actions, which might show up non-deterministically (might be the reason for the rather big ranges across the runs). Trying to test cache behavior with JS is def. not a good idea :sweat_smile:.

(Btw very small images <64px width are much slower in throughput, I guess that the function call overhead for off-copying pixels gets significant here.)

jerch commented 3 years ago

The new line-based decoder opened the field for further optimizations, which kinda makes the SIMD attempts obsolete under wasm.

line-based w'o SIMD:

  Context "decode - testfiles (WasmDecoder)"
     Case "test1_clean.sixel" : 50 runs - average throughput: 158.07 MB/s
     Case "test2_clean.sixel" : 50 runs - average throughput: 157.47 MB/s
     Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 156.20 MB/s
     Case "FullHD 12bit noise" : 50 runs - average throughput: 325.19 MB/s
     Case "640x480 9bit tiles" : 50 runs - average throughput: 146.03 MB/s
     Case "FullHD 1" : 50 runs - average throughput: 172.81 MB/s
     Case "FullHD 2" : 50 runs - average throughput: 180.82 MB/s

big blob with SIMD:

  Context "decode - testfiles (WasmDecoder-simd)"
     Case "test1_clean.sixel" : 50 runs - average throughput: 164.39 MB/s
     Case "test2_clean.sixel" : 50 runs - average throughput: 148.45 MB/s
     Case "sampsa_reencoded_clean.six" : 50 runs - average throughput: 181.07 MB/s
     Case "FullHD 12bit noise" : 50 runs - average throughput: 240.98 MB/s
     Case "640x480 9bit tiles" : 50 runs - average throughput: 151.24 MB/s
     Case "FullHD 1" : 50 runs - average throughput: 166.72 MB/s
     Case "FullHD 2" : 50 runs - average throughput: 167.93 MB/s

It seems SIMD is not yet that usable and well optimized for wasm, thus I will stick with the generic version for now, which also covers more browsers. Note that for x86-native the sixel SIMD decoder runs "test1_clean.sixel" with 220 MB/s (~40% faster).

offtopic: Kinda made an even worse observation with my base64 lib, where under wasm SIMD actually runs at half the speed of the generic version, while under x86-native it is 5x faster. Well, the base64 decoder in SIMD is very stack-var demanding with cumbersome access needs, which hurts alot more on a stack machine like wasm, while a register machine can carry along those values in registers.

jerch commented 3 years ago

@ismail-yilmaz FYI - pushed the my latest wasm code to the PR: https://github.com/jerch/node-sixel/pull/20.