Display optimizations (between 2x and 8x faster) (ignore, superseded by #160)

Context

While the runtime of a general application using pyte is dominated by stream.feed for the standard geometry (24x80), the runtime of screen.display becomes dominant for larger geometries (240x800, 2400x80, 24x8000).

This is because screen.display does not take advantage of screen.buffer being sparse: it iterates over the whole range of possible coordinates (x, y) in the screen and wastes time looking up entries that do not exist in screen.buffer.
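The difference between the two iteration strategies can be sketched as follows. This is a minimal illustration, not pyte's code: the buffer is modeled as a plain mapping of row to {column: character}, and the names render_dense and render_sparse are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for a sparse screen buffer: row -> {column -> char}.
# Only cells that were actually written exist as entries.
def make_buffer():
    return defaultdict(dict)

def render_dense(buffer, lines, columns, blank=" "):
    """Visit every (y, x) coordinate, even the ones that were never written."""
    out = []
    for y in range(lines):
        row = buffer.get(y, {})
        out.append("".join(row.get(x, blank) for x in range(columns)))
    return out

def render_sparse(buffer, lines, columns, blank=" "):
    """Visit only the cells that exist and fill the gaps with blanks."""
    out = []
    for y in range(lines):
        row = buffer.get(y)
        if not row:
            # Nothing was written on this line: no per-column work at all.
            out.append(blank * columns)
            continue
        cells = [blank] * columns
        for x, ch in row.items():
            if 0 <= x < columns:
                cells[x] = ch
        out.append("".join(cells))
    return out

buffer = make_buffer()
buffer[0][0] = "h"
buffer[0][1] = "i"
buffer[2][5] = "!"

# Both strategies produce the same lines; the sparse one does work
# proportional to the number of written cells, not to lines * columns.
assert render_dense(buffer, 3, 8) == render_sparse(buffer, 3, 8)
```

With a mostly empty 24x8000 screen, the dense version still performs one lookup per column on every line, while the sparse version only touches the cells that were drawn.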

Proposal

This PR makes 4 changes to the screen.display method to make it faster:

make screen.display aware that screen.buffer is sparse and iterate over the chars that really exist instead of over the whole range of coordinates (bfeab39c2)
inline the generator into a for-loop: generators coded in Python (not in C) perform worse than a traditional for-loop, so this change is an easy win (5b32e257b)
remove an assert that was called for every single char: the corresponding check was moved to the tests so we don't lose coverage (13ee784ac)
cache wcwidth on each char: while wcwidth is already a cached function (thanks to functools), calling it still costs a function call. We can avoid that by storing the result of wcwidth on the char during screen.draw and reusing it later in screen.display (c298bd358)

Results

For the standard geometry of 24x80 we got the following improvement on screen.display:

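| Benchmark                                                    | 0.8.1    | With optimizations              |
|--------------------------------------------------------------|----------|---------------------------------|
| [screen_display 24x80] cat-gpl3.input->Screen                | 656 us   | 135 us: 4.86x faster            |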
| [screen_display 24x80] cat-gpl3.input->DiffScreen            | 647 us   | 131 us: 4.93x faster            |
| [screen_display 24x80] cat-gpl3.input->HistoryScreen         | 693 us   | 137 us: 5.07x faster            |
| [screen_display 24x80] find-etc.input->Screen                | 672 us   | 84.6 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->DiffScreen            | 662 us   | 83.4 us: 7.94x faster           |
| [screen_display 24x80] find-etc.input->HistoryScreen         | 718 us   | 85.1 us: 8.43x faster           |
| [screen_display 24x80] htop.input->Screen                    | 602 us   | 246 us: 2.45x faster            |
| [screen_display 24x80] htop.input->DiffScreen                | 599 us   | 244 us: 2.46x faster            |
| [screen_display 24x80] htop.input->HistoryScreen             | 604 us   | 250 us: 2.42x faster            |
| [screen_display 24x80] ls.input->Screen                      | 660 us   | 137 us: 4.82x faster            |
| [screen_display 24x80] ls.input->DiffScreen                  | 663 us   | 136 us: 4.89x faster            |
| [screen_display 24x80] ls.input->HistoryScreen               | 678 us   | 136 us: 4.97x faster            |
| [screen_display 24x80] mc.input->Screen                      | 563 us   | 277 us: 2.03x faster            |
| [screen_display 24x80] mc.input->DiffScreen                  | 551 us   | 285 us: 1.93x faster            |
| [screen_display 24x80] mc.input->HistoryScreen               | 574 us   | 277 us: 2.07x faster            |
| [screen_display 24x80] top.input->Screen                     | 644 us   | 154 us: 4.19x faster            |
| [screen_display 24x80] top.input->DiffScreen                 | 649 us   | 152 us: 4.26x faster            |
| [screen_display 24x80] top.input->HistoryScreen              | 663 us   | 158 us: 4.20x faster            |
| [screen_display 24x80] vi.input->Screen                      | 623 us   | 165 us: 3.77x faster            |
| [screen_display 24x80] vi.input->DiffScreen                  | 622 us   | 170 us: 3.66x faster            |
| [screen_display 24x80] vi.input->HistoryScreen               | 647 us   | 169 us: 3.84x faster            |

For larger geometries screen.display became 10x, 100x and almost 1000x faster.

For stream.feed some benchmarks show a minimal improvement and others a minimal regression (*)

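| Benchmark                                                    | 0.8.1    | With optimizations              |
|--------------------------------------------------------------|----------|---------------------------------|
| [stream_feed 24x80] cat-gpl3.input->Screen                   | 48.3 ms  | 49.2 ms: 1.02x slower           |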
| [stream_feed 24x80] cat-gpl3.input->DiffScreen               | 46.7 ms  | 47.6 ms: 1.02x slower           |
| [stream_feed 24x80] cat-gpl3.input->HistoryScreen            | 155 ms   | 149 ms: 1.04x faster            |
| [stream_feed 24x80] find-etc.input->DiffScreen               | 92.6 ms  | 96.7 ms: 1.04x slower           |
| [stream_feed 24x80] find-etc.input->HistoryScreen            | 319 ms   | 303 ms: 1.05x faster            |
| [stream_feed 24x80] htop.input->Screen                       | 21.9 ms  | 21.2 ms: 1.03x faster           |
| [stream_feed 24x80] htop.input->DiffScreen                   | 21.6 ms  | 21.2 ms: 1.02x faster           |
| [stream_feed 24x80] ls.input->Screen                         | 2.29 ms  | 2.23 ms: 1.03x faster           |
| [stream_feed 24x80] ls.input->DiffScreen                     | 2.19 ms  | 2.22 ms: 1.02x slower           |
| [stream_feed 24x80] ls.input->HistoryScreen                  | 7.17 ms  | 6.87 ms: 1.04x faster           |
| [stream_feed 24x80] mc.input->HistoryScreen                  | 46.5 ms  | 45.4 ms: 1.02x faster           |
| [stream_feed 24x80] top.input->Screen                        | 2.49 ms  | 2.41 ms: 1.03x faster           |
| [stream_feed 24x80] top.input->DiffScreen                    | 2.54 ms  | 2.45 ms: 1.04x faster           |
| [stream_feed 24x80] top.input->HistoryScreen                 | 7.69 ms  | 7.28 ms: 1.06x faster           |
| [stream_feed 24x80] vi.input->Screen                         | 4.72 ms  | 4.53 ms: 1.04x faster           |

(*) I don't think that the results of stream.feed are meaningful; the discrepancies look more like noise. In a separate analysis of pyperf (the tool that we use for the benchmarks), it seems that it uses the average instead of the minimum of the samples, which makes the results slightly unstable.
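To illustrate why averaging can produce differences of this size, here is a small sketch with made-up timings (illustrative numbers only, not measurements from this PR). Two runs of identical code are compared; one of them contains a single slow sample. The helper name ratio is hypothetical.

```python
import statistics

# Made-up timings in milliseconds for two runs of the *same* code.
# run_a has one outlier, as could happen when the machine is briefly busy.
run_a = [48.1, 48.3, 48.2, 52.9, 48.2]
run_b = [48.2, 48.1, 48.3, 48.2, 48.4]

def ratio(a, b, summarize):
    """How many times slower run a looks than run b under a given summary."""
    return summarize(a) / summarize(b)

mean_ratio = ratio(run_a, run_b, statistics.mean)
min_ratio = ratio(run_a, run_b, min)

# The mean is pulled up by the outlier, so run_a looks about 2% slower;
# the minimum ignores it and reports no difference.
print(f"mean: {mean_ratio:.2f}x, min: {min_ratio:.2f}x")
```

A single outlier among five samples is enough to shift the mean by about 2%, which is the same order of magnitude as the stream.feed differences in the table above.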

Full results are in benchmark_results/: one file has the performance for 0.8.1 while the other includes the optimizations. These benchmarks were run with the auxiliary script fullbenchmark.
