olikraus / ucglib

Arduino True Color Library for TFTs and OLEDs
https://github.com/olikraus/ucglib/wiki
Other
261 stars 76 forks source link

Display driver architecture performance bottleneck #141

Open fornellas opened 3 years ago

fornellas commented 3 years ago

I have been working on a driver for ST7789 (#132). While doing so, I learned that the current model for HAL has too many layers that do "nothing" and make things runs at half the max speed.

The problem

Currently we have to define a function with:

  switch(msg) {
    case UCG_MSG_DEV_POWER_UP:
      ...

and it is set at com_cb at struct _ucg_t. This is used by other layers, like ucg_com_SendString. For example, this code:

  for(ucg_int_t i = 0; i < ucg->arg.len; i++ ) {
    ucg_com_SendString(ucg, 2, buff);
    ucg->arg.pixel.pos.x+=dx;
    ucg->arg.pixel.pos.y+=dy;
  }

translates to this at the wire:

image

SCK is running at 20MHz, but we still have half the time spent outside of this loop from UCG_COM_MSG_SEND_STR. If we remove ucg_com_SendString(ucg, 2, buff) and use this:

  for(ucg_int_t i = 0; i < ucg->arg.len; i++ ) {
    for(int j=0;j<2;j++)
      spi_send(SPI1, buff[j]); 
    ucg->arg.pixel.pos.x+=dx;
    ucg->arg.pixel.pos.y+=dy;
  }

Then things run twice as fast, with negligible delay between sending bytes:

image

This is the difference between refreshing the screen at ~11Hz to ~20Hz! I did some math, this shaves ~50 clock cycles from my STM32 running at 84MHz.

Fix

We add another member to struct _ucg_t, which is a struct, with pointers to functions for each of the cases from com_cb, like UCG_COM_MSG_POWER_UP, UCG_COM_MSG_POWER_DOWN etc:

struct _ucg_com_cb_funcs {
  void (*power_up)(ucg_t *, ucg_com_info_t *);
  void (*power_down)(ucg_t *);
  void (*delay)(ucg_t *, uint16_t);
  void (*change_reset_line)(ucg_t *, uint8_t);
  void (*change_cd_line)(ucg_t *, uint8_t);
  void (*change_cs_line)(ucg_t *, uint8_t);
  void (*send_byte)(ucg_t *, uint8_t);
  void (*repeat_1_byte)(ucg_t *, uint16_t, uint8_t);
  void (*repeat_2_bytes)(ucg_t *, uint16_t, uint8_t[2]);
  void (*repeat_3_bytes)(ucg_t *, uint16_t, uint8_t[3]);
  void (*send_str)(ucg_t *, uint16_t, uint8_t *);
  void (*send_cd_data_sequence)(ucg_t *, uint16_t, uint8_t[]);
};

struct _ucg_t {
  struct _ucg_com_cb_funcs com_cb_funcs;
  ...

ucg_Init can be updated to support this new struct (or we can have an alternative constructor).

This allows avoiding ucg_com_SendString altogether allowing things to be fast:

  for(ucg_int_t i = 0; i < ucg->arg.len; i++ ) {
    ucg->com_cb_funcs.send_byte(ucg, buff[0]);
    ucg->com_cb_funcs.send_byte(ucg, buff[1]);
    ucg->arg.pixel.pos.x+=dx;
    ucg->arg.pixel.pos.y+=dy;
  }

Benchmarks:

Delay between sending each pack of 2 bytes for a few cases:

Compatibility

ucg_Init can have its signature updated to accept struct _ucg_com_cb_funcs as well or we can add another constructor. We can also set defaults for these values at the constructor:

  ucg->com_cb_funcs.power_up = default_power_up;

and forward calls to the existing com_cb function:

static void default_power_up(ucg_t *ucg, ucg_com_info_t *ucg_com_info) {
  ucg->com_cb(ucg, UCG_COM_MSG_POWER_UP, 0, ucg_com_info);
}

This allows all display drivers to continue to work, but some drivers can start onboarding with this new system.

PR

I can write a PR for this implementation, if there's willingness to merge it.

fornellas commented 2 years ago

For those interested, I've given up on fixing ucglib due to lack of response on various PRs / issues, so I've solved the problems by writing a shiny new library https://fornellas.github.io/eglib/ which solves many of the architectural problems with ucglib.