Please optimise the SSD1306Ascii::write(const char* s)

John-Lluch commented 5 years ago

Please replace your existing code for the write(char*) function by this!. The use of strlen() for no reason in your original code just drives me bananas :-)

That's my suggested change:

size_t SSD1306Ascii::write(const char* s) {
   size_t i = 0;
   while ( s[i] ) write(s[i++]);
   return i;`
}

Keep the good work!!

greiman commented 5 years ago

I decided to remove write(const char*s) from the library. This results in using Print::size_t write(const char *str) for write() calls.

I tested a modified version of the Spi128x64 example.

here is the timed section of the example.

  uint32_t m = micros();
  oled.clear();  
  oled.println("Hello world!");
  oled.println("A long line may be truncated");
  oled.println();
  oled.set2X();
  oled.println("2X demo");
  oled.set1X();
  oled.print("\nmicros: ");
  oled.print(micros() - m);

First I ran it as is with println() for strings. It runs in 29532 μs, 3510 bytes of flash and107 bytes of RAM.

I then replaced the println() calls with print() and added a new line to the strings.

  uint32_t m = micros();
  oled.clear();  
  oled.print("Hello world!\n");
  oled.print("A long line may be truncated\n");
  oled.print('\n');
  oled.set2X();
  oled.print("2X demo\n");
  oled.set1X();
  oled.print("\nmicros: ");
  oled.print(micros() - m);

This version runs in 29328 μs, 3484 bytes of flash and107 bytes of RAM.

Next I replaced the print() calls with write() like this.

  uint32_t m = micros();
  oled.clear();  
  oled.write("Hello world!\n");
  oled.write("A long line may be truncatedn\n");
  oled.write('\n');
  oled.set2X();
  oled.write("2X demo\n");
  oled.set1X();
  oled.write("\nmicros: ");
  oled.print(micros() - m);

My write() runs in 29328 μs, 3540 bytes of flash and 107 bytes of RAM. My write takes the the same time as print().

Your write() runs in 29316 μs, 3514 bytes of flash and 107 bytes of RAM.

I then removed write(const char* s) from the library and put a using Print::write; in the library.

The result is 29388 μs, 3484 bytes of flash and 107 bytes of RAM.

So using print() is as good as using write() but uses a few bytes less flash. 29328 μs for print() vs 29316 μs for your write() is not worth 30 bytes of flash.

greiman commented 5 years ago

I updated the library as version 1.2.3.

This insures that the following Print class write() member functions are used.

size_t write(const char *str);
size_t write(const char *buffer, size_t size);
size_t write(const uint8_t *buffer, size_t size);

John-Lluch commented 5 years ago

Thanks for the change. I agree that the SSD1306Ascii::write(const char *s) function was indeed redundant. Although oddly enough the Print::write(const char *s) implementation also uses the despised strlen() function

greiman commented 5 years ago

I don't remember how my bad version of write(const char*) happened in SSD1306Ascii.

The Arduino Print use of strlen() is the best implementation. I am at least partially responsible for it. When I introduced SD libraries to Arduino, Print was very slow since all calls to print and write called single byte write. The overhead to setup for write is very high on some devices so Print was changed to call multi-byte write if possible.

Here is an example that shows the strlen() version is about ten times faster on SD cards.


#include "SdFat.h"

SdFat sd;
File file;
const char *str = "0123456789abcdefghijklmnopqrstuvwxyz\n";

void writeLoop(const char* s) {
  while (*s) file.write(*s++);
}

void writeStrlen(const char* s) {
  file.write(s, strlen(s));
}

void setup() {
  Serial.begin(9600);
  Serial.println("Type any character to start");
  while (!Serial.available()) {}
  if (!sd.begin(10)) {
    Serial.println("sd.begin failed");
    return;
  }
  if (!file.open("strlen.txt", O_WRONLY | O_CREAT | O_TRUNC)) {
    Serial.println("open strlen failed");
    return;
  }
  uint32_t m = micros();
  for ( int i = 0; i < 1000; i++) {
    writeStrlen(str);
  }
  m = micros() - m;
  Serial.print("strlen micros: ");
  Serial.println(m);
  file.close();

  if (!file.open("loop.txt", O_WRONLY | O_CREAT | O_TRUNC)) {
    Serial.println("open strlen failed");
    return;
  }
  m = micros();
  for ( int i = 0; i < 1000; i++) {
    writeLoop(str);
  }
  m = micros() - m;
  Serial.print("loop micros:   ");
  Serial.println(m);
  file.close();
}
void loop() {}

Here is the output:

Type any character to start strlen micros: 131840 loop micros: 1267168

John-Lluch commented 5 years ago

Hi Greiman. Well, I suspect that when implementing a derivate of the Printclass, we are supposed to override both the write(const uint8_t *buffer, size_t size) and the write(uint8_t ch) functions, in order to prevent the compiler to execute virtual function calls to the overriden single byte write.

So, I suppose that the implementation of the write(const char *str) in the Printbase class by using strlen and calling write(const uint8_t *buffer, size_t size) is just an attempt to avoid too many virtual calls to the single byte write function, which is still faster despite the use of strlen. (provided that the write(const uint8_t *buffer, size_t size) is implemented in the derived class, of course).

Anyway, the tradeoff is whether we want a slightly faster overall implementation or a slightly smaller code. That's most of the time a personal choice. I wonder if you could implement the write(const uint8_t *buffer, size_t size) as a standalone, highly optimised function, which I assume it's the most called one after all, and then possibly implement the single byte write function as a single byte call to the later, rather than implementing the single byte write alone. That would possibly give the best trade off between code size and performance.

During my professional days at software development (now I’m retired) I tended to implement all my virtual functions managing streams of data by passing byte arrays (including size) rather than single byte primitives, because in most cases that resulted in much faster code. In complex apps this approach helps to prevent millions of virtual calls that tend to grow exponentially as software complexity increases. That was for me just a rule that I followed all the time. But I acknowledge that in this particular case it may not be such a big improvement.

greiman commented 5 years ago

With SSD1306 it's not worth optimizing write(const uint8_t*, size_t). Writing one character involves sending many bytes over the slow I2C bus or in the case of my example above, the SPI bus at 8 MHz.

I did optimize use of the 32 byte I2C buffer in Wire for sending font bytes. This was a big win since each character becomes 5 to as many as 50 font bytes and each font byte becomes many I2C bytes if sent byte at a time.

The above example took about 29 ms to write about 60 characters. This is 500 μs per character so optimization of write(const char) or write(const uint8_t size_t) is not be worth doing.

One other thing. write(const char*) is often use with a literal string. The way Print is implemented strlen is not called.

The gcc compiler replaces strlen() with the size of the string since write(cons char*) is declared inline in the Print class.

    size_t write(const char *str) {
      if (str == NULL) return 0;
      return write((const uint8_t *)str, strlen(str));
    }

write("Literal") will become write("Literal", 7).

You can experiment with gcc here.

Here is an example for avr gcc 4.6.4:

#include <string.h>
size_t write(const void*, size_t);
inline size_t write(const char* s) {
    if (!s) return 0;
    return write(s, strlen(s));
}
int main() {
    write("Literal");
    write("again");
}

Produces:

.LC0:
        .string "Literal"
.LC1:
        .string "again"
main:
        ldi r24,lo8(.LC0)
        ldi r25,hi8(.LC0)
        ldi r22,lo8(7)
        ldi r23,hi8(7)
        rcall write(void const*, unsigned int)
        ldi r24,lo8(.LC1)
        ldi r25,hi8(.LC1)
        ldi r22,lo8(5)
        ldi r23,hi8(5)
        rcall write(void const*, unsigned int)
        ldi r24,lo8(0)
        ldi r25,hi8(0)
        ret

So strlen("Literal") is replaced with a load of a constant.

greiman / SSD1306Ascii

Please optimise the SSD1306Ascii::write(const char* s) #55