At the bottom of http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ there's an alternative implementation with a pre-multiplied table, which saves a bit shift. So basically one instruction fewer in the icache and a few bytes fewer in the data cache. I microbenchmarked it, and it was 10% faster in a tight loop.
Since it's just copy-and-paste work, I think it would be a good idea to use it.
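
To show what "pre-multiplied" means, here is a minimal sketch on a toy DFA. It is not the UTF-8 table from the linked page: the 3-state table, `NUM_CLASSES`, and all function names are made up for illustration. The plain table stores state indexes, so every step has to scale the state by the row width; the pre-multiplied table stores row offsets directly, so the step is just an add and a load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy DFA: 3 states, 4 input classes. Not the UTF-8 DFA. */
enum { NUM_CLASSES = 4 };

/* Plain table: entries are state indexes (0, 1, 2). */
static const uint8_t plain[3 * NUM_CLASSES] = {
    0, 1, 2, 0,   /* transitions out of state 0 */
    1, 2, 0, 1,   /* transitions out of state 1 */
    2, 2, 2, 2,   /* state 2 is a sink */
};

/* Same DFA, but each entry is already multiplied by NUM_CLASSES,
 * i.e. it is the offset of the next state's row. */
static const uint8_t premul[3 * NUM_CLASSES] = {
    0, 4, 8, 0,
    4, 8, 0, 4,
    8, 8, 8, 8,
};

/* Plain step: needs a multiply (a shift when the width is a power of two). */
static uint32_t step_plain(uint32_t state, uint32_t cls) {
    assert(cls < NUM_CLASSES);
    return plain[state * NUM_CLASSES + cls];
}

/* Pre-multiplied step: the state is already a row offset, so no scaling. */
static uint32_t step_premul(uint32_t offset, uint32_t cls) {
    assert(cls < NUM_CLASSES);
    return premul[offset + cls];
}

/* Runs both variants over the same input classes and returns 1 if the
 * pre-multiplied run ends at the offset of the plain run's final state. */
static int same_final_state(const uint8_t *classes, size_t n) {
    uint32_t state = 0, offset = 0;
    for (size_t i = 0; i < n; i++) {
        state = step_plain(state, classes[i]);
        offset = step_premul(offset, classes[i]);
    }
    return offset == state * NUM_CLASSES;
}
```

The toy only shows the saved scaling step. Any constants compared against the state (accept/reject values, for example) have to be pre-multiplied as well, which is why swapping in the other version means taking its table and its constants together.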