I've been looking into improving the performance of realLength, and reducing our reliance on shortcuts. I have sought advice and benchmarked a few different approaches on stackexchange.
The long and the short of it is that it seems we can do better, but we also have some choices to make. By far the best approach is making a giant unboxed array the size of all unicode points, and performing an array lookup. This will improve performance for all characters (except for ASCII control characters), but involves a significant memory and set-up cost. On my system it requires about 368MiB of memory, and about 150ms set-up cost.
Is this worthwhile? If you're working on ASCII the set-up cost is paid off after about 150 million lookups, while for text without shortcuts the payoff will come after only about 6 million lookups. But we would get a huge savings in code complexity, with no more shortcuts needed at all.
There are other improvements that can be made as well, in particular writing the binary search tree directly, allowing it to be specialised for our use case. This would not give as dramatic a speedup, but may allow us to maintain ASCII performance and get away with fewer shortcuts.
I've been looking into improving the performance of realLength, and reducing our reliance on shortcuts. I have sought advice and benchmarked a few different approaches on stackexchange.
The long and the short of it is that it seems we can do better, but we also have some choices to make. By far the best approach is making a giant unboxed array the size of all unicode points, and performing an array lookup. This will improve performance for all characters (except for ASCII control characters), but involves a significant memory and set-up cost. On my system it requires about 368MiB of memory, and about 150ms set-up cost.
Is this worthwhile? If you're working on ASCII the set-up cost is paid off after about 150 million lookups, while for text without shortcuts the payoff will come after only about 6 million lookups. But we would get a huge savings in code complexity, with no more shortcuts needed at all.
There are other improvements that can be made as well, in particular writing the binary search tree directly, allowing it to be specialised for our use case. This would not give as dramatic a speedup, but may allow us to maintain ASCII performance and get away with fewer shortcuts.