rivo / uniseg

Unicode Text Segmentation, Word Wrapping, and String Width Calculation in Go
MIT License

Questions about possible PR for performance improvements #10

Closed: dchapes closed this issue 2 years ago

dchapes commented 3 years ago

Before I spend further time cleaning it up (it currently works well for me), or worse, just fire off a pull request as-is, I'd like to know whether you would be open to some performance improvements that change the internals of the Graphemes type¹ and relegate the existing grTransitions map to being used only at startup, to generate a map of all the possible transitions². If you are open to it, do you have any preferences for such a PR (over and above the tview contributing guide, which you've mentioned applies here)? For example, would you prefer such a PR to be a single all-in-one commit, or split piece by piece into multiple commits, each changing one logical piece at a time with a commit message describing that step in detail (which could make code review easier)?

I have code that makes the above changes (plus some other, simpler ones) which together improve performance significantly on my (admittedly still simple³) benchmark test cases: benchstat reports roughly -45% CPU and -65% allocations. The only performance regression appears to be the (hopefully rare) case of client code doing repeated Reset/Runes calls on the same Graphemes object; i.e. something like the following, when the outer loop count is high, is ~10% slower and does more allocations (one small allocation per Runes call vs. two large ones per NewGraphemes call):

for g := uniseg.NewGraphemes(str); someCondition; g.Reset() {
    for g.Next() {
        doSomething(g.Runes())
    }
}

Also, I'm cognisant of the valid concern that optimisation should not (usually) come at the cost of readability, complexity, or maintainability. I personally find that the changes I have mostly improve readability (though that is, of course, subjective) and improve, or at least don't hurt, maintainability.

Let me know what you think, thanks!


¹ Replacing the constructed codePoints and indices rune slices with just the original string. I can only imagine this being an issue if you have ideas for future features or changes that rely on the existing pre-built codePoints or indices slices, e.g. something like randomly addressing grapheme clusters by index rather than the current sequential access. (By the way, a side effect of this is that the Runes method would no longer return a sub-slice of internal state that the client can, but shouldn't, modify.)
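
For illustration only, the rough shape of what this means (hypothetical names, not the actual code): the iterator keeps just the original string plus byte offsets, and Runes builds a fresh slice on demand.

type graphemeIter struct {
    str        string // original input; no pre-built codePoints/indices slices
    start, end int    // byte offsets of the current grapheme cluster
}

// Runes allocates a fresh slice on every call, so callers get a private
// copy rather than a view into the iterator's internal state.
func (g *graphemeIter) Runes() []rune {
    if g.start >= g.end {
        return nil
    }
    return []rune(g.str[g.start:g.end])
}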

² Currently my change is in the form of:

var grAllTransitions = func() map[…]… {
    var grTransitions = … existing map …
    for each grXXX, prXXX combination {
        … make an entry using existing code logic to find transition …
    }
    return new map
}()

(Here grAllTransitions is a package-global map; grTransitions is a local variable, so it is either stack allocated or garbage collected after initialisation, partially mitigating the increased package memory requirements.) Next() then just does a simple lookup in grAllTransitions, avoiding the run-time logic of picking between entries. Currently this turns the 30-entry map into a 165-entry map, but each entry is slightly smaller since it can exclude the rule number; overall, with some other type changes, the total map size goes from ~2.7 to ~10 KB on amd64. I know the properties code has a comment indicating that it only covers a subset of Unicode properties, so a possible concern would be if you intend or expect to need many more properties (or states) than are currently in use (11 states × 15 properties = 165 entries). An alternative with the same performance benefit would be to move the grTransitions map to a source file used only by go generate, which would generate a static version of grAllTransitions. That would avoid any package initialisation code, at the cost of go generate complexity.
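
Expanded into a compile-ready form, the sketch looks roughly like this (the sparse table contents and the rule lookup are placeholders standing in for the existing code, not the actual change):

// grAllTransitions is populated once, at package init, so Next() can do a
// single map lookup with no per-call fallback logic.
var grAllTransitions = func() map[[2]int]int {
    // Local to this function, so it can be collected once the dense table
    // has been built.
    sparse := map[[2]int][3]int{ /* the existing ~30 grTransitions entries */ }

    // resolve stands in for the existing run-time logic that falls back
    // through the rules when there is no exact (state, property) entry.
    resolve := func(st, pr int) int {
        if tr, ok := sparse[[2]int{st, pr}]; ok {
            return tr[0]
        }
        return 0 // placeholder for the rule fallback
    }

    const numStates, numProperties = 11, 15 // 165 entries in total
    all := make(map[[2]int]int, numStates*numProperties)
    for s := 0; s < numStates; s++ {
        for p := 0; p < numProperties; p++ {
            all[[2]int{s, p}] = resolve(s, p)
        }
    }
    return all
}()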

³ Exact benchmark values depend heavily on the input string length; one TODO item I have is to add benchmarks for short, medium, and long inputs, as well as inputs ranging from plain ASCII through to all multi-rune grapheme clusters. Any suggestions on sources of more representative inputs to benchmark against? I'm currently just ramming together the original field of testCases[10:20] as a single input to the benchmarks :frowning_face: (i.e. a single input of 139 bytes, 44 runes, 18 grapheme clusters). Also, the current benchmark results linked to above are effectively doing:

BenchmarkGraphemeXXX/New:   benchLoop { NewGraphemes(s);            for g.Next() { g.XXX() } }
BenchmarkGraphemeXXX/Reset: NewGraphemes(s); benchLoop { g.Reset(); for g.Next() { g.XXX() } }
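
Spelled out as real benchmark functions, those two shapes would look roughly like this (a sketch; benchInput and the sink variable are my placeholders, not the actual benchmark code):

package uniseg_test

import (
    "testing"

    "github.com/rivo/uniseg"
)

// benchInput is a placeholder for the concatenated test-case input described above.
var benchInput = "placeholder input string"

var sink []rune // keeps the compiler from optimising the inner loop away

func BenchmarkGraphemesNew(b *testing.B) {
    for i := 0; i < b.N; i++ {
        g := uniseg.NewGraphemes(benchInput)
        for g.Next() {
            sink = g.Runes()
        }
    }
}

func BenchmarkGraphemesReset(b *testing.B) {
    g := uniseg.NewGraphemes(benchInput)
    for i := 0; i < b.N; i++ {
        g.Reset()
        for g.Next() {
            sink = g.Runes()
        }
    }
}
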
rivo commented 3 years ago

Before I spend further time cleaning it up (it currently works well for me), or worse, just fire off a pull request as-is, I'd like to know whether you would be open to some performance improvements

Yes, I think users would greatly benefit from performance improvements to this package.

If you are open to it, do you have any preferences for such a PR

Unless it's like 20 new code files to review, I'm ok with it coming in in one piece. But no "rough drafts", please; it should already be in a good state from your point of view. Generally, your code should include enough comments to understand what's going on. I should not have to rely on commit comments to understand your code. Check the existing code (e.g. the Next() function) for an example.

Be prepared, however, for this to be a long procedure. As mentioned quite a few times over at tview, I can only take care of my open source projects in my free time, of which I don't have much. Replies might take a while to come in, especially when a reply requires me to read code. (And I don't usually merge PRs without reviewing them.)

Lastly, we need to remain backwards compatible. In this case, it would mean the Runes() method can't change. I know it's used in a few projects (including tview and go-runewidth) where the individual runes are needed. Please let me know how this requirement affects your changes.

dchapes commented 3 years ago

I'm ok with it coming in in one piece. Generally, your code should include enough comments to understand what's going on. I should not have to rely on commit comments to understand your code.

Absolutely. The idea of multiple commits was mostly to make it easy to benchmark the individual changes, in case some are deemed to have a questionable cost:benefit ratio. For example, in addition to the two big changes mentioned already, I added state and prop types for the gr… and pr… constants instead of the current map[[2]int][3]int. This improves type safety (although only for a package-internal type), and I found a small additional performance benefit of ~2-3% (if memory serves) from making the types uint32 rather than int (e.g. type state uint32 and type prop uint32). That makes sense, since they're used repeatedly to index the map via what's now effectively a struct{uint32, uint32}, so Go likely uses a 64-bit map/hash optimisation internally.
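
As a rough sketch of that type change (names hypothetical, and the value type simplified):

type state uint32
type prop uint32

// transitionKey is effectively a struct{uint32, uint32}, so the whole key
// fits in 64 bits.
type transitionKey struct {
    s state
    p prop
}

// allTransitions would be populated at init, as sketched under footnote ² above.
var allTransitions map[transitionKey]state // value type simplified for this sketch

func nextState(s state, p prop) state {
    return allTransitions[transitionKey{s, p}] // a single lookup with a 64-bit key
}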

But no "rough drafts" please, it should be in a good state from your point of view already.

Absolutely. The only changes to the PR code I'd expect would be matters of taste (as with #11) or requests to undo or leave out some of the minor performance changes.

Be prepared, however, for this to be a long procedure.

That doesn't bother me, as long as I'm aware of it. As I said, the changes already work for me, and all my Go code already uses them via a replace directive in my "global" go.mod. I'll wait to revisit/submit until #11 is finalised one way or the other, so the PR can be based on whatever happens there.
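
(For context, the replace directive in question looks roughly like this; the module path, version, and local checkout path are just examples:)

module example.com/mytools

go 1.17

require github.com/rivo/uniseg v0.2.0 // version is illustrative

// Point the build at a local checkout containing the changes.
replace github.com/rivo/uniseg => ../uniseg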

Lastly, we need to remain backwards compatible. In this case, it would mean the Runes() method can't change. I know it's used in a few projects (including tview and go-runewidth) where the individual runes are needed. Please let me know how this requirement affects your changes.

To be clear, the API and its "expected" semantics don't change at all. The only difference is that something users currently can do, but shouldn't, would no longer have a side effect.

That is, currently a user can modify the contents of the returned []rune, and since it's a sub-slice of Graphemes.codePoints such changes can affect it. In particular, it's dangerous if a user appends to the slice before the next g.Next call (by the way, this part could be prevented by returning g.codePoints[g.start:g.end:g.end]); but any changes the caller makes to the slice contents will also affect the results of processing after a Reset. Since my optimisations defer the []rune creation until it's needed by a caller of Runes, each caller effectively gets their own private rune slice they can modify however they want without affecting anything else. I'd hope no one is modifying the slice and expecting it to affect the underlying state of Graphemes.
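
For illustration, a standalone example (not uniseg code) of why that three-index ("full") slice expression helps:

package main

import "fmt"

func main() {
    backing := []rune("héllo, wörld")

    // Two-index sub-slice: its capacity extends to the end of backing, so
    // an append can overwrite backing[5] and beyond.
    view := backing[2:5]
    _ = append(view, 'X') // clobbers backing[5]

    // Three-index sub-slice: capacity equals length, so append must copy
    // to a new array and backing is left untouched.
    safe := backing[2:5:5]
    _ = append(safe, 'X') // allocates; does not touch backing

    fmt.Println(string(backing)) // "hélloX wörld": the ',' was overwritten by the first append
}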

rivo commented 3 years ago

Referring to https://github.com/rivo/uniseg/pull/15#issuecomment-907112800, I think you should go ahead and submit a PR so we can start to work on this. Since your changes appear to overlap with #15, it's probably worth thinking about merging #15 first and then applying your changes? Or would that interfere with what you had in mind?

In any case, it would be best to just see what you came up with, then we'll take it from there.

rivo commented 2 years ago

You may have noticed that this package has gotten a major update, including new functions that can replace the Graphemes class in most if not all cases. (The Graphemes class is still there for backwards compatibility and for convenience to those users who don't care much about performance.)

The benchmarks show that the new functions are much faster than using the Graphemes class. I think this is due to the allocation of the class and its member variables, as well as the cost of accessing them. The functions (e.g. Step()) don't allocate anything and can also be inlined by the compiler (I haven't checked whether that happens, but the benchmarks suggest it).

The downside of using the new functions is that they expose the state of the internally used finite-state machine. But users don't really need to care: the documentation provides plenty of examples of how to make it work. It's only a very small price to pay.
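
For anyone landing here later, the basic loop with the new API looks roughly like this (a sketch based on the package documentation; the boundary-information return value is ignored here):

package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    b := []byte("Café, 世界!")
    state := -1 // the caller now carries the FSM state between calls
    var cluster []byte
    for len(b) > 0 {
        cluster, b, _, state = uniseg.Step(b, state)
        fmt.Printf("%q\n", string(cluster))
    }
}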

I would consider this issue closed then.