keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web
https://keyman.com/

feat(web): pred-text performance enhancements #10414

Open jahorton opened 8 months ago

jahorton commented 8 months ago

After the other things already noted... the last thing I can presently see is that certain properties of search nodes that could be computed from their predecessor nodes... aren't - they're rebuilt from scratch instead. Yes, the arrays involved are typically small... but as word size grows, a word of M letters needs O(M^2) total time to build those properties across all the steps taken to reach it: O(M) work per step, M times. Dropping this to O(M) total is feasible: O(1) work per step, M times. This "O(M^2)" cost likely only surfaces as a 'larger coefficient' on the true runtime complexity of each correction-search step. Reducing this "coefficient" could still be a decent improvement, though.

This is in reference to the following SearchNode properties:

  • knownCost, inputSamplingCost (and their sum currentCost)
  • mapKey (used to prevent re-running nodes that were already reached via a different path)
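
To make the rebuild-vs-extend distinction concrete, here is a minimal sketch. The property names match the list above, but the node layout and helper functions are illustrative assumptions, not the engine's actual code:

```ts
// Hypothetical sketch; field names mirror the properties listed above, while the
// surrounding structure is assumed for illustration.
interface SearchNodeSketch {
  parent?: SearchNodeSketch;
  // Per-step data for the edge that produced this node:
  edgeCost: number;          // cost contributed by this step alone
  edgeSamplingCost: number;
  edgeKeyPart: string;       // e.g. the matched input/output pair for this step

  // Derived properties (currentCost would be knownCost + inputSamplingCost):
  knownCost: number;
  inputSamplingCost: number;
  mapKey: string;
}

// O(M) per node: re-derives everything by walking the whole chain back to the root.
function rebuildFromScratch(node: SearchNodeSketch): void {
  const chain: SearchNodeSketch[] = [];
  for (let n: SearchNodeSketch | undefined = node; n; n = n.parent) {
    chain.unshift(n);
  }
  node.knownCost = chain.reduce((sum, n) => sum + n.edgeCost, 0);
  node.inputSamplingCost = chain.reduce((sum, n) => sum + n.edgeSamplingCost, 0);
  node.mapKey = chain.map(n => n.edgeKeyPart).join('');
}

// O(1) per node: extend the parent's already-computed values instead.
function buildIncrementally(node: SearchNodeSketch): void {
  const p = node.parent;
  node.knownCost = (p?.knownCost ?? 0) + node.edgeCost;
  node.inputSamplingCost = (p?.inputSamplingCost ?? 0) + node.edgeSamplingCost;
  node.mapKey = (p?.mapKey ?? '') + node.edgeKeyPart;
}
```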

The proportion of code affected is a bit hard to discern on the surface of things; the impact is somewhat scattered. For a loose estimate, I believe that this affects no more than 1/5 of the current correction-algorithm runtime. The 1/5 figure gives us an absolute maximum speedup of 25% if we could completely eliminate such sections (1 / 0.8, as per Amdahl's law)... and only an 11.1% speedup (1 / 0.9) if we merely halve their runtime. Quartering their runtime: 1 / 0.85 => a 17.6% speedup, and I'd be quite surprised if we could do better than that. It's certainly not nothing, but I'm pretty sure we're entering the realm of limited ROI for further optimization pursuits.
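
For reference, those percentages are just Amdahl's law with an affected fraction p = 0.2 and a per-section speedup factor s; nothing below is specific to Keyman:

```latex
% Overall speedup S when a fraction p of the runtime is sped up by a factor s:
S(s) = \frac{1}{(1 - p) + p/s}, \qquad p = 0.2

% Eliminate the affected sections entirely (s \to \infty):
S = 1 / 0.8 = 1.25 \quad (25\% \text{ faster})

% Halve their runtime (s = 2):
S = 1 / (0.8 + 0.1) \approx 1.111 \quad (11.1\% \text{ faster})

% Quarter their runtime (s = 4):
S = 1 / (0.8 + 0.05) \approx 1.176 \quad (17.6\% \text{ faster})
```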

Originally posted by @jahorton in https://github.com/keymanapp/keyman/issues/10127#issuecomment-1880586159

Reference this and other comments in #10127 for more details.

jahorton commented 3 months ago

Profiling the current state of predictive text from 17.0 on an SM-T350:

Far and away, the correction search uses most of the predictive-text runtime when significant typos occur. getBestMatches is the primary entry-point for correction searching.

Profile 1: `getBestMatches` breakdown

A first-level breakdown for getBestMatches:

(profiler screenshot: first-level breakdown of getBestMatches)

From there, the biggest costs are at the following points:

The thing is, the big, expensive operations there... boil down to iteration + O(1) conditionals. We're just iterating over that much data trying to find corrections that work. That iteration is done via synchronous generators (partly due to the trie's structure), and as best I can tell, the back-and-forth between the generator method and the iteration that consumes its results seems to be the biggest runtime contributor.
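
Roughly, the hot pattern looks like the sketch below. `children()` is the generator member mentioned further down on `LexiconTraversal`; the node and edge types here are simplified stand-ins, not the real engine types:

```ts
// Simplified stand-in for the traversal interface.
interface TraversalSketch {
  children(): Generator<{ char: string; traversal: TraversalSketch }>;
}

// The hot loop: for each search node expanded, iterate every child of the
// current trie node and do an O(1) check per entry. The work per item is tiny;
// the generator suspend/resume on every `yield` is the overhead in question.
function buildEdgesSketch(
  traversal: TraversalSketch,
  costOf: (char: string) => number
) {
  const edges: { char: string; cost: number }[] = [];
  for (const child of traversal.children()) {   // resumes the generator once per child
    const cost = costOf(child.char);            // O(1) conditional / lookup
    if (cost < Infinity) {
      edges.push({ char: child.char, cost });
    }
  }
  return edges;
}
```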

The parts discussed in the base description appear comparatively small next to this, so I believe optimizing on that basis is extremely low-priority. The two parts I can identify at this time:

These two sum up to about 100ms total out of 2070ms. They're not completely negligible, and are likely easy to update... but the returns are pretty marginal.

If anything, finding a way to make iteration through the dictionary more efficient would yield better returns. It is rather surprising just how high the accumulated costs are for that in comparison to the other components. We generally do iterate over all the children within the correction-search process, so building a full array and returning that may be notably better.

https://stackoverflow.com/a/70672133 suggests that simply stashing each returned value into an array, in place of each yield, could possibly double the speed of such sections.
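
Concretely, that suggestion amounts to a refactor along these lines (illustrative only; the real `children()` walks the trie's own structure rather than a plain map):

```ts
interface ChildEntrySketch {
  char: string;
  traversal: unknown;   // stand-in for the child's own traversal object
}

// Generator form: the caller's loop suspends and resumes this function once per child.
function* childrenGen(entries: Map<string, unknown>): Generator<ChildEntrySketch> {
  for (const [char, traversal] of entries) {
    yield { char, traversal };
  }
}

// Array form: each `yield` becomes a push into a local array, which is returned
// whole; the caller then iterates a plain array with no generator machinery.
function childrenArr(entries: Map<string, unknown>): ChildEntrySketch[] {
  const result: ChildEntrySketch[] = [];
  for (const [char, traversal] of entries) {
    result.push({ char, traversal });
  }
  return result;
}
```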

jahorton commented 3 months ago

I took the time today to experiment with this and see whether replacing the generator component of LexiconTraversal - its .children() member - might help. I didn't get significant results from the attempt, though - it doesn't appear to make much difference, likely because the generator overhead is small compared to certain aspects of Trie navigation.

If there were a noticeable effect from changing the LexiconTraversal setup, we'd expect to see strong differences in buildInsertionEdges and buildSubstitutionEdges. The former was notably smaller, but the latter was notably larger. It's possible that this was just random variation, since the profiling run was only about 15-18 seconds for each of the two runs. I did notice a much larger value in a higher-level "total time"... but that was largely due to a "V8.StackGuard" entry, which isn't something we're responsible for.

I used the Android app on an SM-T350, aiming to type the following string both times: tesring out predictions without the geneeator function. (Note: the typos there are deliberate, since correcting them is part of what predictive text is meant to do.)