golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

x/tools/gopls: express completion scoring in terms of down-ranking factors #63282

Open findleyr opened 1 year ago

findleyr commented 1 year ago

While investigating #62665, I noticed that there were lots of opportunities to optimize deep completion (see also #63263). However, one obstacle to optimization is the fact that completion candidate score adjustment occurs in multiple places, and is a mixture of multiplication by numbers >1 and <1, explicit setting of the score, and subtraction.

It's clear that a lot of thought and experimentation went into the current scoring, and it works well overall. However, the way scoring is expressed makes it hard to reason about and hard to refactor (because operations don't commute), and therefore hard to improve.

Additionally, it means that we can never determine that the score is already too low and stop doing work. For example, in one of our kubernetes benchmarks we spend approximately 80ms in type checking and "normal" completion combined, and then another 100-200ms doing deep completion. All of that extra work produces at most 3 additional candidates. It is likely that by doing cheaper scoring operations first, we could immediately invalidate most of the candidates without having to do more expensive operations such as fuzzy matching and type inference.

I therefore believe that a first step toward improving completions should be to enforce the following rule: scores only go down, and only by multiplicative down-ranking factors.

I think we can make this change in the existing codebase without changing anything else: each scoring operation needs to be recalibrated to express its outcome using a down-ranking factor.
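As a hypothetical sketch of why this rule matters (function names and factor values here are illustrative, not gopls's actual API): if every adjustment is a factor in [0, 1], the running score can only decrease, so as soon as it drops below the worst score in the current top-N we can skip the remaining, more expensive steps entirely.

```go
package main

import "fmt"

// scoreWithCutoff applies cheap down-ranking factors (e.g. depth
// penalties) before expensive ones (e.g. fuzzy matching, type
// inference). Because every factor is in [0, 1], the score never
// increases, so we can abandon a candidate the moment it falls below
// cutoff. This is a sketch, not gopls's real scoring pipeline.
func scoreWithCutoff(base float64, cheap, expensive []float64, cutoff float64) (float64, bool) {
	score := base
	for _, f := range cheap {
		score *= f
		if score < cutoff {
			return 0, false // pruned before any expensive work
		}
	}
	for _, f := range expensive {
		score *= f
		if score < cutoff {
			return 0, false
		}
	}
	return score, true
}

func main() {
	// A candidate three levels deep is pruned by cheap depth penalties
	// alone, without ever paying for the expensive factor.
	score, ok := scoreWithCutoff(1.0, []float64{0.5, 0.5, 0.5}, []float64{0.3}, 0.2)
	fmt.Println(score, ok)
}
```

With subtraction or explicit setting in the mix, no such cutoff is sound, because a later operation could raise the score back above the threshold.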

Unfortunately, there is a lot of knowledge embedded in the current scoring, and therefore making this change naively is likely to result in inferior completions in the short term. @pjweinb has been looking at some large scale testing that may be useful for calibrating completion results, and may help us make this change without regression (or, perhaps, while simultaneously improving results).

CC @adonovan @muirdm @heschi, who have experience with the completion code. WDYT?

muirdm commented 1 year ago

I'd be curious to see how often the chosen completion candidate is a deep candidate. Do we have any data on that? It would also be good to know at what point during our search we typically find the winning candidate (e.g. do we normally find it during the first 10ms of deep searching?).

we can never determine that the score is already too low and stop doing work

Stopping earlier for leaf objects makes sense, but for intermediate objects like structs it is hard to stop early since one of the struct fields could be the needle in the haystack (and the intermediate objects are what explode the search space). Whether it is score or something else, using heuristics to prune/limit the search space would certainly help a lot.
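One way to reconcile "stop early" with "the needle may be a struct field" is to descend through low-scoring intermediate objects anyway, but under a decaying budget, so the search space stays bounded. A minimal sketch of that heuristic (the types, penalty, and thresholds are invented for illustration):

```go
package main

import "fmt"

// member stands in for a completion candidate that may itself have
// children (e.g. a struct whose fields are also candidates).
type member struct {
	name     string
	score    float64 // this object's own score
	children []member
}

// collect walks candidates depth-first. An intermediate object that
// scores too low to be reported is still descended, because one of its
// children could be the needle in the haystack; each level multiplies a
// depth penalty into the budget, so exploration eventually stops.
func collect(m member, depthPenalty, budget float64, out *[]string) {
	if budget < 0.1 { // heuristic floor: abandon this subtree
		return
	}
	if m.score*budget >= 0.1 {
		*out = append(*out, m.name)
	}
	for _, c := range m.children {
		collect(c, depthPenalty, budget*depthPenalty, out)
	}
}

func main() {
	// pod.Spec scores too low to report, but its child is still found.
	root := member{name: "pod", score: 0.2, children: []member{
		{name: "pod.Spec", score: 0.1, children: []member{
			{name: "pod.Spec.NodeName", score: 0.9},
		}},
	}}
	var got []string
	collect(root, 0.5, 1.0, &got)
	fmt.Println(got)
}
```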

therefore making this change naively is likely to result in inferior completions in the short term

Hopefully most of the ad hoc score adjustments have @rank tests. There will probably be a lot of noisy failures from the other completion tests that expect an exact list of candidates, but those could perhaps be updated to use @rank to test the important relative ordering and otherwise not care about the order.

findleyr commented 1 year ago

@muirdm thanks for the response. Responses inline:

I'd be curious to see how often the chosen completion candidate is a deep candidate. Do we have any data on that?

No! Until very recently (i.e. the last two weeks) we had no telemetry in gopls that could measure these things across a larger sample size. However, we are adding opt-in telemetry in the next release, and have already instrumented completion latency (https://github.com/golang/go/issues/62665#issuecomment-1730125731), which I believe would have immediately highlighted the regression of #62665. I think we can also instrument selected completion items (provided the instrumentation contains no information other than metadata such as "is deep").

I think we need to gather data from real usage, because synthetic usage (or even the Go team's usage) tends to be very skewed.

Stopping earlier for leaf objects makes sense, but for intermediate objects like structs it is hard to stop early since one of the struct fields could be the needle in the haystack

Indeed. We always include all first-level struct fields (that was the "fix" that led to the regression of #62665). However, you are right that we may want to continue searching children of the current candidate even if the candidate is excluded. We'd have to refactor to more cleanly separate searching from scoring.
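A minimal sketch of that separation (the types and names are illustrative, not the gopls implementation): traversal becomes a pure enumeration, and scoring is a callback applied to each candidate, so down-ranking or excluding an intermediate object has no effect on whether its children are visited.

```go
package main

import "fmt"

// object stands in for a completion candidate with child candidates.
type object struct {
	name     string
	children []object
}

// enumerate performs the search only: it visits every candidate,
// unconditionally descending into children. Whether a candidate is
// reported is the scorer's decision, made in the callback.
func enumerate(o object, visit func(object)) {
	visit(o)
	for _, c := range o.children {
		enumerate(c, visit)
	}
}

func main() {
	root := object{name: "s", children: []object{
		{name: "s.internal", children: []object{{name: "s.internal.Match"}}},
	}}
	// Even if a separate scoring pass would exclude "s.internal"
	// itself, its child is still seen by the search.
	var visited []string
	enumerate(root, func(o object) { visited = append(visited, o.name) })
	fmt.Println(visited)
}
```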

Hopefully most of the adhoc score adjustments have @rank tests.

Yes, there are a good number of @rank tests, though I think we should generate more before starting to tweak the scoring.

There will probably be a lot of noisy failures from the other completion tests that expect an exact list of candidates, but those maybe can be updated to use @rank to test important relative ordering, and otherwise not care about the order, or something.

Indeed, this is on my list of things to do. I've been porting the completion tests from the old marker framework to the new marker framework, and was struck by how heavyweight the @item annotations are. In many cases, we only care about the existence and rank of candidates.