Calculation of initial bounds for completion is too simplistic

joaotavora / eglot

A client for Language Server Protocol servers

GNU General Public License v3.0

2.27k stars 200 forks source link

Calculation of initial bounds for completion is too simplistic #402

Open nemethf opened 4 years ago

nemethf commented 4 years ago

This is branched off from #366, where @astoff writes:

(It is independent, but Eglot won't be able to complete "\ref{lorem ip", because (bounds-of-thing-at-point 'symbol) returns "ip" and not "lorem ip".)

(Maybe this should be a new issue, but let me add a (not so) quick side-remark. There are indeed a few places where one might want the completion text to include spaces, including cases unrelated to fuzzy completion. But even worse is the following situation: if you follow AUCTeX labeling scheme, you will get labels like sec:introduction for sections, fig:a-beautiful-picture for figures, etc. But then completion at \ref{sec:intr will fail, because the thing at point will be intr and not sec:intr, as it should. Of course the server returns all completions, but Eglot filters them out.)

The issue affects other serves as well. See test json-basic in eglot-tests.el. Commit f5151be46d9bcfcd3a524cfc3fc6c6a34587264f describes the decision behind the current behavior. But maybe we can find an alternative implementation that's also "cheap-to-run".

joaotavora commented 4 years ago

But maybe we can find an alternative implementation that's also "cheap-to-run".

If I understand this correctly, the problem is that the server's idea of "initial-bounds" might be all over the place (depending on the server, that is). There is no LSP interface for telling the LSP client what the server will be completing on. I think you once tried something @nemethf by running through all the completions possible beforehand and checking if all of them were TextEdit completions all targeting the same range. But not only is that blatantly inneficient, it's not generic enough (and I think even had some other conceptual problems). So the Emacs's bounds-of-thing-at-point is the next best thing. We don't know if will it match the server's idea of the bounds, but sometimes it will, maybe even most of the times. It's reasonably cheap to run, and is customizable at the major-mode level. If there's need to tweak it for a particular mode-server pair, don't think that customization should be Eglot's responsibility though. And, as I said, the whole approach is just plugging a hole in the LSP protocol.

nemethf commented 4 years ago

Sure, the language server protocol was designed differently than Emacs' completion system, but there are clients that found a way to cope with the protocol. So I do not think there is a hole in the protocol. (At least not in this regard.)

If there's need to tweak it for a particular mode-server pair, don't think that customization should be Eglot's responsibility though.

Who's responsibility should it be then? What could the latex-mode do? I think it should be the server and not the client that knows how to parse the language, do completion a given point, etc., so that the client could be totally dumb. Emacs's bounds-of-thing-at-point has to be taught about the syntax table of a given mode, a general LSP client shouldn't need this information.

I'm just thinking loud: would it possible to keep the current initial heuristics intact and, after the arrival of the first completion items from the server, derive bounds from the items, compare them with the initial one and somehow restart the completion process with readjusted bounds?

joaotavora commented 4 years ago

I do think there is a hole in the protocol, as I try to demonstrate in https://github.com/microsoft/language-server-protocol/issues/651 even for cases other than this one.

What you say is generally true, but the devil is in the details. Exactly how did those clients manage to "cope with the protocol"? Your final sentence is very ambitious, I wonder if it's worth it (it might be, I'm also thinking out loud).

general LSP client shouldn't need this information.

I agree, that's why I said it shouldn't be Eglot's responsibility. A good-enough solution here might be to make latex-mode do what most LSP latex servers (are there very many of them?) usually do.

nemethf commented 4 years ago

Exactly how did those clients manage to "cope with the protocol"?

It's hard for me to say because I've only seen animgifs of them. They are probably better at multithreading. Additionally, I guess they don't need a completion function that's cheap to run, because they support triggerCharacters (and commitCharacters).

A good-enough solution here might be to make latex-mode do what most LSP latex servers (are there very many of them?) usually do.

There are two of them: digestif and texlab. But I disagree, latex-mode (or Eglot instead) should do what other clients do and not what the servers do.

So what if the current completion-at-point remains a cheap function, but Eglot is extended with another completion function that runs only when the user explicitly triggers it and when the speed is not so important? Instead of company-capf, we could use a different backend company-eglot that tries to follow the specification more closely.

joaotavora commented 4 years ago

They are probably better at multithreading.

I don't understand. How is this an advantage in this specific regard?

Additionally, I guess they don't need a completion function that's cheap to run, because they support triggerCharacters (and commitCharacters).

Last time I looked we supported triggerCharacters too. Not sure what commitCharacters are. Anyway, I don't think price-per-call is the issue here, rather dealing with protocol inconsistencies in a way that isn't too complex, brittle and unmaintainable.

But I disagree, latex-mode (or Eglot instead) should do what other clients do and not what the servers do.

Regarding latex-mode, it can't by definition do what other LSP clients do, since it knows nothing about LSP. The only thing it knows about is the "symbol at point" or "thing to complete at point" and this knowledge should, if possible, match the server's, which is also an interested part in that language aspect. At least I don't see a good reason why it shoudn't, do you?

I think we need to take a step back. And maybe write a failing unit test for this. Exactly in which position of which latex source do problems arise? Exactly under what conditions does Eglot's "calculation of initial bounds", inherited directly from latex-mode's fail to match the servers?

To your second part, Eglot can't do what other clients do if we don't know what other clients do (and this is why I asked above, I am not against it in principle). If you want to explore that avenue, I think it's fair to first collect that information and then decide upon it.

Regarding your last proposal, Eglot must work with company-capf because that's the only shared interface for every completion framework, including, but certainly not limited to, company-mode. That said, I don't mean to stop you exploring an alternative company-eglot package, and it'd be interesting to see how you solve the problems there (and what other problems you come across).

astoff commented 4 years ago

The only thing it knows about is the "symbol at point" or "thing to complete at point" and this knowledge should, if possible, match the server's, which is also an interested part in that language aspect. At least I don't see a good reason why it shoudn't, do you?

The server is supposed to be smarter than the editor when it comes to its target language, so it is natural that they might disagree about what the symbol at point is. Of course this is not limited to TeX; another example I can image is completion within strings, say dictionary keys in Python. One might wish my_dict["my-k to complete to my_dict["my-key"]. The thing at point here really is "my-k but Eglot will assume it's k.

On the other hand, it seems unlikely to me that any server returns completions with different prefixes. My TeX server certainly doesn't do that. So I'd think it's reasonable to compute a tentative "thing at point, according to server" based just on the label or textEdit of the first completionItem; this should be good enough at least for the purposes of displaying the completion candidates. Then, if the user selects a candidate, one should look at that candidate's textEdit property to correct for any wrong assumptions.

protocol inconsistencies

By the way, one thing I just don't get is the filterText in the LSP spec. If each editor is allowed to filter based on its own idea of what the context is, then each editor should understand each language at least as well as the server. And if the server doesn't return only relevant completion items for the current context, then it's no better than an old-school TAGS file.

joaotavora commented 4 years ago

The server is supposed to be smarter than the editor when it comes to its target language

Indeed, and that is the whole point why Eglot users use eglot to delegate some management of their buffers to an LSP-speaking server. No argument there.

To state the obvious, Emacs users are also, well, using Emacs, so they also expect a set of functionality that might not be available in other languages, like the o ability to regexp match candidates or other functionality that are obscure to some. Much of this functionality is coded into the major-mode, along with indentation rules, syntax-highlighting, etc, etc (these last two are things that some LSP servers already do, but for which currently the major-mode still has the better solution).

Marrying these two expectations in the sometimes ungrateful job of Eglot, which isn't a very smart piece of software at all, it's just the messenger.

But this were just generalities. I'll adress the specific problem below.

, so it is natural that they might disagree about what the symbol at point is.

Not unless the server is using some other source of information that is unavailable to Emacs's runtime. In this case I don't think it is. But you might well disprove me, i.e. I suppose there could be a legitimate reason for disagreement. For example, the server could be correlating data in the actual "latex" runtime (if such a thing existed) and provide a somehow enhanced or smarter guess about what the current "thing-at-point" is.

An example out of SLIME/SLY, the Emacs Common Lisp IDEs. They do something very similar with Lisp macros: they know the objects at runtime and so they can provide the correct indentation rules. An offline Emacs will often mis-indent a common-lisp macro since it doesn't have the insight that the SLIME/SLY server counterparts do.

So he questions are, in my opinion:

how often does this happen? Does it happen in practice often enough? Do you have an actual use case that isn't merely academic?
would that case happen if the server and latex-mode agreed on what the "thing at point" is?
if there is indeed a very good reason for disagreement (such as the one I outlined above, or another) between what the "thing at point" is between the editor and the server, then we need to find an interface to let one communicate the semantics over to the other. The only LSP-abiding interface we have to do that was discovered by @nemethf some time ago and, while I commend him for the courage, I think it's excruciatingly baroque. But perhaps there are others. Or maybe, just maybe, we could convince for LSP to actually give a damn about editors other than VSC.

The thing at point here really is "my-k but Eglot will assume it's k.

It's not Eglot at all, it's python-mode. And it's not 100% certain that python-mode is at fault, and not the LSP server? I can ask: why won't the python server let me complete strings that are being used as dictionary keys? So, from this example, I still don't see a reason why they must disagree.

astoff commented 4 years ago

Not unless the server is using some other source of information that is unavailable to Emacs's runtime. In this case I don't think it is.

Well, Emacs's calculation of symbol-at-point, like syntax highlighting and so on, is a regexp-based heuristic, while LSP servers tend to do a deeper static analysis. Right?

how often does this happen? Does it happen in practice often enough? Do you have an actual use case that isn't merely academic?

In the LaTeX case, we have at least 3 practical examples: the fuzzy completion of \ref{lorem ip, non-fuzzy completion of \ref{sec:intr (as mentioned in the opening message of this issue), and completion of TikZ keyword arguments, since TikZ allow spaces in keywords. They fail because the server regards lorem ip respectively sec:intr as the completion prefix, while Emacs/latex-mode/Eglot regards ip respectively intr to be the prefix.

But I guess the Python example is even more compelling, isn't it? Here I am not sure things work as I described, since I'm extrapolating from the TeX case. But I believe completion of a string would fail if the string contents happens not to be a valid identifier (hence my example "my-key").

would that case happen if the server and latex-mode agreed on what the "thing at point" is?

In this case all is good. But should the server now try to guess what the editior is going to regard as the "thing at point"? I guess not.

It's not Eglot at all, it's python-mode.

Sure.

And it's not 100% certain that python-mode is at fault, and not the LSP server? I can ask: why won't the python server let me complete strings that are being used as dictionary keys? So, from this example, I still don't see a reason why they must disagree.

Again, to be clear, I've described (without testing) the behaviour I expect based on what happens with TeX. Namely: the server correctly computes my-key as a completion of my-k, but then Eglot discards that candidate because it doesn't match (symbol-at-point), which is k. Granted, the server could organize it's results around the editor's idea of what the prefix is, and send key instead of my-key as the candidate. I'm not sure this is a good idea.

then we need to find an interface to let one communicate the semantics over to the other

I think so. Either that, or the editor shouldn't at all attempt to filter candidates.

joaotavora commented 4 years ago

Well, Emacs's calculation of symbol-at-point, like syntax highlighting and so on, is a regexp-based heuristic, while LSP servers tend to do a deeper static analysis. Right?

Right, but do they in practice, in this situation? IOW, they might do that, and so might major modes which are not necessarily limited to regexp strategies (but often to "static" strategies, like most LSP servers).

the fuzzy completion of \ref{lorem ip

Thanks for restating the example. What I don't understand here is: why doesn't the server return the completion ipsum}? Or do you want refem to also complete to \ref{lorem ipsum}. Or maybe you just want loip to complete to lorem ipsum. Either way is fine by me, and I see no reason why latex-mode and your LSP server shouldn't agree on a common thing. Unless you want to flex-complete much more complicated patterns, like Salsrefem completing to See also \ref{lorem ipsum}., in which case latex-mode's authors would more likely object (and probably me as a user, too).

But I guess the Python example is even more compelling, isn't it? Here I am not sure things work as I described, since I'm extrapolating from the TeX case. But I believe completion of a string would fail if the string contents happens not to be a valid identifier (hence my example "my-key").

It might be more compelling, depending how the server(s) work. Maybe if it(they) matches(match) python-modes's view of the situation then there's no problem. Or maybe python-mode's view of the situation isn't as sophisticated at it could well be.

but then Eglot discards that candidate because it doesn't match (symbol-at-point), which is k

In this particular case, I think it's reasonable to do one of the following:

change python-mode to make symbol-at-point return "my-k" here;
change the server(s?) to return "key" as a completion instead.

Of course, as I said before, a way to let one side let the other side know what they think is the "thing at point" would solve all these problems.

One thing I haven't mentioned is that LSP allows, through a rather obscure use of the TextEdit features, to have different completions target different areas of the document (even theoretically completely unrelated ones). I don't think Emacs is anywhere near supporting that (and I don't think it's a brilliant idea to support that in the first place).

astoff commented 4 years ago

What I don't understand here is: why doesn't the server return the completion ipsum}?

Because it's a fuzzy match against "lorem ipsum", not anything shorter. But let me change your question so I can give the answer you are looking for.

When completing \ref{sec:intr, why is the candidate sec:introduction and not just introduction? In practice, either way should work, and my reason is kind of philosophic: ach completion has a type: macro, bibliography, etc. The completion at hand is a cross-reference label. And the label is sec:introduction, not introduction.

Of course, as I said before, a way to let one side let the other side know what they think is the "thing at point" would solve all these problems.

Like I said before, I totally agree. For the record, in case you decide to purse this with the LSP people, I believe two more pieces of information are missing from the completion method response: Whether the completion list is already filtered, and whether it's already sorted. (As I said, I don't see why any server wouldn't do both of these things, since otherwise it's just a glorified TAGS file. But it's not my protocol.) Currently my server needs workarounds to prevent the editor from resorting (e.g., fuzzy matches carefully sorted by score) and refiltering.

nemethf commented 4 years ago

I think we can enhance Eglot with the following compromise. Introduce two project-local variables with names like eglot-completion-bound-function and elgot-completion-bound-error-behavior.

The bound-function is bounds-of-thing-at-point by default, but there's another choice (prefetch-candidates), which calls proxies and calculates the bounds from that.

Once the completion items arrive from the server, Eglot can cheaply check whether its guess of the bounds was right. If the error-behavior is ignore, nothing happens. If it's warn, the default, Eglot warns the user and mentions the possibility of customizing the bound-function. If it's auto, it changes the bound-function from bounds-of-thing-at-point to prefetch-candidates and gives a different warning/error (like "variable changed, try again"). So the repeated eglot-completion-at-point will be successful (hopefully). If the bounds of the completion items are inconsistent, Eglot warns with ¯\_(ツ)_/¯ unless the error-behavior is ignore.

joaotavora commented 4 years ago

Because it's a fuzzy match against "lorem ipsum", not anything shorter.

But the end effect is the same right? Unless you are you trying to complete lips to lorem ipsum? Are you?

But let me change your question so I can give the answer you are looking for.

OK but after doing that, please also answer the actual question I gave you, since it tells me more about this issue.

In practice, either way should work

Except it doesn't. One of the ways doesn't work because LSP doesn't have a good mechanism to communicate the bounds of the thing it is trying to complete. It only has a very far-fetched mechanism for doing so that is ridden with conceptual problems (which @nemethf seems intent on tackling anyway). But until there is a decent solution, if you want to flex-match in Emacs with LSP you have to telepathically communicate these rules to latex-mode. That is to say, you have to agree on what "things at point" are.

FTR I have no idea on the kind of resistance you'll find from the latex-mode maintainers but supposedly, they are knowledgeable about latex too, and it should at least be interesting to agree on some parts (if not all).

Whether the completion list is already filtered, and whether it's already sorted. (As I said, I don't see why any server wouldn't do both of these things, since otherwise it's just a glorified TAGS file. But it's not my protocol.)

Agree on this, but seems to be orthogonal to the bounds issue.

joaotavora commented 4 years ago

Once the completion items arrive from the server, Eglot can cheaply check whether its guess of the bounds was right. If the error-behavior is ignore, nothing happens. If it's warn, the default, Eglot warns the user and mentions the possibility of customizing the bound-function. If it's auto, it changes the bound-function from bounds-of-thing-at-point to prefetch-candidates and gives a different warning/error (like "variable changed, try again"). So the repeated eglot-completion-at-point will be successful (hopefully). If the bounds of the completion items are inconsistent, Eglot warns with ¯_(ツ)_/¯ unless the error-behavior is ignore.

Doesn't sound too bad, give it a go if you want. The names need some work but the general idea is decent, on paper. "Decent" is this context would mean "fckuing great" in another context because of the yucky monster you are trying to tame. Let's see how concise and clear you can make it.

astoff commented 4 years ago

But the end effect is the same right?

Yes. All the following would have the same final effect if the user request a completion after "lorem ip":

Prefix "lorem ip", completion "lorem ipsum"
Prefix "ip", completion "ipsum"
Preifx "", completion "sum"

The last prefix choice totally canonical and would require no telepathy or other form of communication.

The point here is that the completion box should be placed below lorem ip, and not below ip (and neither just after "ip"), because the former is the actual "completion subject". Recall also that the lorem ipsum appeared in the context of fuzzy completion, https://github.com/joaotavora/eglot/issues/366, and the text lorem ip will be replace by the text 1, not by lorem 1.

joaotavora commented 4 years ago

The point here is that the completion box should be placed below lorem ip, and not below ip (and neither just after "ip"), because the former is the actual "completion subject".

In your opinion, is there any difference between the thing designated by "completion subject" and the thing designated by "thing at point"? Spoiler alert: in my mind there isn't.

The latter may also have other actions applicable on it, such as finding the definition, finding references, highlighting, evaluation in context, etc etc.

astoff commented 4 years ago

In your opinion, is there any difference between the thing designated by "completion subject" and the thing designated by "thing at point"?

No difference. The difference for me is that the server understands more of the language and might disagree with the editor's idea of what the thing at point is.

joaotavora commented 4 years ago

The difference for me is that the server understands more of the language and might disagree with the editor's idea of what the thing at point is.

Major modes can do a pretty nice job too (after all they people were programming with emacs for a pretty long time before "servers" came along). And if they don't, they can be convinced to. Why not try to at least initiate a communication with latex-mode.el's authors and point them to your server?

astoff commented 4 years ago

LSP doesn't have a good mechanism to communicate the bounds of the thing it is trying to complete

This clearly needs to be fixed. VSCode also has glitches related to this, see https://github.com/microsoft/language-server-protocol/issues/648 for an example.

Why not try to at least initiate a communication with latex-mode.el's authors

Might be a good idea, but I think the real problem here is the deficiency in LSP's completion interface.

It seems that the completion bounds/prefix is needed for 3 different things

Decide where to place the completion popup.
Filter candidates
Decide what to erase before inserting back the selected candidate. (Or, equivalently for non-fuzzy matches, decide how many characters at the end of the candidate remain to be inserted.)

Point 3 is never an issue for Eglot, at least as long as the server provides a textEdit for each candidate. Point 1 is not unimportant, but it is just a visual feedback to the user and doesn't affect the final result of the completion process. The real problem, in my view, is point 2. That's why I opened the LSP issue linked above Felicián's message..

Now, I don't know company and completion-at-point in detail, but do you agree that, if no filtering was needed on the editor side, the only problem that could arise from a wrong bound computation would be an UI glitch (point 1)?

joaotavora commented 4 years ago

Now, I don't know company and completion-at-point in detail, but do you agree that, if no filtering was needed on the editor side, the only problem that could arise from a wrong bound computation would be an UI glitch (point 1)?

Maybe, have to think about it. Maybe the two issues are indeed connected, yes.

joaotavora commented 4 years ago

Might be a good idea, but I think the real problem here is the deficiency in LSP's completion interface.

I've agreed to the second part many times. But it the meantime I hope you'll agree that it doesn't hurt to initiate that conversation.