scottming opened this issue 1 year ago
Awesome context. Two comments:

`:elixir_tokenizer.tokenize` is private API. If you feel you need to resort to it, open up an issue and we will expose the functionality you need somehow. I think in this particular case, `cursor_context` may be a better fit (or a combination of both). If not, we will be glad to expand either.

The compilation tracer should also run for script files; you can add the tracer using `Code.put_compiler_option`. Similarly, you can tweak the `parser_options` to include range information for blocks. We will be glad to bake range information for tokens into the compiler as well (but those are typically easy to compute from the tokens themselves). Once again, glad to add what is necessary to Elixir to make your life easier. :)
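For readers following along, here is a minimal, hedged sketch of what registering a tracer might look like; the `IndexTracer` module name and the printed format are invented for illustration:

```elixir
# Hypothetical tracer module; the name and output format are invented here.
defmodule IndexTracer do
  # The compiler calls trace/2 with an event and the current Macro.Env.
  def trace({:remote_function, meta, module, name, arity}, env) do
    IO.inspect({env.file, meta[:line], meta[:column], module, name, arity},
      label: "remote call")

    :ok
  end

  def trace(_event, _env), do: :ok
end

# Register the tracer and ask the parser for column metadata, as suggested.
Code.put_compiler_option(:tracers, [IndexTracer])
Code.put_compiler_option(:parser_options, columns: true)

# Compiling any code (including a script file) now emits trace events.
Code.compile_string("""
defmodule Sample do
  def run, do: String.trim("  hi  ")
end
""")
```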
@scottming I've read your comment a couple of times now, and here's how I parse it:
Your solutions are: for the indexing operation, we should use the compile tracer, and the on-disk format should be SCIP.

I don't have a problem with using the compile tracer to build the index, as it provides a lot of information, and I think it would be comprehensive except for code in the standard library and Erlang code, which is a rather large omission. Going to the definition of a stdlib function is pretty important, as is seeing references to stdlib functions. This needs to be solved. Maybe a combined approach (BEAM file reading for the stdlib and the compile tracer for everything else) would be helpful here.
I've read both the SCIP and LSIF documentation, and to me, they're more about sharing metadata about an index rather than creating an on-disk representation of an index. Mentally, you can think of this like Postgres or MySQL having their database dump formats in SQL, which is human readable, but tells us nothing about the inner workings of the database, and allows the inner workings to change.
A project like Lexical has different needs than either Sourcegraph or the LSIF framework; the goals of both projects are about creating a format that can be used to create an index, rather than making a format that is an index. It seems much easier to me, for Lexical's purposes, to create an index in memory in ETS and then serialize that to disk via DETS. Using OTP-native tooling will allow us to iterate a lot more quickly than bringing up nascent tooling for either LSIF or SCIP, and even if we had perfect tooling for either, we'd still need to convert either format into something native that we could actually make use of. Neither format is actually queryable.
As for the in-memory index representation, here's a proposal. Note: this is a starting point for discussion, and the numbers are hand-counted, so please forgive off-by-one errors.
Given the following file ...and the Elixir stdlib:

```elixir
defmodule MyModule do
  def my_function(a, b) do
    trimmed =
      if String.contains?(a, "evil") do
        String.trim(a)
      else
        a
      end

    a <> b
  end
end
```
| Name | Arity | Type | Subtype | Scope | Start Position | End Position | URI | File Hash | mtime |
|---|---|---|---|---|---|---|---|---|---|
| "String.contains?" | 2 | :function | :definition | nil | {2491, 7} | {2491, 16} | "file:///path/to/string.ex" | md5("string.ex") | mtime("string.ex") |
| "MyModule" | nil | :module | :definition | nil | {1, 1} | {1, 22} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "MyModule.my_function" | 2 | :function | :definition | {MyModule, :my_function, 2} | {3, 4} | {3, 27} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "a" | nil | :variable | :definition | {MyModule, :my_function, 2} | {3, 20} | {3, 21} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "b" | nil | :variable | :definition | {MyModule, :my_function, 2} | {3, 23} | {3, 24} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "String.contains?" | 2 | :function | :reference | {MyModule, :my_function, 2} | {4, 26} | {4, 35} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "String" | nil | :module | :reference | {MyModule, :my_function, 2} | {4, 19} | {4, 25} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
| "trimmed" | nil | :variable | :definition | {MyModule, :my_function, 2} | {4, 7} | {4, 13} | "file:///path/to/project/file.ex" | md5("file.ex") | mtime("file.ex") |
The ETS key would be:
{name, arity, type, subtype}
There's also an issue of lexical scope, which I haven't thought about deeply yet, but it is evident if you search in another file for a variable called "a" or if you want to search for an aliased module name. I've tried to handle it above, but I'm sure there will be edge cases with the way it's handled.
I've tried to solve the "switch branch" issues by including file hashes in the index. This would mean that a single index could support multiple branches and file updates rather easily, though we'd need to do some work in invalidating old indexes, which is why I included the mtime in there as well.
To query, you'd need to do the following:

a. for functions: `{"Fully.Qualified.function", 2, :function, :definition | :reference}`
b. for modules: `{"Fully.Qualified.Module", :_, :module, :definition | :reference}`
c. for variables: `{"var_name", nil, :variable, :definition | :reference} |> filter_by_scope(current_scope)`

Then filter out stale entries by comparing file hashes:

```elixir
Enum.filter(result, fn result -> result.md5 == FileCache.md5_of(result.file) end)
```
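A sketch of how this proposal might look in ETS (a `:bag` table, since many rows can share one `{name, arity, type, subtype}` key; the hash and mtime columns are omitted for brevity, and the table and row shapes here are illustrative, not a committed design):

```elixir
# :bag because multiple rows (e.g. many references) can share the same key.
index = :ets.new(:symbol_index, [:bag])

# Rows keyed by {name, arity, type, subtype}, as proposed above.
:ets.insert(index, [
  {{"MyModule.my_function", 2, :function, :definition},
   {MyModule, :my_function, 2}, {3, 4}, {3, 27}, "file:///path/to/project/file.ex"},
  {{"String.contains?", 2, :function, :reference},
   {MyModule, :my_function, 2}, {4, 26}, {4, 35}, "file:///path/to/project/file.ex"}
])

# Query (a)-style lookup: every reference to String.contains?/2.
refs =
  :ets.match_object(
    index,
    {{"String.contains?", 2, :function, :reference}, :_, :_, :_, :_}
  )
```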
I feel the advantages of starting here are:
The disadvantages are:
@josevalim Thank you for chiming in here.
I've played around with the functions in `Code.Fragment`, and I just can't get them to be useful to us without a lot of work. It's quite possible that I don't understand them well enough to put them to use (they're not documented as well as the rest of Elixir, possibly because they're new), but we've played around with them and they give confusing results in many cases.
(The cursor position is represented by `|`.)

For example, it's not intuitive to me that the surround context of this is `:none`:

```elixir
alias String, as: MyModule
MyModule.function_call(|a, b)
```
Shouldn't the surround context be a function call? or a function call argument?
Similarly, if the cursor is positioned after the variable, the context is a `:local_or_var`, which makes sense, but I have no ability to resolve where that local or var is defined. And if the position is `MyModule.function_call|(a, b)`, then the alias is `MyModule`, which is right, but it would be super helpful if the alias were resolved to its canonical name.
It's also not clear to me why `cursor_context` doesn't take a cursor position (nitpick).
I can provide more examples, but I haven't played around with `Code.Fragment` in a while. I was doing a lot of work with `Code.Fragment` in our completion environment, but switched to using the lexer, which provided results that made more sense to me and actually simplified the code a great deal.
I really appreciate your offers of help here, but there's a big caveat. Even if changes landed in Elixir tomorrow, it'd be two years before we could update Lexical to use the changes, as we want to support two full versions in each release. If we want to integrate any sooner, we might need to do something similar to what you did in GenStage so many years ago: release a small library of experimental features and prefix the module names with `Experimental`. We could then integrate this library into Lexical (Lexical doesn't have the namespace pollution issues that `elixir_ls` has) and keep Lexical up to date with stuff like:

```elixir
alias Experimental.Code, as: Code
```

... now we have access to experimental features via `Code.Fragment`.
We mentioned this in another discussion. The goal of `Code.Fragment` is not to resolve aliases or variables; so far it is exclusively focused on tokenization and parsing (which is also what you get when you call the tokenizer by hand). `cursor_context` is meant to be used for code completion. `surround_context` shows what you need to highlight on mouse-over. Their behaviour was taken from how editors behave for these functionalities.
I strongly do not recommend using private APIs. Please open up issues. :)
Oops, premature send.
My point is that the tools in `Code.Fragment` have specific uses (code completion, mouse-over), so the mismatch may be in trying to use them for other purposes. In order to know if something is inside a function call, we use `container_cursor_to_quoted`.
> we could then integrate this library into lexical
Yeah, if we add it to Elixir, then you can copy and paste an implementation into this project until you can depend on more recent Elixir. :)
@josevalim Thanks again for the response; it seems I was less than clear. I was indeed using `Code.Fragment` for completions, but the problem I was having is that a file amasses context as it is compiled, and that context builds up to the point where you need a completion. This is different from something like `iex`, where the context is immediate and your completion is really only affected by the current execution environment and the current line. As far as I can tell, `Code.Fragment` isn't helpful for completions in the language server sense, but makes sense in the `iex` sense.
I'll open an issue about exposing the lexer and about making Code.Fragment more context-aware so it can handle lsp style completions.
> As far as I can tell, Code.Fragment isn't helpful for completions in the language server sense, but makes sense in the iex sense.
They should be orthogonal concerns. The goal of `cursor_context` is to tell what is being completed; the how is elsewhere. Both IEx and Livebook (more IDE-like) implement their completion with `cursor_context`. That's what I am getting at: `cursor_context` does not care about the environment (macros, aliases, variables) at all, in any of those cases.
Here is some sample code to hopefully help guide the discussion:
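(The original snippet did not survive extraction; below is a hypothetical illustration, not the original, of the kind of dispatch being described, where the first clause "suggests aliases" without knowing where they come from.)

```elixir
# Hypothetical dispatch on Code.Fragment.cursor_context/1 results; the
# suggestion tags are invented. cursor_context only classifies what is being
# completed; resolving actual aliases or variables is the caller's job.
suggest = fn fragment ->
  case Code.Fragment.cursor_context(fragment) do
    # "You should suggest aliases" -- from where is up to the caller.
    {:alias, hint} -> {:suggest_aliases, List.to_string(hint)}
    {:local_or_var, hint} -> {:suggest_locals_and_vars, List.to_string(hint)}
    {:dot, {:alias, mod}, hint} -> {:suggest_members_of, List.to_string(mod), List.to_string(hint)}
    other -> {:unhandled, other}
  end
end

suggest.("alias Other")
# => {:suggest_aliases, "Other"}
suggest.("String.tri")
# => {:suggest_members_of, "String", "tri"}
```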
For example, the first clause says "you should suggest aliases". But it doesn't know or care, in the IEx case (nor any other case), where those aliases come from.
@josevalim
Thank you for your patience and response. I understand your concerns. Regarding the specific usage of private APIs, besides the module mentioned, I am not yet clear about other aspects. Steve may open a new issue to discuss `Code.Fragment`.
> but those are typically easy to compute from the tokens themselves.
Yes, it shouldn't be difficult.
@scohen I think your analysis is correct. However, when I mentioned SCIP and LSIF, it was only in the hope that the data structures stored in `dets` or `ets` could be easily converted into them. But the in-memory fields that you mentioned above should fully meet these requirements, which we agree on.
Currently, these thoughts and discussions are enough to get us moving forward. Thank you very much for your thoughtful analysis and response.
> They should be orthogonal concerns
I might be wrong, but I don't see it that way, for the reason I gave above. iex has one context, which is mutable and constantly being mutated; you only need to keep track of that one. An LSP context, on the other hand, exists wherever an edit occurs and needs to be built up when the cursor lands. So in your example:
```elixir
"""
alias MyModule.Foo, as: OtherModule
alias Other
"""
|> Code.Fragment.cursor_context()
# => {:alias, 'Other'}
```
Now I'm likely missing something, but where's the thing that says: "Here's what the names of the modules are in the given context"? Something, somewhere needs to tell me that `OtherModule` is now a valid module name.
> That's what I am getting at: cursor_context does not care about the environment (macros, aliases, variables) at all, in any of those cases.
That's what I'm tilting at. The completion frameworks of Livebook and IEx do a lot of the heavy lifting. Livebook appears to have been inspired by `elixir_sense` as well (which is kind of a bummer, since elixir_sense doesn't use structs, which are very nice things indeed).
> Now I'm likely missing something, but where's the thing that says: "Here's what the names of the modules are in the given context"? Something, somewhere needs to tell me that OtherModule is now a valid module name.
You are correct. There is nothing in Elixir that does it at the moment (but we want to do so). In practice, you should have two functions:

1. One function that builds the context up to "alias Other" (not provided by Elixir)
2. A call to `Code.Fragment.cursor_context` that will tell you which information you should retrieve from the context you built in 1 (in this case, it will tell you that you should look up the alias information from the context)
Does that make sense now? We are only solving half of the problem, but it doesn't mean the half it solves is not useful!
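To make the split concrete, here is a deliberately naive sketch of the two halves; the regex-based context builder is a toy stand-in for the real work a language server would do (tracking aliases, imports, and variables through the whole file):

```elixir
source = """
alias MyModule.Foo, as: OtherModule
alias Other\
"""

# Half 1 (not provided by Elixir): a toy context builder that collects
# `as:` aliases from the complete lines before the cursor.
{complete_lines, _partial} = source |> String.split("\n") |> Enum.split(-1)

aliases_in_scope =
  for line <- complete_lines,
      [_, target, as] <- Regex.scan(~r/alias\s+(\S+),\s*as:\s*(\w+)/, line),
      into: %{},
      do: {as, target}

# Half 2 (provided by Elixir): classify what is being completed.
{:alias, hint} = Code.Fragment.cursor_context(source)
prefix = List.to_string(hint)

# Combine: suggest in-scope aliases that match the typed prefix.
suggestions =
  for {short, _target} <- aliases_in_scope,
      String.starts_with?(short, prefix),
      do: short

# suggestions == ["OtherModule"]
```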
@josevalim yes, that makes sense, and I would love for the core team to help out with the second thing. I could help too, if needed.
What is your approach for incremental compilation, and how will that apply to your index? When you `mix compile`, you can index everything related to the last compilation. However, as you change files (and potentially save files), your dirty state differs from the last `mix compile` state. Do you plan to have two indexes, one from `mix compile` and another for the dirty state? How do you build the dirty index: do you `Code.eval_file` or `Code.compile_file` that particular file?
@josevalim
Hi, Jose. Thank you for your continuous attention to this project.
Currently, we do not have an index yet. Regarding the Navigation functions, Steve needs some time to work on the technical architecture.
However, we do compile twice now. After opening the project, we perform a full compilation once. Then, when users edit their code, we compile it and publish the diagnostics in real time by using `Code.string_to_quoted` + `Code.compile_quoted`. The main logic starts from here: https://github.com/lexical-lsp/lexical/blob/d6e2508cbd659ebb5db29fc2d45189957af8a8e8/apps/remote_control/lib/lexical/remote_control/build/state.ex#L109
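As a rough sketch of that parse-then-compile flow (greatly simplified relative to the linked `state.ex`; the module name and return shapes here are invented):

```elixir
defmodule DiagnosticsSketch do
  # Parse first to catch syntax errors cheaply, then compile the AST to
  # surface compilation errors; either failure becomes a diagnostic tuple.
  def check(source, file \\ "nofile") do
    case Code.string_to_quoted(source, columns: true, file: file) do
      {:ok, quoted} ->
        try do
          Code.compile_quoted(quoted, file)
          :ok
        rescue
          e in CompileError -> {:diagnostic, {e.line, nil}, e.description}
        end

      {:error, {location, message, token}} ->
        {:diagnostic, location, format_message(message, token)}
    end
  end

  # string_to_quoted error messages are either a binary or a 2-tuple.
  defp format_message({prefix, suffix}, token), do: prefix <> token <> suffix
  defp format_message(message, token) when is_binary(message), do: message <> token
end

DiagnosticsSketch.check("defmodule Ok do\nend")
# => :ok
DiagnosticsSketch.check("defmodule Broken do")
# returns a {:diagnostic, location, message} tuple
```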
Lexical has already provided numerous features that assist us in coding, such as real-time diagnostics, completion, formatting, and so forth. From my perspective, these functionalities have achieved a fairly stable level, and they indeed make the process of writing Elixir code more enjoyable.
However, our support for code reading still needs to be enhanced. To make reading code easy and enjoyable, powerful navigation functionality can improve the experience and is crucial for reading complex code. Go to definition, find references, workspace/document symbols, implementations, and hover doc are all parts of the navigation functionality, and even folding range is included (though I personally prefer the implementation of tree-sitter's folding). But at present, we have only implemented the go-to-definition.
Next, I plan to implement the Find references feature. Nevertheless, we need to consider numerous factors, including user experience and technical implementation issues.
Goal
In terms of user experience, the time spent reading code exceeds the time spent writing it for many people. So I hope the navigation features can be used instantly as soon as the editor is opened, instead of waiting for the compilation to complete. I want these functionalities to be precise, consistent, and comprehensive. Lastly, I look forward to effortlessly achieving rich code navigation in the future.
There are many ways to implement navigation. For go-to-definition or find-references, we can choose to read the BEAM files to get `debug_info` directly, or we can just use the compiler tracer. However, `debug_info` lacks column information, which would cause us much inconvenience, so it can only serve as an auxiliary method. Moreover, the Elixir core team is also willing to provide more support regarding the compiler tracer. Therefore, my preliminary idea is: for modules and functions, we use the compiler tracer, while for other elements, we continue to use ElixirSense. For variable `a`, the effects would look like this:

Precise, consistent, and complete

Aiming for these goals, the compiler tracer has performed well. Notably, not all files, such as script files, are compiled by `mix compile`. Therefore, we must compile .exs files at a certain point for full completeness. It's frustrating to fail to find references in test files. The compiler tracer currently only offers location information, not range information, so we need to transform the location into range data at some point. This might involve reading files, which is best done on the server node. A challenge is identifying the functions and modules at the cursor position and accurately determining what's being referenced by comparing against the index.
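For the location-to-range step specifically, here is a hedged sketch of one way to do it, assuming we can read the source lines; the `RangeSketch` helper and its contract are invented for illustration:

```elixir
defmodule RangeSketch do
  # lines: the file's lines; location: 1-based {line, column} from the tracer;
  # name: the referenced function or module name. Returns {{l, c}, {l, c}}.
  def to_range(lines, {line, column}, name) do
    text = Enum.at(lines, line - 1)
    token = to_string(name)

    # Sanity check: the token really does start at the reported column.
    ^token = binary_part(text, column - 1, byte_size(token))

    {{line, column}, {line, column + String.length(token)}}
  end
end

lines = ["defmodule Sample do", "  def run, do: String.trim(\"  hi  \")"]
RangeSketch.to_range(lines, {2, 23}, :trim)
# => {{2, 23}, {2, 27}}
```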
As for functions, they're simpler to address. Let's revisit the previous example: when checking the tracer information, we will find several references on line 7. And if our cursor is in the range of `{7, 5}` to `{7, 10}`, which is the reference to the function, then `Code.Fragment.surround_context(c, {7, <5-10>})` will always return:

However, modules need more consideration. The tracer only provides the event of the last module, and the index may not be what we want. For example, if we define a module like this:
Then reference this module:

When your cursor is at the `Another.|IO.puts()` position, this is not a problem. But if your cursor is on `Anothe|r.IO.puts()`, you actually want to go to the `My` module. However, its result is also `Another.IO`, which is incorrect. We need to accurately determine which module is currently under the cursor. Here, `:elixir_tokenizer.tokenize` should be helpful:

Additionally, we need to consider the issue of modules being aliased. Here, we need to collect all alias events and convert them into a map. For example: `{:alias, [line: 3, column: 3], My, Another, [as: Another]}`.
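Collapsing those `:alias` trace events into a lookup map might look like this; the event shapes follow the example above, and the module name is invented:

```elixir
defmodule AliasMapSketch do
  # Each :alias trace event has the shape {:alias, meta, module, as, opts}.
  # We map the short (as) name back to the canonical module.
  def to_map(events) do
    for {:alias, _meta, module, as, _opts} <- events, into: %{} do
      {as, module}
    end
  end
end

events = [
  {:alias, [line: 3, column: 3], My, Another, [as: Another]},
  {:alias, [line: 4, column: 3], MyModule.Foo, Foo, []}
]

AliasMapSketch.to_map(events)
# => %{Another => My, Foo => MyModule.Foo}
```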
Can use Navigation immediately after launching the editor
For this goal, we need caching and to ensure that stored information can be quickly loaded after starting the project node. After loading Navigation information, a series of mix tasks will be executed. I may choose dets as the storage format for each project. Additionally, Steve has mentioned:
I think he means https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#workspace_didChangeWatchedFiles
Switching branches is tricky: much index information may become stale or need to be corrected after the switch. I have yet to come up with a good solution.
Rich Code Navigation
Regarding rich code navigation, I actually hope that we consider this from the beginning. As mentioned in LSIF, "Wouldn't it be cool if you could do this without first cloning the repo?" This is really cool and practical, but I would prefer to implement the conversion from cached dets to SCIP from the beginning. Yes, SCIP, not LSIF. I know LSIF is Microsoft's standard, but its benefits for us right now are limited. On the other hand, SCIP has better tooling and can be converted into LSIF with one command. rust-analyzer also supports SCIP as a priority.
This article on Writing an indexer mentions how to develop an indexer. I have tried Snapshot, which is convenient and helpful for mining some edge test cases. If you want to learn more about it, there is another article here: SCIP - a better code indexing format than LSIF
Actions
After all that has been said, I will take action. I prefer the vertical-slicing approach to work and will consider the three objectives comprehensively. I should start with the simplest case, local functions, and support more after all processes are running smoothly. If you have further suggestions, I am happy to listen and discuss them.
cc @scohen @josevalim