Language Server Index Format

dbaeumer commented 6 years ago

The purpose of the Language Server Index Format (LSIF) is to define a standard format for language servers or other programming tools to dump their knowledge about a workspace. This dump can later be used to answer LSP requests for the same workspace without running the language server itself. Since much of the information would be invalidated by a change to the workspace, the dumped information typically excludes requests used when mutating a document. So, for example, the result of a code complete request is typically not part of such a dump.

A first draft of a specification is available here

rcjsuen commented 6 years ago

@dbaeumer Not sure if I'm misinterpreting something here, but is it correct to say that this is something for LSP clients to implement and not for servers to implement?

LaurentTreguier commented 6 years ago

I understood it as something dumped by servers, and then used by clients afterwards, so both would have their part to implement

rcjsuen commented 6 years ago

Hm...good point. I had interpreted it as more of a caching system for the LSP client. However, I guess the first sentence should've made it clear to me... :(

The purpose of the Language Server Index Format (LSIF) is to define a standard format for language servers or other programming tools to dump their knowledge about a workspace.

Well then...!

dbaeumer commented 6 years ago

@rcjsuen no, the idea is that this is dumped either by the server or by a separate tool. As mentioned in the spec, I have already written a tool for TypeScript and a generic extension that serves the dump via LSP to any kind of LSP client. I will make these open source in the next couple of days.

tsmaeder commented 6 years ago

So if I understand this correctly, this is supposed to be a library for helping language server implementers solve the problem of maintaining an index? If so, why does it need a specification as opposed to simply a documentation of the code? What are the reuse cases?

dbaeumer commented 6 years ago

Looks like I was pretty bad at explaining it: the goal is that we can produce an index to answer LSP requests for read-only workspaces without firing up a language server specific to the programming language. There will be one generic language server that can serve the index. Furthermore, the index will make it possible to relate symbols across repositories. See the demo here where Jonathan navigates from the use of observable to the actual definition in source.
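
As a rough illustration of serving a read-only request from a dump, the sketch below answers a go-to-definition query from an in-memory list of elements. The shapes `RangeVertex` and `DefinitionEdge` and the function `gotoDefinition` are simplified assumptions made for this example; they are not the vocabulary of the LSIF specification.

```typescript
// Simplified, hypothetical dump shapes (not the vocabulary of the LSIF specification).
type Id = number | string;

interface Pos { line: number; character: number; }

interface RangeVertex {
  id: Id; type: "vertex"; label: "range";
  uri: string; start: Pos; end: Pos;
}

interface DefinitionEdge {
  id: Id; type: "edge"; label: "definition";
  outV: Id; // the range where the symbol is used
  inV: Id;  // the range where the symbol is defined
}

type DumpElement = RangeVertex | DefinitionEdge;

// True if position p lies inside range r (both ends treated as inclusive here).
function containsPos(r: RangeVertex, p: Pos): boolean {
  const afterStart = p.line > r.start.line ||
    (p.line === r.start.line && p.character >= r.start.character);
  const beforeEnd = p.line < r.end.line ||
    (p.line === r.end.line && p.character <= r.end.character);
  return afterStart && beforeEnd;
}

// Answer "go to definition" from the dump alone; no language server is started.
function gotoDefinition(dump: DumpElement[], uri: string, pos: Pos): RangeVertex | undefined {
  const ranges = dump.filter((e): e is RangeVertex => e.type === "vertex");
  const edges = dump.filter((e): e is DefinitionEdge => e.type === "edge");
  const source = ranges.filter(r => r.uri === uri && containsPos(r, pos))[0];
  if (source === undefined) { return undefined; }
  const edge = edges.filter(e => e.outV === source.id)[0];
  if (edge === undefined) { return undefined; }
  return ranges.filter(r => r.id === edge.inV)[0];
}

// Made-up sample data: a use in app.ts that points at a definition in lib.ts.
const sampleDump: DumpElement[] = [
  { id: 1, type: "vertex", label: "range", uri: "file:///app.ts",
    start: { line: 4, character: 10 }, end: { line: 4, character: 20 } },
  { id: 2, type: "vertex", label: "range", uri: "file:///lib.ts",
    start: { line: 0, character: 16 }, end: { line: 0, character: 26 } },
  { id: 3, type: "edge", label: "definition", outV: 1, inV: 2 },
];

const definitionTarget = gotoDefinition(sampleDump, "file:///app.ts", { line: 4, character: 12 });
console.log(definitionTarget ? definitionTarget.uri : "no definition found");
```

The linear scans keep the sketch short; a server holding a large dump would presumably index ranges by document and id first.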

svenefftinge commented 6 years ago

Sorry, still not sure I got it. Is it some sort of cache middleware for language servers? If so why does it need to be part of the protocol?

mickaelistria commented 6 years ago

If so why does it need to be part of the protocol?

I am glad the LSP protocol also includes some technical proposals, middleware or intermediary formats to allow combination of different language servers.

So this index format will come with an implementation of a language server able to process multiple indexes to return results? How is this "composite index-based language server" expected to know which LS to retrieve indexes from or how to get indexes? And the LSP will be extended so that an existing LS could provide the ability to dump an index and then shut down, assuming the index wouldn't have to change?

svenefftinge commented 6 years ago

to allow combination of different language servers

From where did you draw this would be supported? It would indeed be useful if language servers could access a common cross-languages index.

I am glad the LSP protocol also includes some technical proposals, middleware or intermediary formats

I was not implying it should not, but wanted to understand why and how it depends on the LSP (technically).

dbaeumer commented 5 years ago

Some clarifications: the LSIF will not be part of the protocol itself since it is not a protocol. What might be part of the protocol are requests to ask a server to dump its state.

Why did we decide to put it here: the LSIF is based on LSP data types. The questions that are answerable by a dump are typical LSP requests useful on a read only workspace (for example goto definition, find all references).

Yes, we have developed a generic language server that can read in many indices and serve LSP requests on them. So it can serve a C# index in parallel to a TS index. I will make the TS index generator and the generic language server with a VS Code extension public soon. Will add a message here when available.

dbaeumer commented 5 years ago

Here we go:

https://github.com/Microsoft/lsif-typescript for the TS specific index dumper
https://github.com/Microsoft/vscode-lsif-extension for the extension. Please note that the extension is currently relying on proposed API and can therefore not be installed into VS Code directly

tsmaeder commented 5 years ago

I'm still a bit hazy on the motivation here: are you trying to solve a latency problem? Also, what's the use case for "read-only workspaces"?

dbaeumer commented 5 years ago

It is more about repositories and published versions. My projects usually depend on many other npm packages, each at a certain version. To navigate and browse them there is no need to spin up a whole language server (no need for code complete, signature help, ...). If there were an index, it would be relatively cheap to serve these and to support navigating to them even without cloning the repository locally.

dbaeumer commented 5 years ago

Shown here to navigate from one source base to mobx sources on Github.

tsmaeder commented 5 years ago

It is more about repositories and published versions. In my projects I usually have dependencies to many other npm packages

Ahh...precomputed indexes. We've been thinking about this for jdt for a long time :-) However, if you think about maven, for example, the dependencies are not in the workspace tree (like npm deps). I'll have to think about this, but I'm not sure this is not better solved inside the language server.

ShaneDelmore commented 5 years ago

I would love to use this to precompute indexes in CI. For devs working in large repos this would be very helpful.

dbaeumer commented 5 years ago

@tsmaeder if you look at the specification, this is basically split into two passes. The LS or the language tool will generate monikers specific to the tool. A linker tool will make them package manager specific by consulting other information. We even split this for TS / npm to demonstrate that embedding this into the language tool is problematic. So the idea is more one of a compiler and then a symbol linker.
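
A minimal sketch of that two-pass idea, under stated assumptions: the `Moniker` and `PackageInfo` shapes, the `tsc` scheme name, the `lib/api` path, and the identifier layout are all hypothetical and only meant to show the division of labour between the language tool and the linker.

```typescript
// Hypothetical shapes; the specification's actual moniker data may differ.
interface Moniker { scheme: string; identifier: string; }

// What the linker learns from package metadata (for npm, e.g. from package.json).
interface PackageInfo { manager: string; packageName: string; }

// Pass 1: the language tool emits monikers that only make sense to that tool.
function emitToolMoniker(relativeFile: string, exportedSymbol: string): Moniker {
  return { scheme: "tsc", identifier: `${relativeFile}:${exportedSymbol}` };
}

// Pass 2: a separate linker rewrites them into package manager terms,
// without needing to understand the programming language at all.
function linkMoniker(m: Moniker, pkg: PackageInfo): Moniker {
  return { scheme: pkg.manager, identifier: `${pkg.packageName}:${m.identifier}` };
}

const toolMoniker = emitToolMoniker("lib/api", "observable");
const linkedMoniker = linkMoniker(toolMoniker, { manager: "npm", packageName: "mobx" });
console.log(linkedMoniker.scheme, linkedMoniker.identifier);
```

Keeping the second pass separate means the same language tool output could, in principle, be linked against a different package manager's metadata.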

felixfbecker commented 5 years ago

Is the JSON graph format built on any standard JSON graph representation? I would assume there are already existing formats for that which would be nice to build on, since there may be existing tools that can read/generate them (e.g. save in database, query, visualise, build, etc.).

dbaeumer commented 5 years ago

@felixfbecker yes and no: it uses the same property names as graphSON does, like label, inV, and outV, but doesn't fully emit standard graphSON. The reason is that graphSON is optimized so that the output can be processed by different servers, whereas we focused on easy and early emit. But the TSC example contains a graphSON emitter: https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/emitters/graphSON.ts#L1. I haven't tested it for a while, though.
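
To make the "easy and early emit" point concrete, here is a sketch of an emitter that serializes each vertex or edge as soon as it is known, one JSON object per line. The property names `label`, `inV`, and `outV` come from the comment above; the `LineEmitter` class, the labels used in the example, the one-object-per-line layout, and the incrementing ids are assumptions for illustration, not the tsc-lsif implementation.

```typescript
type Id = number | string;

class LineEmitter {
  private nextId = 1;
  // Stands in for an output stream; a real tool would write to a file or stdout.
  readonly lines: string[] = [];

  vertex(label: string, props: Record<string, unknown>): Id {
    const id = this.nextId++;
    this.write({ id, type: "vertex", label, ...props });
    return id;
  }

  edge(label: string, outV: Id, inV: Id): Id {
    const id = this.nextId++;
    this.write({ id, type: "edge", label, outV, inV });
    return id;
  }

  // Each element is serialized immediately, so the emitter never has to
  // hold the whole graph in memory before producing output.
  private write(element: object): void {
    this.lines.push(JSON.stringify(element));
  }
}

const emitter = new LineEmitter();
const docId = emitter.vertex("document", { uri: "file:///app.ts" });
const rangeId = emitter.vertex("range", {
  start: { line: 4, character: 10 },
  end: { line: 4, character: 20 },
});
emitter.edge("contains", docId, rangeId);
console.log(emitter.lines.join("\n"));
```

Because an edge only refers to ids that were already written, a consumer can also process the output line by line.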

svenefftinge commented 5 years ago

I still don't understand how the index-based LS and the 'real' LS would work together. Having a brief look at the repos you shared, it looks like they would simply both be registered for the same language. If that is the case I don't understand how e.g. find references would work. Could you say a few words about that or point me to some information?

dbaeumer commented 5 years ago

That depends on how we would at the end decide how the indexer is run. We haven't made any decision on this. Options are:

- standalone tool.
  - Then you need two servers and someone to hot swap them
  - or you need to pre-index
  - or you need to wait
- embedded in the language server. Then after the state is dumped it could be served from there
- or some other combination

I am fully open for ideas here.

tsenart commented 5 years ago

embedded in the language server. Then after the state is dumped it could be served from there

With this approach, could we extend LSIF to be a write-through cache? This would support incremental lazy LSIF cache filling which could be merged with asynchronous full "dumps" over time.

If you squint hard enough, this would look similar to the Lambda Architecture, where incoming LSP requests that lazily fill the LSIF cache would map to the Streaming and Serving layers and the asynchronous full "dumps" would map to the Batch layer.

tsenart commented 5 years ago

Addendum: In theory, from what I can see, even a generalized LSIF caching proxy for different language servers would work. So there wouldn't be a need to change each language server individually.
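
A sketch of what such a proxy might look like, assuming a read-only workspace (so no invalidation) and a synchronous backend for brevity. `CachingProxy`, `LspRequest`, and the key format are hypothetical names for this example; the cache is filled lazily from live answers and can also absorb a batch of precomputed entries from a full dump.

```typescript
// Hypothetical request shape; real LSP requests carry richer parameters.
interface LspRequest { method: string; uri: string; line: number; character: number; }

// Stands in for a real language server: takes a request, returns its answer.
type BackendHandler = (req: LspRequest) => unknown;

class CachingProxy {
  private cache: Record<string, unknown> = {};
  private backend: BackendHandler;
  hits = 0;

  constructor(backend: BackendHandler) {
    this.backend = backend;
  }

  static keyOf(req: LspRequest): string {
    return JSON.stringify([req.method, req.uri, req.line, req.character]);
  }

  handle(req: LspRequest): unknown {
    const key = CachingProxy.keyOf(req);
    if (Object.prototype.hasOwnProperty.call(this.cache, key)) {
      this.hits++;
      return this.cache[key];
    }
    const answer = this.backend(req); // cache miss: ask the real server
    this.cache[key] = answer;         // lazily fill the cache
    return answer;
  }

  // Entries from a full, separately produced dump overwrite lazily cached ones.
  mergeDump(entries: Record<string, unknown>): void {
    for (const key of Object.keys(entries)) {
      this.cache[key] = entries[key];
    }
  }
}

let backendCalls = 0;
const proxy = new CachingProxy(req => {
  backendCalls++;
  return `${req.method} answered for ${req.uri}`;
});

const hoverRequest: LspRequest = {
  method: "textDocument/hover", uri: "file:///app.ts", line: 4, character: 12,
};
proxy.handle(hoverRequest); // forwarded to the backend
proxy.handle(hoverRequest); // served from the cache
console.log(`backend calls: ${backendCalls}, cache hits: ${proxy.hits}`);
```

Since the proxy only sees requests and answers, it does not need to know which language the wrapped server handles.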

matklad commented 5 years ago

Might be a good idea to take a look at the kythe schema, which seems to serve a similar-ish purpose. The primary difference is that kythe does try to define symbols & references, while LSIF works purely on the Ranges & Offsets. That is, you can map the kythe model to LSIF, but not vice versa. Which is a good thing: LSIF seems much simpler and can be used directly!

dbaeumer commented 5 years ago

@matklad we looked at kythe and other symbol databases and then purposely decided not to use one. Mainly for the reasons you pointed out.

robinp commented 5 years ago

@dbaeumer @matklad I put together a quick list of first impression differences between LSIF and Kythe: https://gist.github.com/robinp/76f9d3d91387da5162f773895d4e1d15. Disclaimer: I don't know much about LSP/LSIF other than browsing the spec and the query docs a bit, and somewhat biased towards Kythe due to previous work with it, so offset that.

yaohaizh commented 5 years ago

One use case of LSIF for Java is that before the language server has initialized, which might take some time, the client can use the LSIF to unblock some smartness scenarios immediately after the user opens the workspace.

akaroml commented 5 years ago

One use case of LSIF for Java is that before the language server has initialized, which might take some time, the client can use the LSIF to unblock some smartness scenarios immediately after the user opens the workspace.

This can be very useful for the warm load case. The language server knowledge can be persisted with LSIF in the previous session. And the knowledge can be used to enable basic language server features like symbol navigation in the new session before the actual language server finishes loading the project.

@fbricon This is something we would like to try for the Java language server.

dbaeumer commented 5 years ago

@yaohaizh nice use case.

dbaeumer commented 5 years ago

@robinp thanks a lot for the comparison. Some first feedback to the feedback:

- I want to reiterate that LSIF (like LSP) does not provide any symbol information. This is by design. It provides data structures to navigate code using editor abstractions (see https://microsoft.github.io/language-server-protocol/overview). Therefore it doesn't need to specify any programming language specific constructs. The downside is of course that the LSP and LSIF can only answer questions that are foreseen.
- We have an issue for providing a compressed format. We started with the verbose JSON one since it is easier to read and understand.
- Regarding "At Google, such approach ran into problems due to codebase size": can you provide more insights here? We chose that approach since we thought it is more scalable, because code can be indexed independently.
- Regarding having two libraries: we are discussing in which form the moniker needs to have version information to better support this.
- We are also discussing whether an edge should have a scope property to support different versions of the same file in different contexts (for example header files in cpp).
- Scalability: since LSP scales for large projects, I am confident that LSIF does as well, assuming that we have a more compact format than JSON.

You might be also interested in https://github.com/Microsoft/lsif-typescript/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc

tsmaeder commented 5 years ago

One problem I see with LSIF is that some queries depend on knowing the whole program. Imagine you index a maven project that declares an interface with a method foo(). When we try to find implementers of "foo", the answer depends on what the user has open in their workspace. It's even worse: the language server might determine that a particular declaration of "foo" is not an implementation (maybe because it's from a different version of the project, not the one the code in the workspace compiles against).

robinp commented 5 years ago

@dbaeumer Thanks! Sounds fair. Re problems due to codebase size I meant what you list as next, the multi-version moniker issue, where references can become ambiguous without exact versioning. I didn't mean some performance issue.

dbaeumer commented 5 years ago

@tsmaeder we discussed this lately and one idea was that LSP adds support to resolve a moniker and that we could have a global index where these can be mapped to LSIF dumps.
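
One way to picture such a global index, purely as an assumption-laden sketch: a registry that maps a moniker to every dump known to mention it. The `GlobalMonikerIndex` class, the key format, the maven-style identifier, and the example.org URLs are all hypothetical and not part of LSP or LSIF.

```typescript
// Hypothetical shapes: neither the key format nor this registry is part of LSP or LSIF.
interface Moniker { scheme: string; identifier: string; }

class GlobalMonikerIndex {
  private dumpsByMoniker: Record<string, string[]> = {};

  private static keyOf(m: Moniker): string {
    return `${m.scheme}:${m.identifier}`;
  }

  // Record that the dump at dumpUrl contains information about moniker m.
  register(m: Moniker, dumpUrl: string): void {
    const key = GlobalMonikerIndex.keyOf(m);
    const known = this.dumpsByMoniker[key] || [];
    if (known.indexOf(dumpUrl) === -1) { known.push(dumpUrl); }
    this.dumpsByMoniker[key] = known;
  }

  // Return every dump that may hold definitions, references, or
  // implementations for the moniker; the caller then queries those dumps.
  resolve(m: Moniker): string[] {
    return (this.dumpsByMoniker[GlobalMonikerIndex.keyOf(m)] || []).slice();
  }
}

const globalIndex = new GlobalMonikerIndex();
const fooMoniker: Moniker = { scheme: "maven", identifier: "com.example:lib:Service.foo" };
globalIndex.register(fooMoniker, "https://example.org/dumps/lib.lsif");
globalIndex.register(fooMoniker, "https://example.org/dumps/app.lsif");
console.log(globalIndex.resolve(fooMoniker));
```

This leaves open the versioning question raised above: two versions of the same project would need distinguishable monikers, or the registry would return dumps the workspace does not actually compile against.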

mpickering commented 5 years ago

I have now implemented a program which generates LSIF indices for Haskell files. I have two main concerns so far about the format.

- For a 100 line file, my program has produced a (once formatted) 23000 line JSON file. I can't imagine how big the output is going to be when I try to index a project like GHC which is over 100 000 lines. It could be that my output could be compressed in some way, I didn't make a big effort to do that yet but the preliminary signs are worrying.
- The format is not very compositional. The assumption seems to be that a single project produces one lsif.json file. This would be fine, if there was an easy way to combine together lots of index files without doing lots of recomputation but I can't think of an elegant way to do this and be confident that ids will point to the correct things. So for a big project if you change one of the files you have to completely regenerate the lsif.json file from scratch.

I also don't understand the bit in the specification about imports/exports but there's another issue #680 about that already.

zfy0701 commented 5 years ago

@dbaeumer this is awesome for LSP! The biggest concern I have is the numerical value of the vertex id. It has several limitations as far as I can see:

- if the indexer is not perfect, it commonly crashes and needs to recover, and using a numerical value that is incremented by the indexer makes it hard to resume from the middle
- for a large repo, it's hard to run multiple indexing tasks in parallel
- it's hard to do the indexing incrementally

I think the goal of this project is basically similar to Google's kythe project, which uses something called a vname to identify a vertex. They basically use a string value that can be generated deterministically (https://kythe.io/docs/schema/writing-an-indexer.html). I would suggest that we at least make it a string and let users override the way ids are generated.

Also, I would suggest that there should be some dump options, e.g. we may just want to index references information

dbaeumer commented 5 years ago

@zfy0701 the protocol defines the id as number | string and the tsc-lsif tool has support to emit UUIDs. See https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/shared/protocol.ts#L12 and --id here: https://github.com/Microsoft/lsif-typescript/blob/master/tsc-lsif/src/main.ts#L19
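
For illustration only, here is what a deterministically generated string id could look like. Per the comment above, tsc-lsif's `--id` option emits UUIDs; the content-derived scheme below (`stableRangeId` and its format) is a different, assumed approach in the spirit of the suggestion, shown because it lets independent workers agree on an id without coordinating.

```typescript
// The protocol defines an id as number | string (see the comment above).
type Id = number | string;

interface Pos { line: number; character: number; }

// Hypothetical scheme: derive the id from what the vertex describes, so the
// same input yields the same id no matter which worker emits it, or when.
function stableRangeId(uri: string, start: Pos, end: Pos): Id {
  return `range:${uri}#${start.line}:${start.character}-${end.line}:${end.character}`;
}

// Two independent indexing tasks (or a restarted one) arrive at the same id.
const idFromWorkerA = stableRangeId("file:///app.ts",
  { line: 4, character: 10 }, { line: 4, character: 20 });
const idFromWorkerB = stableRangeId("file:///app.ts",
  { line: 4, character: 10 }, { line: 4, character: 20 });
console.log(idFromWorkerA === idFromWorkerB, idFromWorkerA);
```

The trade-off is size: ids like these are much longer than small integers, which matters for a format whose verbosity is already a concern.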

dbaeumer commented 5 years ago

@mpickering I agree that the current version is too verbose and I have an item for this. It is https://github.com/Microsoft/lsif-typescript/issues/4.

I started to prototype a compressed JSON format that is fully array based and self describing. Will ping if I have something to comment on.

Regarding composition: the idea is that projects can be parsed independently and that import / export results can be used to link symbols between them. I will continue on https://github.com/Microsoft/language-server-protocol/issues/680 and look into implementing that for TypeScript.

jdneo commented 5 years ago

@tsmaeder we discussed this lately and one idea was that LSP adds support to resolve a moniker and that we could have a global index where these can be mapped to LSIF dumps.

@dbaeumer Does that mean the moniker should contain the version info for a project with different versions?

dbaeumer commented 5 years ago

@jdneo see the discussion here: https://github.com/Microsoft/lsif-typescript/issues/10

dbaeumer commented 5 years ago

I will close the issue now that we have lsif-node

microsoft / language-server-protocol

Language Server Index Format #623