Performance problems in large codebases

kubukoz commented 4 months ago

Hi!

My team maintains a "specbase" of about 1200 Smithy files, containing about 6700 non-member shapes. In the recent months, as the repository has grown, we've started noticing significant performance problems - which keep growing.

Here's a real-time recording of me trying to add a small structure to the model.

https://github.com/smithy-lang/smithy-language-server/assets/894884/7dd0c874-26f8-45d1-a8e6-451fa7334ee8

As you can see, all the feedback is significantly delayed. An educated guess is that this is due to a combination of the following:

lots of shapes (obviously, but arguably even 10k shapes isn't that much when you compare it to Java class counts in an average codebase, for example)
some suboptimal validators (e.g. https://github.com/disneystreaming/smithy4s/issues/1319)
the LSP server loading & validating the model on every textDocument/didChange notification
lack of cancellation support in the server.

Can we make some steps to improve the situation? I have these suggestions right off the bat:

Handle cancellation of requests
Deduplicate concurrent didChange notifications (or perhaps just loading the model concurrently altogether)
Consider not reloading the model on didChange, doing it only in didSave instead.

Baccata commented 4 months ago

Consider not reloading the model on didChange, doing it only in didSave instead.

I think this should be pursued no matter what : recompiling on didChange is a pretty bad pattern.

Baccata commented 4 months ago

From what I gather, the current behaviour of didChange stores the buffered contents into a map, such that completion requests can retrieve the token that's being edited, before looking it up in the collection of known shapes.

So I think removing the recompilation and reporting instruction from the didChange implementation would suffice to drastically improve performance.

kubukoz commented 3 months ago

Can we have some direction on this? I would start with @Baccata's suggestion.

milesziemer commented 3 months ago

Yea that's pretty bad. All the suggestions sound like viable solutions - but I think LSP4J can actually handle the first two for us. I looked through their docs and some old issues, and found this. It looks like most if not all request/notification handler methods do all their work synchronously, so sequential didChange (or any event) just pile up if they don't complete right away. The easiest thing to try I think would be what that issue suggests, wrap our didChange in a CompletableFuture:

@Override
public void didChange(DidChangeTextDocumentParams params) {
    CompletableFuture.runAsync(() -> {
        // ... Current method body
    });
}

I need to set up a large code base to see what impact this has. I'll do this now. It might require some other changes elsewhere to make sure everything works asynchronously.

In regards to 3:

Consider not reloading the model on didChange, doing it only in didSave instead.

and

I think this should be pursued no matter what : recompiling on didChange is a pretty bad pattern.

Yea I agree mostly, recompiling on every change is definitely a huge waste. I think the issues it causes though can mostly be addressed by not blocking each subsequent didChange events on the completion of each prior event. We can still recompile on change, but if we can cancel that action when a new didChange notification comes in and start a new one, it shouldn't ever lag like in the video. And you maintain functionality of other features like getting diagnostics for what you've written without needing to save.

kubukoz commented 3 months ago

For the record, language servers vary in their handling of compilation and didChange. I checked a couple that I've used in the past, and:

Metals (Scala) provides syntax errors on change, and semantic diagnostics on save (it triggers a whole module's compilation), and completions without save (but this uses a "cheaper" code path, which I think may be working on an individual compilation unit / file, interacting with the compiler directly instead of through a module-wide build server)
Rust Analyzer provides syntax errors on change, inlay hints on change, but semantic diagnostics only on save
Typescript provides all sorts of diagnostics on change.

Notably, typescript's LS and Rust Analyzer are very fast in what they do, and as we've seen Smithy's build times are suboptimal and heavily affected by how the validators in scope are written.

That is to say, with the current state of affairs I would suggest that the Smithy LS only perform (at most!) basic checks in didChange: syntax errors are fair game I think, and mayybe loading the model with no external validators (maybe this part could be configurable?), but I'd say we should save validators for the didSave notification.

As for cancellation, that might work, but the code probably needs to be aware of InterruptedException (I'm not 100% sure, haven't worked with that side of Java a lot to be honest), so moving things around between didChange and didSave might get us an improvement with less effort.

Baccata commented 3 months ago

syntax errors are fair game I think, and mayybe loading the model with no external validators (maybe this part could be configurable?), but I'd say we should save validators for the didSave notification.

If it was me, I'd tackle it iteratively : separating the "basic" validation from the validators will probably require more code.

As for cancellation, that might work, but the code probably needs to be aware of InterruptedException (I'm not 100% sure, haven't worked with that side of Java a lot to be honest)

I don't think @milesziemer was talking about proper cancellation (as in interrupting an ongoing procedure), more about debouncing (which is a common practice when key-strokes are involved). You essentially schedule a callback to trigger after a small duration (could be something like 2 seconds), and if new events come before the timer has expired, you un-schedule the previous callback and schedule a new one. See an example of java debouncer.

milesziemer commented 3 months ago

I think there's two issues here with the same symptoms:

Building the model in large code bases can be slow, especially when there lots of validators and/or sub-optimal validators
The language server handles all requests and notifications (request -> something like hover, notification -> something like didChange) synchronously, so if it receives a bunch of didChange, it's going to wait for each one to complete until moving on to the next one

Both can lead to the language server being slow to provide feedback. If it takes forever to build the model, or if the language server has received a bunch of sequential requests, the client will be kept waiting. Right now, the language server relies entirely on the model to provide its features, so 1 exacerbates 2.

syntax errors are fair game I think, and mayybe loading the model with no external validators (maybe this part could be configurable?), but I'd say we should save validators for the didSave notification.

I see what you mean, even if requests don't get backed up waiting for previous requests to complete, if the model takes forever to build you won't get quick feedback as you type. As @Baccata said, it's definitely a bit more involved to do, but I think it is possible. We could try disabling validation when we load the model on didChange - I think that would still get syntax errors but I'd need to check. Or we load the model without validation, send any events back to the client, then trigger validation. Either way, if validation is a bottleneck, not doing it would get feedback to the client faster (even if the feedback contains less information), and we should be trying to get feedback to the client as fast as possible.

As for cancellation, that might work, but the code probably needs to be aware of InterruptedException (I'm not 100% sure, haven't worked with that side of Java a lot to be honest)

I don't think @milesziemer was talking about proper cancellation (as in interrupting an ongoing procedure), more about debouncing (which is a common practice when key-strokes are involved). You essentially schedule a callback to trigger after a small duration (could be something like 2 seconds), and if new events come before the timer has expired, you un-schedule the previous callback and schedule a new one. See an example of java debouncer.

Yea you're right, looking at it now, my suggestion didn't actually cancel the previously dispatched didChange handler. It just allowed a new didChange event to be handled without waiting for the previous didChange to complete. So if the language server gets 3 didChange, you only wait as long as it takes for the latest one to complete, rather than all three sequentially. This also frees up the server to accept other requests while the didChange handler is executing. So it doesn't help with getting feedback faster for a single build, and the server can consume much more resources if multiple didChange handlers are going at once.

I think we need to address both of these issues, but it sounds like addressing 1 first might be best. I did some testing last night in a project with ~100k non-member shapes and didn't get the same kind of slowdown. Not an accurate one-to-one comparison, but worth investigating where the bottleneck is. Have you tried disabling that validator you mentioned? It looks like its going through every shape referenced by an @adtMember, for each of those, all the shapes that are targeted by the @adtMember, and for each of those every member in the entire model. I'm not sure how many members have the @adtMember trait, but this probably contributes. An idea to help improve the performance of this validator would be to use NeighborProviderIndex which computes and caches the relationships between shapes, so you don't need to look at every member in the model. Take a look at DefaultTraitValidator for an example of its usage. You also might be able to just directly check if adtParent has the correct member, using Shape::getMember. Maybe there's some other reason this doesn't work though that I'm not seeing.

Baccata commented 3 months ago

To be clear : unlike @kubukoz's, my organisation doesn't run the adtMember validator in our large smithy repo, and still experiences the sluggishness of the editor. Though I can certainly accept that optimising validators is a worthwhile endeavour, I still think that removing the compilation on "didChange" or at least debouncing it (to prevent multiple successive expensive runs) would be easily achieved and very beneficial.

milesziemer commented 3 months ago

I've just put up a PR that should address this issue as best we can short-term (not to undersell the improvements - but we can still do better). It has a combination of the ideas in this issue - thank you for the suggestions. This won't build locally right now (it needs some Smithy changes), but when it can, you're welcome to build it locally and see how it is. I'll give an update when that's possible (and instructions to do so, probably a readme entry). Thanks for your patience

smithy-lang / smithy-language-server

Performance problems in large codebases #145