This can be done with a mix of inline suggestions and other VS Code extension functionality.
The flow we want for handling the above user requirements is the following program event loop:

- if the user makes a change, reset the context for word prediction
- predict the next word
- add the prediction to the editor

We can break this up into chunks that are each handled by the VS Code extension API.
The inline suggestions API notifies us whenever the user changes the document and a new suggestion is required. This handles the "if user makes change" block.
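A minimal sketch of how this could be wired up through the inline completion API; the `requestPrediction` helper (its name, signature, and behavior) is an assumption for illustration, not something that exists in the codebase yet:

```typescript
import * as vscode from 'vscode';

// Placeholder for the real prediction requestor (assumption for illustration).
// In the real extension this would talk to the prediction backend.
async function requestPrediction(
  prefix: string,
  token: vscode.CancellationToken
): Promise<string | undefined> {
  return undefined;
}

export function activate(context: vscode.ExtensionContext) {
  const provider: vscode.InlineCompletionItemProvider = {
    async provideInlineCompletionItems(document, position, _context, token) {
      // VS Code calls this whenever the user edits, which covers the
      // "if user makes change" step: the old context is implicitly stale.
      const prefix = document.getText(
        new vscode.Range(new vscode.Position(0, 0), position)
      );
      const prediction = await requestPrediction(prefix, token);
      if (!prediction || token.isCancellationRequested) {
        return [];
      }
      return [
        new vscode.InlineCompletionItem(prediction, new vscode.Range(position, position)),
      ];
    },
  };
  context.subscriptions.push(
    vscode.languages.registerInlineCompletionItemProvider({ pattern: '**' }, provider)
  );
}
```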
Given that predictions arrive much less often than the program event loop runs, it makes sense to shift prediction to a separate thread, i.e. a prediction requestor/receiver thread. Finally, adding the prediction to the editor is fairly simple but uses a lot of VS Code extension APIs that I'm not personally familiar with.
From the looks of it, https://github.com/microsoft/vscode-extension-samples/blob/main/decorator-sample/src/extension.ts pretty much handles this. We can also do fancy things with colors and highlighting to represent the various parts of the suggestions we are making.
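For the rendering side, something along the lines of that decorator sample could look like this; the styling choices are only placeholders:

```typescript
import * as vscode from 'vscode';

// Ghost-text-like styling for the suggested continuation (colors are placeholders).
const suggestionDecoration = vscode.window.createTextEditorDecorationType({
  after: {
    color: new vscode.ThemeColor('editorGhostText.foreground'),
    fontStyle: 'italic',
  },
});

// Render a suggestion after the given position in the editor.
function showSuggestion(editor: vscode.TextEditor, position: vscode.Position, text: string) {
  const decoration: vscode.DecorationOptions = {
    range: new vscode.Range(position, position),
    renderOptions: { after: { contentText: text } },
  };
  editor.setDecorations(suggestionDecoration, [decoration]);
}

// Clear the suggestion, e.g. when the user types something that no longer matches it.
function clearSuggestion(editor: vscode.TextEditor) {
  editor.setDecorations(suggestionDecoration, []);
}
```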
There are a number of things that will also have to be handled, like clearing out previous suggestions when the user isn't taking them. Ideally we'd be fairly flexible about how the user's input is matched against the suggestions; for example, we don't want to clear the entire suggestion just because the user types a mismatched space or something. We'd also want to memoize predictions. However, I think these are all later features that aren't necessary for the core inference speedup this issue is addressing.
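As a rough sketch of the lenient matching and memoization ideas; the whitespace-only heuristic and the cache shape are assumptions, not a settled policy:

```typescript
// Treat the user's typing as still matching the suggestion if the only
// differences are whitespace (a rough heuristic, not a final policy).
function stillMatches(typed: string, suggestion: string): boolean {
  const normalize = (s: string) => s.replace(/\s+/g, '');
  return normalize(suggestion).startsWith(normalize(typed));
}

// Memoize predictions by their prefix so repeated contexts don't trigger
// a fresh round of inference.
const predictionCache = new Map<string, string>();

function cachedPrediction(prefix: string): string | undefined {
  return predictionCache.get(prefix);
}

function rememberPrediction(prefix: string, prediction: string): void {
  predictionCache.set(prefix, prediction);
}
```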
I'm switching the IPC from Unix pipes to an HTTP-based interface. This both simplifies communication, since there are extensive existing codebases built around request/response patterns, and future-proofs us for remote, server-based deployments of the underlying LLM.
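A minimal sketch of what the request side of that HTTP interface might look like, assuming a Node runtime with global `fetch`; the endpoint, port, and JSON shape are assumptions, not the actual protocol:

```typescript
// Assumed local endpoint and payload shape for the prediction server.
const PREDICTION_URL = 'http://127.0.0.1:8000/predict';

interface PredictionResponse {
  completion: string;
}

// Ask the server for a completion of the given prefix. Using HTTP keeps the
// extension decoupled from whether the LLM runs locally or on a remote host.
async function fetchPrediction(prefix: string, signal?: AbortSignal): Promise<string> {
  const response = await fetch(PREDICTION_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prefix }),
    signal,
  });
  if (!response.ok) {
    throw new Error(`Prediction server returned ${response.status}`);
  }
  const data = (await response.json()) as PredictionResponse;
  return data.completion;
}
```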
The biggest usability problem right now is the massive latency. On an M1 Pro MacBook Pro, each inference can take as long as a second. This is a very poor user experience.
It appears that the vast majority of the latency comes from the transformer's generation stage (many forward passes, one per generated token) rather than from a single forward pass, as I previously thought.
This leads me to a potential upgrade to the user experience. If I can see that the suggestions the model is making are really bad, I should be able to start typing immediately and stop the model from wasting time rolling out a bad suggestion. If it's doing fairly well, I can let it run until I stop liking the output, then hit tab, which would stop it from generating more. By giving the user visibility into what the model is thinking and doing, we hide the latency inside the user's own reading of the model's output, i.e. we reduce the time the user sits around getting angry at the extension.
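One way to realize the "stop wasting time on bad rollouts" idea is to stream tokens and abort the in-flight request as soon as the user types again. A rough sketch, assuming the server exposes a streaming endpoint (the URL and payload are assumptions):

```typescript
import * as vscode from 'vscode';

// Abort controller for whatever generation request is currently in flight.
let inflight: AbortController | undefined;

// Cancel generation the moment the user keeps typing, so the model doesn't
// keep rolling out a suggestion the user has already rejected.
vscode.workspace.onDidChangeTextDocument(() => {
  inflight?.abort();
});

// Stream tokens from the assumed streaming endpoint, surfacing them as they
// arrive, until either the server finishes or the user interrupts.
async function streamPrediction(prefix: string, onToken: (t: string) => void): Promise<void> {
  inflight = new AbortController();
  const response = await fetch('http://127.0.0.1:8000/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prefix }),
    signal: inflight.signal,
  });
  if (!response.body) {
    return;
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { value, done } = await reader.read();
      if (done) break;
      onToken(decoder.decode(value, { stream: true }));
    }
  } catch {
    // An abort here just means the user started typing again.
  }
}
```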
Architecture pending