nvms / wingman

Your pair programming wingman. Supports OpenAI, Anthropic, or any LLM on your local inference server.
https://marketplace.visualstudio.com/items?itemName=nvms.ai-wingman
ISC License

Using the new version with local open source backends #29

Open · synw opened this issue 11 months ago

synw commented 11 months ago

Hi. I have checked the new version of Wingman, and I am quite disappointed by it: there is no first-class support for local models and backends. Apparently the extension was built around the way the big platforms work. I see that the concept of a prompt template is not present, and this is a pain for local usage. When running local models, it is very important to be able to use the correct template format for the model if you want good results. For now I must duplicate every prompt and manually add the template format for each model/template type: the prompt and the template are not separated here, since the big players' APIs do not need templating.

About the presets, it seems confusing to me because it mixes up several concepts: the provider, with its API type and connection params; the inference params (which depend on each query, not on the provider); and the system message, which is a template concept. Some templates have a system message and some do not, like Mistral for example.

What about implementing prompt template support? The idea would be to be able to use the predefined prompts with different template formats. If it can help, I have made the Modprompt library that does this, supporting many generic template formats as well as few-shot prompts, which are often useful when working with small models.

About the providers, I have made the Locallm library that supports multiple providers behind a single API: the Llama.cpp server, Koboldcpp and Ollama. It may help to simplify implementing support for these providers if you wish.
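
To give an idea of the concept (a rough sketch only, not the exact Locallm API; the names below are simplified for illustration), the point is that a client like Wingman would only ever talk to one interface, and each backend adapter would translate the unified parameters into its own request format:

// Illustrative only: LmProvider, InferParams and the field names here are
// simplified, not taken verbatim from the Locallm library.
interface InferParams {
  temperature?: number;
  maxTokens?: number;
  stop?: string[];
}

interface LmProvider {
  name: "llamacpp" | "koboldcpp" | "ollama";
  infer(prompt: string, params: InferParams): Promise<string>;
}

// Example adapter for Ollama's generate endpoint (default port 11434),
// translating the unified params into Ollama's own option names.
const ollama: LmProvider = {
  name: "ollama",
  async infer(prompt, params) {
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "mistral", // whichever model is pulled locally
        prompt,
        stream: false,
        options: {
          temperature: params.temperature,
          num_predict: params.maxTokens,
          stop: params.stop,
        },
      }),
    });
    const data = await res.json();
    return data.response; // Ollama returns the generated text in "response"
  },
};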

nvms commented 11 months ago

I'm sorry to hear you're disappointed by it.

Long message incoming because I want to make sure I address all of your comments.

there is no first-class support for local models and backends. Apparently the extension was built around the way the big platforms work. I see that the concept of a prompt template is not present, and this is a pain for local usage. When running local models, it is very important to be able to use the correct template format for the model if you want good results.

Well, I guess this depends on your definition of "first-class support". I am using the current version of Wingman locally with Llama, Llama 2, Orca, and Mistral fine-tunes using LM Studio without any issues -- and each of these models uses a different prompt format.

LM Studio's inference server accepts the responsibility of prompt formatting (which is honestly a relief), so the client doesn't need to concern itself with this. Here's an example:

(screenshot of LM Studio's prompt format settings)

The only requirement LM Studio imposes is to use a client that can send/receive in the OpenAI request/response format, which Wingman does. In my opinion, model-specific prompt formatting should absolutely be handled where the model is hosted. It just seems like an obvious design choice. That said, I'm not opposed to surfacing client-side prompt formatting controls in the UI and making them optional.
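
To illustrate, this is roughly all the client has to send (a minimal sketch: the port, endpoint base and model name are whatever your local server is configured with, 1234 being LM Studio's default if I recall correctly):

// Minimal sketch of an OpenAI-style chat completion request against a local
// inference server. The base URL and model name are assumptions: use whatever
// your server exposes (LM Studio's local server defaults to http://localhost:1234/v1).
async function chat(userMessage: string): Promise<string> {
  const res = await fetch("http://localhost:1234/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "local-model", // ignored or used for routing, depending on the server
      messages: [
        { role: "system", content: "You are a helpful coding assistant." },
        { role: "user", content: userMessage },
      ],
      temperature: 0.2,
    }),
  });
  const data = await res.json();
  // The server applies the model-specific prompt template before inference,
  // so the client only ever deals with the messages array.
  return data.choices[0].message.content;
}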

For now I must duplicate every prompt and manually add the template format for each model/template type: the prompt and the template are not separated here, since the big players' APIs do not need templating.

That is obviously more than annoying. Is the lack of client-side prompt formatting the only thing preventing you from using Wingman with local models? What inference server are you running that requires you to format the prompt from the client side?

About the presets, it seems confusing to me because it mixes up several concepts: the provider, with its API type and connection params; the inference params (which depend on each query, not on the provider)

The API type being linked with connection details for an endpoint that implements that API type seems fairly logical to me.

Without knowing more details about your workflow, I'm having a hard time understanding why presets are confusing. They are a collection of sane defaults for a particular provider, and can be duplicated for all of the various provider permutations you might have.

..., and the system message, which is a template concept. Some templates have a system message and some do not, like Mistral for example.

FWIW I'm using https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B and it does recognize a system message. Anyways, would you like to see an option to disable the system message at the preset level?

Mistral w/ system message (LM Studio logs behind):

(screenshot)

the inference params (which depend on each query, not on the provider)

Given that different providers support completion parameters that do roughly the same thing, but are labeled differently (e.g., max_tokens and max_tokens_to_sample), it seemed logical to me to organize these very-provider-specific parameters along with the provider that expects to see them.
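
As a concrete illustration of what I mean by provider-specific naming (a simplified sketch, not Wingman's actual internals):

// Simplified sketch: mapping one generic setting onto the field name each
// provider's API expects. Not Wingman's actual implementation.
type Provider = "openai" | "anthropic";

function completionParams(provider: Provider, maxTokens: number, temperature: number) {
  switch (provider) {
    case "openai":
      // OpenAI-compatible APIs expect "max_tokens"
      return { max_tokens: maxTokens, temperature };
    case "anthropic":
      // Anthropic's completion API expects "max_tokens_to_sample"
      return { max_tokens_to_sample: maxTokens, temperature };
  }
}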

It sounds like you want to be able to define completion parameters at the prompt level. You can already do this using the {{:temperature:0.5}} format. This is documented in the README. These overrides take priority over preset default values. If the completion parameter is unknown (e.g. {{:bad_param:1}}) it is ignored. "Known" or "unknown" is determined by the API specification of the provider being used. For example, an inference server that implements the OpenAI request/response format is also expected to support the same completion parameters as that provider.
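
For example, something like the following (the temperature override is straight from the README; the max_tokens one is just an illustration and only takes effect because an OpenAI-compatible provider knows that parameter name):

// Hypothetical prompt text using the documented {{:param:value}} override syntax.
// The placement and the max_tokens override are assumptions for illustration only.
const prompt =
  "{{:temperature:0.2}}{{:max_tokens:256}}\n" +
  "Explain what the selected code does, step by step.";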

About the providers, I have made the Locallm library that supports multiple providers behind a single API: the Llama.cpp server, Koboldcpp and Ollama. It may help to simplify implementing support for these providers if you wish.

I'd like to support the big platforms as well for obvious reasons. With Locallm would I need to continue to maintain support for them myself while handling the others with Locallm, or do you plan to support them as well?

What about implementing prompt template support? The idea would be to be able to use the predefined prompts with different template formats. If it can help, I have made the Modprompt library that does this, supporting many generic template formats as well as few-shot prompts, which are often useful when working with small models.

https://github.com/lgrammel/modelfusion was suggested to me as a solution for abstracting both the provider communication as well as prompt formatting on the client side of things. It seems the only provider support it is missing is KoboldCpp. If that's the only provider I'd need to manually support, I'm fine with that. It also supports multi-modal vision models like gpt-4-vision-preview or Ollama multi-modal (https://github.com/jmorganca/ollama/pull/1216), which could be very useful in the context of a tool like Wingman. Think UML DB or class diagrams, infrastructure diagrams, etc., e.g.: provide a prototypical diagram of a service architecture, and have the model generate a terraform configuration that defines these services for you. Support for this feels natural for Wingman, and I hope to implement this in the near future. Do you have plans to support this in Locallm?
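
For reference, the request shape for that kind of multi-modal flow is already simple on the OpenAI side. A sketch (the API key, image data and instruction are placeholders, and the actual wiring in Wingman is still to be designed):

// Sketch of a gpt-4-vision-preview style request: a diagram image plus an
// instruction, asking the model to produce a Terraform configuration.
async function diagramToTerraform(imageBase64: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4-vision-preview",
      max_tokens: 1024,
      messages: [
        {
          role: "user",
          content: [
            {
              type: "text",
              text: "Generate a Terraform configuration for the services in this architecture diagram.",
            },
            {
              type: "image_url",
              image_url: { url: `data:image/png;base64,${imageBase64}` },
            },
          ],
        },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}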

I'm hoping that a standard emerges soon that moves the responsibility of prompt formatting to the inference server where the model host (usually yourself) can define it there, as in the case of LM Studio. Again, as mentioned in the very beginning, I'm currently working with various local LMs that all expect different prompt formats with absolutely no issues.

Regardless, a solution like Locallm or modelfusion that unifies all the provider communication stuff is absolutely the way forward, I just need to pick a solution that makes the most sense for the extension.

nvms commented 11 months ago

I want to mention that I realize LM Studio is not open source, but the only feature it has that I believe solves your first issue is prompt formatting -- a feature I'm sure an open source inference server could support out of the box, although I'm not aware of one offhand because LM Studio is all I'm using.

Everything else it's doing is pretty unremarkable.

capdevc commented 11 months ago

Re: prompt templates, I think the community is moving in the direction of that being handled by the model server. They can be built into the gguf model files via the support for arbitrary metadata, but I don't think that there's a standard for doing that currently. Outside of embedding the template in the gguf file, Ollama provides a way to add it to the model via a Dockerfile-like model definition file like this:

FROM llama2:13b
TEMPLATE """[INST] {{ if and .First .System }}<<SYS>>{{ .System }}<</SYS>>

{{ end }}{{ .Prompt }} [/INST] """
SYSTEM """"""
PARAMETER stop [INST]
PARAMETER stop [/INST]
PARAMETER stop <<SYS>>
PARAMETER stop <</SYS>>

docs are at https://github.com/jmorganca/ollama/blob/00d06619a11356a155362013b8fc0bc9d0d8a146/docs/modelfile.md

This pushes the responsibility for managing that to ollama rather than an API caller like Wingman.
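
(For completeness: if I remember correctly, a definition file like the one above is registered with something along the lines of "ollama create my-llama2 -f Modelfile", after which a client just requests that model name and the template is applied server-side.)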

FWIW, modelfusion also provides some prompt template handling for servers like llama.cpp that don't handle it themselves due to design decisions or lack of a templating engine etc.

I've created a new issue (#30) for discussion about the modelfusion stuff. I agree that it may be better to separate out the Preset parameters that are provider API configuration vs. completion/chat request parameters.

synw commented 11 months ago

Thanks for your answers; it's good to have different points of view. I should give more details about my workflow and how I see things.

What inference server are you running that requires you to format the prompt from the client side?

Now I use Koboldcpp and the Llama.cpp server locally, and sometimes Ollama. The first two engines do not support templating. I run my own frontend that has a template editor, and send the formatted prompt to the inference server.
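
Concretely, what my frontend does looks roughly like this (simplified sketch: the Mistral instruct format is hard-coded just for the example, and 8080 is the Llama.cpp server's default port):

// Simplified example: wrap the raw instruction in the Mistral instruct template
// client-side, then send the already-formatted prompt to the Llama.cpp server.
function formatMistral(instruction: string): string {
  return `<s>[INST] ${instruction} [/INST]`;
}

async function complete(instruction: string): Promise<string> {
  const res = await fetch("http://localhost:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt: formatMistral(instruction),
      n_predict: 512,
      temperature: 0.2,
      stop: ["</s>"],
    }),
  });
  const data = await res.json();
  return data.content; // the Llama.cpp server returns the text in "content"
}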

In my opinion, model-specific prompt formatting should absolutely be handled where the model is hosted

I don't think so, for several reasons.

You can check out the InferGui frontend to see how I handle the templating client side. The concepts are organized like this:

The provider and parameters are managed by Locallm: note that the lib provides a unified format for parameters, and it translates them to the provider-specific format before sending them. Are the parameters embedded in the prompts in the current version of Wingman provider-specific? It would be nice to be able to decouple the params format from the providers.

The templates are managed by the modprompt library; check out the data types here: https://github.com/synw/modprompt#types to see how it is organized.
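
To show the idea without tying it to modprompt's exact types (a rough sketch with made-up names, not the real library API), a template is basically a set of wrappers that the prompt and its shots get cloned into:

// Rough illustration of separating the prompt from the template format.
// The shapes and names here are invented for the example, not modprompt's types.
interface Template {
  system?: (msg: string) => string;
  turn: (user: string, assistant: string) => string; // a completed few-shot turn
  user: (msg: string) => string; // the final, open turn
}

const alpaca: Template = {
  system: (m) => `${m}\n\n`,
  turn: (u, a) => `### Instruction:\n${u}\n\n### Response:\n${a}\n\n`,
  user: (u) => `### Instruction:\n${u}\n\n### Response:\n`,
};

const mistral: Template = {
  // no system slot in the Mistral instruct format
  turn: (u, a) => `<s>[INST] ${u} [/INST] ${a}</s>`,
  user: (u) => `<s>[INST] ${u} [/INST]`,
};

// The same prompt and few-shot example, rendered into whichever format the model needs.
function render(tpl: Template, system: string, shots: [string, string][], prompt: string): string {
  const head = tpl.system ? tpl.system(system) : "";
  const body = shots.map(([u, a]) => tpl.turn(u, a)).join("");
  return head + body + tpl.user(prompt);
}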

I'd like to support the big platforms as well for obvious reasons. With Locallm would I need to continue to maintain support for them myself while handling the others with Locallm, or do you plan to support them as well?

There is no plan to support the proprietary APIs in the Locallm lib, no. I don't have API keys and don't use the big players' model APIs. Locallm is a small lib with the limited responsibility of giving you one API to handle Llama.cpp, Koboldcpp and Ollama. It has good support for the first two, and more limited support for Ollama, as the latter does not yet support GBNF grammars, for example, and its multimodal support is on the way. Locallm itself already has support for multimodal and GBNF grammars (check the GUI).

I did not know about Modelfusion; I need to try it out. It looks like a framework that does many things.

To provide more context about my Wingman usage, here is an example use case: I have a custom prompt with an example shot, and it is quite long. I want to try it out on different models: let's say I have Koboldcpp running a Mistral model on my machine, and a Llama.cpp server running a DeepSeek Coder model on my phone, accessible over the local network. Ideally I would just have to switch provider, without having to modify anything in my prompt, and select a template format for the model: this would automatically clone the prompt to the desired format, including the example shot.

They can be built into the gguf model files via the support for arbitrary metadata, but I don't think that there's a standard for doing that currently

Interesting. That would solve some problems.

RyzeNGrind commented 8 months ago

Hello, this project seems super cool. I would like to add some suggestions, as I am currently using a proprietary VSCodium-based solution called Cursor, but I am looking to switch to building my own VSCodium- or Theia-based AI-enabled IDE.

  1. I came across this project called Gateway. It seems like a good tool for providing a single, unified API interface across many models.

  2. There is also this inference server tool from Nvidia called Triton Inference Server. It has lots of support for machine learning models and frameworks as well.

I think these would help relieve some of the maintenance and technical debt burden you may face in trying to expand horizontally to support more models, especially "first-class support" for local/self-hosted models, while enabling access to cloud providers at the same time.

Wingman seems really cool so I will be trying it out and seeing how I can perhaps contribute as well.