gsuuon / model.nvim

Neovim plugin for interacting with LLMs and building editor-integrated prompts.
MIT License

llama.cpp usage #13

Closed Vesyrak closed 9 months ago

Vesyrak commented 9 months ago

Hey,

First of all, thanks for working on this! I was trying to get the local llamacpp provider working, without much success. I tried to debug what was going wrong, but my lack of Lua knowledge makes this slow and difficult.

What I did to try and get it working was the following:

gsuuon commented 9 months ago

Hi, thanks for checking it out! Did you skip the llama.cpp build step? The starter assumes a working llama.cpp ./main, so you should already be able to run ./build/bin/Release/main [--opts] (or wherever you've set main_dir to be) in the llama.cpp directory. I can make this more explicit in the readme or starter comments if that's the missing step. That said, I'd recommend using a REST API server on top of llama.cpp, since the start-up time for each request can take a good while - unless you're specifically intending to experiment with different flags to ./main via llm.nvim prompts.
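
For reference, here's a rough sketch of a prompt using the builtin llamacpp CLI provider - the option names (main_dir, args) follow the starter, so treat the exact shape as an assumption and check the starter/readme for your version:

  -- Sketch only: points the CLI provider at a built llama.cpp checkout.
  -- `main_dir` and `args` mirror the starter config; exact option names may differ.
  local llamacpp = require('llm.providers.llamacpp')

  require('llm').setup({
    default_prompt = {
      provider = llamacpp,
      options = {
        main_dir = '/path/to/llama.cpp', -- directory containing the built ./main
        args = { '-m', 'models/codellama-13b-instruct.Q5_K_S.gguf', '-ngl', '80' }
      },
      builder = function(input)
        return { prompt = input }
      end
    }
  })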

JoseConseco commented 9 months ago

llama.cpp has built-in server support now (since about 2 months ago?):

  ./server -m ./models/codellama-13b-instruct.Q5_K_S.gguf -ngl 80

-ngl runs the model partially on the GPU (it's very fast - 10 lines of code in 1-2 seconds!), and the above model is supposed to be GPT-3.5 quality ;) I tried to make llm.nvim work with:

  {
    "gsuuon/llm.nvim",
    config = function()
      local llm = require('llm')
      local curl = require('llm.curl')
      local util = require('llm.util')
      local provider_util = require('llm.providers.util')

      local M = {}

      ---@param handlers StreamHandlers
      ---@param params? any Additional params for request
      ---@param options? { model?: string }
      function M.request_completion(handlers, params, options)
        local model = (options or {}).model or 'bigscience/bloom'

        -- TODO handle non-streaming calls
        return curl.stream(
          {
            -- url = 'https://api-inference.huggingface.co/models/', --.. model,
            url = 'http://127.0.0.1:8080/completion',
            method = 'POST',
            body = vim.tbl_extend('force', { stream = true }, params),
            headers = {
              -- Authorization = 'Bearer ' .. util.env_memo('HUGGINGFACE_API_KEY'),
              ['Content-Type'] = 'application/json',
              -- ['data'] = '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}',

            }
          },
          function(raw)
            provider_util.iter_sse_items(raw, function(item)
              local data = util.json.decode(item)

              if data == nil then
                handlers.on_error(item, 'json parse error')
                return
              end

              if data.token == nil then
                if data[1] ~= nil and data[1].generated_text ~= nil then
                  -- non-streaming
                  handlers.on_finish(data[1].generated_text, 'stop')
                  return
                end

                handlers.on_error(data, 'missing token')
                return
              end

              local partial = data.token.text

              handlers.on_partial(partial)

              -- We get the completed text including input unless parameters.return_full_text is set to false
              if data.generated_text ~= nil and #data.generated_text > 0 then
                handlers.on_finish(data.generated_text, 'stop')
              end
            end)
          end,
          function(error)
            handlers.on_error(error)
          end
        )
      end

      require('llm').setup({
        hl_group = 'Substitute',
        prompts = util.module.autoload('prompt_library'),
        default_prompt = {
          provider = M,
          options = {
            -- model = 'bigscience/bloom'
          },
          params = {
            return_full_text = false
          },
          builder = function(input)
            return { inputs = input }
          end
        },
      })
    end,
  },

based on 'Adding your own provider' - https://github.com/gsuuon/llm.nvim/blob/main/lua/llm/providers/huggingface.lua. But now when I try to run :Llm it throws an error: [screenshot]

Configuring llm.nvim is a bit hard, I'm not sure what I did wrong. Do I have to write my own prompts for it? I know there is an OpenAI compatibility server script for llama.cpp (so you have to run two servers: [llamacpp server] -> [OAI compatibility server] -> [nvim gpt plugin]), which lets you use OpenAI plugins with llama doing the work. But I would rather make it work directly with the llama server.

gsuuon commented 9 months ago

Hi Jose! You know, I think it'd probably be better to simply remove the llamacpp cli provider and switch it to targeting the llamacpp server directly. Outside of playing around with llamacpp flags, the cli provider won't be very useful. I assumed most people would just use an openai compat server, but that does add another dependency and setup step.

JoseConseco commented 9 months ago

I think I'm targeting the llama server directly - with curl.stream(), is that what you meant?

In any case, I managed to make some progress:

https://github.com/gsuuon/llm.nvim/assets/13521338/ef8abbee-bc54-4f51-b9c0-01d36dc4229e

As you can see, Code Llama is very fast (thanks to compiling llama.cpp with cuBLAS - CUDA support).

EDIT: OK, I found out that we can indeed override on_finish() in the setup section: [screenshot] Still without formatting - it just outputs everything into the buffer.

{
    "gsuuon/llm.nvim",
    config = function()
      local llm = require "llm"
      local curl = require "llm.curl"
      local util = require "llm.util"
      local provider_util = require "llm.providers.util"
      local llamacpp = require "llm.providers.llamacpp"

      local M = {}

      ---@param handlers StreamHandlers
      ---@param params? any Additional params for request
      ---@param options? { model?: string }
      function M.request_completion(handlers, params, options)
        local model = (options or {}).model or "bigscience/bloom"
        -- vim.print(params)

        -- TODO handle non-streaming calls
        return curl.stream({
          -- url = 'https://api-inference.huggingface.co/models/', --.. model,
          url = "http://127.0.0.1:8080/completion",
          method = "POST",
          body = vim.tbl_extend("force", { stream = true }, params),
          headers = {
            -- Authorization = 'Bearer ' .. util.env_memo('HUGGINGFACE_API_KEY'),
            ["Content-Type"] = "application/json",
            -- ['data'] = '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}',
          },
        }, function(raw)

          provider_util.iter_sse_items(raw, function(item)
            local data = util.json.decode(item)

            if data == nil then
              handlers.on_error(item, "json parse error")
              return
            end

            if data.token == nil then
              if data ~= nil and data.content ~= nil then
                -- non-streaming
                -- write data.content into active buffer
                -- vim.api.nvim_put({ data.content }, "c", true, true) -- would put the output at the cursor
                handlers.on_finish(data.content, "stop")
                return
              end

              handlers.on_error(data, "missing token")
              return
            end

            local partial = data.token.text

            handlers.on_partial(partial)

            -- We get the completed text including input unless parameters.return_full_text is set to false
            if data.generated_text ~= nil and #data.generated_text > 0 then
              handlers.on_finish(data.generated_text, "stop")
            end
          end)

        end, function(error)
          handlers.on_error(error)
        end)
      end

      local segment = require "llm.segment"
      require("llm").setup {
        hl_group = "Substitute",
        -- prompts = util.module.autoload "prompt_library",
        default_prompt = {
          provider = M,
          options = {
            -- model = 'bigscience/bloom'
          },
          params = {
            return_full_text = false,
          },
          builder = function(input)
            return {
              prompt = llamacpp.llama_2_format {
                messages = {
                  input,
                },
              },
            }
          end,
          mode = {
            on_finish = function (final) -- somehow contains a partial result... for llamacpp
              -- vim.notify('final: ' .. final)
              vim.api.nvim_put({final}, "c", true, true) -- put the output at the cursor
            end,
            on_partial = function (partial)
              vim.notify(partial)
            end,
            on_error = function (msg)
              vim.notify('error: ' .. msg)
            end
          }

        },
        prompts = {
          -- ask = {
          --   provider = M,
          --   hl_group = "SpecialComment",
          --   params = {
          --     return_full_text = false,
          --   },
          --   builder = function(input)
          --     return { inputs = input } -- will output gibberish
          --   end
          -- },
        },
      }
    end,
  }, 
gsuuon commented 9 months ago

I meant that I should remove the current llamacpp provider which uses the CLI and just have it talk to the server instead.

You're very close! Just call handlers.on_partial with data.content instead of on_finish. Calling on_finish is just necessary for post-completion transformers that are on the prompt (like extracting markdown). Do you want to open a PR to change the llamacpp provider to the server? Otherwise, I'll likely do that sometime today.
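
Roughly, adapting the handler from your snippet (assuming the server's streamed items carry a content string and a stop flag - double-check against the actual /completion output):

  -- Sketch: inside M.request_completion, declare an accumulator before the
  -- curl.stream call...
  local completion = ''

  -- ...then in the stream callback, feed each chunk to on_partial and only
  -- finish once the server signals stop:
  local function on_raw(raw)
    provider_util.iter_sse_items(raw, function(item)
      local data = util.json.decode(item)

      if data == nil then
        handlers.on_error(item, 'json parse error')
        return
      end

      if data.content ~= nil then
        completion = completion .. data.content
        handlers.on_partial(data.content)
      end

      if data.stop then
        handlers.on_finish(completion, 'stop')
      end
    end)
  end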

gsuuon commented 9 months ago

Re: your edit -- mode is not necessary for implementing a new provider, it's just there if none of the default modes work for your use-case. You can hook into each part of the provider and do your own thing (e.g., I don't have chat in a sidebar implemented but you could add that as a new mode by overriding these). This is something you would use on the prompt side, not the provider side. This is helpful feedback, I'll clarify the design in the readme - it's not a great explainer at the moment.
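
To illustrate, a prompt-side mode is just a table of StreamHandlers - here's a hypothetical sketch that collects the response into a scratch split instead of writing inline:

  -- Hypothetical: a mode that accumulates the streamed text and opens it in a
  -- vertical scratch split when the response finishes.
  local chunks = {}

  local scratch_mode = {
    on_partial = function(partial)
      table.insert(chunks, partial)
    end,
    on_finish = function()
      local buf = vim.api.nvim_create_buf(false, true) -- unlisted scratch buffer
      vim.api.nvim_buf_set_lines(buf, 0, -1, false, vim.split(table.concat(chunks), '\n'))
      vim.cmd('vsplit')
      vim.api.nvim_win_set_buf(0, buf)
      chunks = {}
    end,
    on_error = function(err, label)
      vim.notify(vim.inspect(err) .. ' ' .. (label or ''), vim.log.levels.ERROR)
    end
  }

  -- then on a prompt: { provider = M, mode = scratch_mode, builder = ... }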

JoseConseco commented 9 months ago

@gsuuon I'm slowly getting there, but now, with or without mode, the response is automatically removed after the last server response:

https://github.com/gsuuon/llm.nvim/assets/13521338/37e1ba26-7081-4fd3-85a7-49093eaf5cbc

I mean, I can just undo to bring it back, but it is not optimal. Also, I noticed undo removes it word by word rather than removing the whole accumulated server response. I guess it would be cool if you could make it so that undo removes the whole server reply rather than chunk by chunk.

Edit: Yes, I can make a PR when this works at least somewhat. My target would be to make it so that llm.nvim takes:

gsuuon commented 9 months ago

That's the purpose of the :LlmDelete command - I couldn't figure out a nice way to do that with the nvim api (wouldn't mind a PR here as well, since undo is more intuitive). :LlmDelete will put back any text that was replaced.

The input argument given to the prompt builder gets the entire buffer if you haven't selected anything; it'll only be the selected text if there's a selection. This part should be left to the prompt to handle. I'm not sure what you mean by input box -- do you mean the command arguments? Like :Llm myprompt additional instructions here? That's exposed as context.args.
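
e.g. a rough builder sketch (assuming the builder receives the context table as its second argument, like the huggingface example):

  -- Sketch: combine the selection/buffer input with the extra command args
  -- from :Llm myprompt <instructions>.
  builder = function(input, context)
    local instruction = context.args or ''
    return {
      prompt = llamacpp.llama_2_format({
        messages = { instruction .. '\n' .. input }
      })
    }
  end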

Optional selected lines would be a nice feature but should be a separate PR from updating the llamacpp provider if you do tackle it.

gsuuon commented 9 months ago

@JoseConseco Oh btw, you can apply this patch to prevent the completion from disappearing - it's because on_finish is being called with an empty string but we can just move that higher up and let the provider abstraction handle it with a sane default.

diff --git a/lua/llm/provider.lua b/lua/llm/provider.lua
index 8e5af0f..a0ebc10 100644
--- a/lua/llm/provider.lua
+++ b/lua/llm/provider.lua
@@ -30,7 +30,7 @@ M.mode = {

 ---@class StreamHandlers
 ---@field on_partial (fun(partial_text: string): nil) Partial response of just the diff
----@field on_finish (fun(complete_text: string, finish_reason?: string): nil) Complete response with finish reason
+---@field on_finish (fun(complete_text?: string, finish_reason?: string): nil) Complete response with finish reason. Leave complete_text nil to just use concatenated partials.
 ---@field on_error (fun(data: any, label?: string): nil) Error data and optional label

 local function get_segment(input, segment_mode, hl_group)
@@ -232,12 +232,19 @@ end
 local function request_completion_input_segment(handle_params, prompt)
   local seg = handle_params.context.segment

+  local completion = ""
+
   local cancel = start_prompt(handle_params.input, prompt, {
     on_partial = function(partial)
+      completion = completion .. partial
       seg.add(partial)
     end,

     on_finish = function(complete_text, reason)
+      if complete_text == nil or string.len(complete_text) == 0 then
+        complete_text = completion
+      end
+
       if prompt.transform == nil then
         seg.set_text(complete_text)
       else

And then you would change the on_finish call to just be on_finish() in the provider
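
i.e. in your stream handler, instead of handlers.on_finish(data.content, "stop") you'd have:

  -- signal completion and let the concatenated partials be used as the full text
  handlers.on_finish()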

JoseConseco commented 9 months ago

@gsuuon on_finish() being called with an empty string arg - that was my guess, but I did not have time to make it work (by concatenating strings like you showed above and feeding them into on_finish). About the input string, yes, I saw you could add the question as an argument. But IMO it would be great if you could create a prompt :Llm "process_sel" which would:

gsuuon commented 9 months ago

That can be added in the prompt, there's a starter called 'instruct' that shows how to use vim.ui.input to get some input.

EDIT: just noticed that example is out of date, fixed to the current api (you can return a function from builder to do async things)
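
Something like this sketch of the async builder pattern (the callback parameter name here is just illustrative - see the instruct starter for the real thing):

  -- Sketch: builder returns a function; the plugin calls it with a callback
  -- used to resolve the params once vim.ui.input completes.
  builder = function(input, context)
    return function(build)
      vim.ui.input({ prompt = 'Instruction: ' }, function(user_input)
        build({
          prompt = (user_input or '') .. '\n' .. input
        })
      end)
    end
  end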

JoseConseco commented 9 months ago

I think the above 'instruct' would have to be remade into the llama format:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
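
As a rough sketch, filling that template in Lua would be something like this (I assume llamacpp.llama_2_format does roughly the same):

  -- Sketch: interpolate a system prompt and user message into the llama-2
  -- instruct template above.
  local function llama2_prompt(system_prompt, user_message)
    return string.format(
      '<s>[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]',
      system_prompt,
      user_message
    )
  end

  -- llama2_prompt('You are a helpful coding assistant.', 'Write fizzbuzz in Lua')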

In any case, I made a PR. Tomorrow I can make some fixes if needed.

JoseConseco commented 9 months ago

@gsuuon is there a way to output the AI message to a popup window?

https://github.com/gsuuon/llm.nvim/assets/13521338/f923bb23-6374-4cfd-88e7-809613e9b440

The issue I have now: llama will output code with comments, and I can't seem to force it to output pure code only. It would help if the output was written to a new popup window, where the user could copy the code, close the popup, and paste it into place... I was also wondering: maybe :LlmDelete should be renamed :LlmUndo, since this is what it is for.

gsuuon commented 9 months ago

@JoseConseco moved to https://github.com/gsuuon/llm.nvim/discussions/15