phronmophobic / llama.clj

Run LLMs locally. A clojure wrapper for llama.cpp.
MIT License

Context size exceeded #13

Closed · hellonico closed this 1 month ago

hellonico commented 1 month ago

I get the following error when using llama.clj:

Answer to question 1:
There is a significant change in the levels of nitrogen dioxide over the past five years, with a generally downward trend. The level of nitrogen dio
Execution error (AssertionError) at com.phronemophobic.llama.raw/llama-eval* (raw.clj:284).
Assert failed: Context size exceeded
(< n-past (llama_n_ctx ctx))

Here is my setup:

(require '[com.phronemophobic.llama :as llama])

(def model "models/llama-2-7b-chat.ggmlv3.q4_0.bin")
(def ctx
  (llama/create-context model {:n-gpu-layers 1}))

(defn print-response
  [ctx prompt]
   (transduce
    (take-while (fn [_] (not (Thread/interrupted))))
    (completing (fn [_ s] (print s) (flush)))
    nil
    (llama/generate ctx prompt nil)))

I don't think I'm putting much data in the prompt (maybe 5-10 lines).
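For reference, one quick way to check how much of the context the prompt actually uses is to count its tokens (a minimal sketch using llutil/tokenize from com.phronemophobic.llama.util; note that generated tokens also count toward the context, so a short prompt can still hit the limit during a long response):

(require '[com.phronemophobic.llama.util :as llutil])

;; number of tokens the prompt occupies; generated tokens are added on top
;; of this, and their sum must stay below the context size
(count (llutil/tokenize ctx "my prompt text here"))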

hellonico commented 1 month ago

Actually, I get the same kind of error with:

;; https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q4_k_m.gguf?download=true
(def model "models/qwen2-0_5b-instruct-q4_k_m.gguf")
Please create these additional questions based on the given table and data
Execution error (AssertionError) at com.phronemophobic.llama.raw-gguf/llama-eval* (raw_gguf.clj:305).
Assert failed: Context size exceeded
(< n-past (llama_n_ctx ctx))
phronmophobic commented 1 month ago

I don't think it's a problem with the model. You can increase the context size; the maximum context size depends on your hardware and the model.

(def model "models/llama-2-7b-chat.ggmlv3.q4_0.bin")
(def ctx 
  (llama/create-context model {:n-gpu-layers 1 :n-ctx 2048}))

One option to avoid the error is to limit the generation to the context size:

(require '[com.phronemophobic.llama :as llama])
(require '[com.phronemophobic.llama.util :as llutil])

(def context-size 2048)

(def model-path "models/qwen2-0_5b-instruct-q4_k_m.gguf")
(def ctx (llama/create-context model-path {:n-ctx context-size}))

(defn my-generate
  "Returns a seqable/reducible sequence of strings generated from ctx with prompt,
  stopping before prompt tokens plus generated tokens exceed the context size."
  [ctx prompt]
  (let [prompt-token-count (count (llutil/tokenize ctx prompt))]
    (eduction
     (take (- context-size prompt-token-count))
     (llama/decode-token ctx)
     (llama/generate-tokens ctx prompt))))

(defn print-response
  [ctx prompt]
   (transduce
    (take-while (fn [_] (not (Thread/interrupted))))
    (completing (fn [_ s] (print s) (flush)))
    nil
    (my-generate ctx
                 (llama/chat-apply-template
                  ctx
                  [{:role "user"
                    :content prompt}]))))

(print-response ctx "Please tell me a long story.")

It's up to you to decide how to handle responses that exceed the context size. There are many different approaches to either increasing the effective context size or breaking the problem into smaller pieces to work within a smaller context. I've found the LocalLLama subreddit to be a good resource, https://www.reddit.com/r/LocalLLaMA/search/?q=context+size.
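As one illustration of the "smaller pieces" approach, here is a minimal sketch (not part of llama.clj; chunk-token-budget, chunk-by-tokens, and summarize-chunks are hypothetical helpers built on llutil/tokenize and llama/generate-string) that splits a long input into chunks that fit within a token budget and summarizes each chunk in its own call:

(require '[clojure.string :as str]
         '[com.phronemophobic.llama :as llama]
         '[com.phronemophobic.llama.util :as llutil])

;; hypothetical budget, chosen to leave room for the instruction and the response
(def chunk-token-budget 1024)

(defn chunk-by-tokens
  "Greedily packs paragraphs into chunks whose token counts stay under chunk-token-budget.
  A single paragraph larger than the budget still becomes its own (oversized) chunk."
  [ctx text]
  (->> (str/split text #"\n\n+")
       (reduce (fn [chunks para]
                 (let [cur (peek chunks)
                       candidate (if (str/blank? cur)
                                   para
                                   (str cur "\n\n" para))]
                   (if (<= (count (llutil/tokenize ctx candidate)) chunk-token-budget)
                     (conj (pop chunks) candidate)
                     (conj chunks para))))
               [""])
       (remove str/blank?)
       vec))

(defn summarize-chunks
  "Runs one generation per chunk so no single call exceeds the context size."
  [ctx text]
  (mapv (fn [chunk]
          (llama/generate-string ctx (str "Summarize the following text:\n\n" chunk)))
        (chunk-by-tokens ctx text)))

Each chunk is processed independently here; combining the per-chunk summaries (for example with a final summarization pass) is left to the application.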

hellonico commented 1 month ago

Changing the context size did the trick indeed.

(def ctx 
  (llama/create-context model {:n-gpu-layers 1 :n-ctx 2048}))

Maybe changing the default would be a good idea for current models? Or an error message pointing at the context size as the culprit would be great.

Thank you!

phronmophobic commented 1 month ago

While I agree that it would probably make sense to have a higher default context size, I'm also wary of changing the behavior of the wrapped library in subtle ways. Advanced usage of llama.clj may follow documentation or advice from the underlying llama.cpp, and it can be very frustrating when an example doesn't work because of behavior changes introduced by the wrapper itself. As a user, you then have to check the original project and cross-reference the wrapper implementation to figure out how everything works. A design goal is that, when practical, any technique that works for llama.cpp can be applied to llama.clj in a straightforward way.

Or an error message pointing at the context size being the culprit would be great?

It probably makes sense to provide some guidance, somewhere, about context sizes. I'm wary of pointing at the context size as the "culprit" for the exception, since there are trade-offs to increasing it, and it might already be at its maximum.

Maybe a mention in the "Getting Started" docs or in an FAQ?

hellonico commented 1 month ago

Thanks for the details! I see.

Up to now I was using LangChain or LlamaIndex (Python world).

LlamaIndex has a default context window of 3900 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/constants.py), and this default is overridden when loading the model, since the code has access to extra metadata on the model.

From this link (which, admittedly, is not the best source of truth): https://www.reddit.com/r/LocalLLaMA/comments/16oae0h/how_do_i_find_out_the_context_size_of_a_model/?share_id=J2IDHafqd6him-e518jwv

The context window of most recent models seems to be at least 4096.

So, yes, I had no idea where to look or why my generation code was suddenly failing.

Maybe a mention in the "Getting Started" docs or in an FAQ?

So yes, making it super obvious for people like me that the default context window is not enough for most current models would be very welcome indeed!

phronmophobic commented 1 month ago

OK, I updated the Getting Started docs and the docstring for create-context with more info about context sizes.

hellonico commented 1 month ago

Great! Thank you!

phronmophobic commented 1 month ago

Thanks for the feedback!