phronmophobic / llama.clj

Run LLMs locally. A clojure wrapper for llama.cpp.
MIT License

Context size exceeded #13

Closed · hellonico closed this 1 month ago

hellonico commented 1 month ago

I get the following error when using llama.clj:

Answer to question 1:
There is a significant change in the levels of nitrogen dioxide over the past five years, with a generally downward trend. The level of nitrogen dio
Execution error (AssertionError) at com.phronemophobic.llama.raw/llama-eval* (raw.clj:284).
Assert failed: Context size exceeded
(< n-past (llama_n_ctx ctx))

Here is my setup:

(require '[com.phronemophobic.llama :as llama])

(def model "models/llama-2-7b-chat.ggmlv3.q4_0.bin")
(def ctx
  (llama/create-context model {:n-gpu-layers 1}))

(defn print-response
  [ctx prompt]
   (transduce
    (take-while (fn [_] (not (Thread/interrupted))))
    (completing (fn [_ s] (print s) (flush)))
    nil
    (llama/generate ctx prompt nil)))

I don't think I'm putting much data in the prompt (maybe 5-10 lines).
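For reference, one quick way to check how much of the context the prompt actually uses is to count its tokens (a minimal sketch using llutil/tokenize from com.phronemophobic.llama.util; note that generated tokens also count toward the context, so a short prompt can still hit the limit during a long response):

(require '[com.phronemophobic.llama.util :as llutil])

;; number of tokens the prompt occupies; generated tokens are added on top
;; of this, and their sum must stay below the context size
(count (llutil/tokenize ctx "my prompt text here"))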

hellonico commented 1 month ago

Actually, I get the same kind of error with:

;; https://huggingface.co/Qwen/Qwen2-0.5B-Instruct-GGUF/resolve/main/qwen2-0_5b-instruct-q4_k_m.gguf?download=true
(def model "models/qwen2-0_5b-instruct-q4_k_m.gguf")
Please create these additional questions based on the given table and data
Execution error (AssertionError) at com.phronemophobic.llama.raw-gguf/llama-eval* (raw_gguf.clj:305).
Assert failed: Context size exceeded
(< n-past (llama_n_ctx ctx))
phronmophobic commented 1 month ago

I don't think it's a problem with the model. You can increase the context size; the maximum context size depends on your hardware and the model.

(def model "models/llama-2-7b-chat.ggmlv3.q4_0.bin")
(def ctx 
  (llama/create-context model {:n-gpu-layers 1 :n-ctx 2048}))

One option to avoid the error is to limit the generation to the context size:

(require '[com.phronemophobic.llama :as llama])
(require '[com.phronemophobic.llama.util :as llutil])

(def context-size 2048)

(def model-path "models/qwen2-0_5b-instruct-q4_k_m.gguf")
(def ctx (llama/create-context model-path {:n-ctx context-size}))

(defn my-generate
  "Returns a seqable/reducible sequence of strings generated from ctx with prompt,
  stopping before prompt tokens plus generated tokens exceed the context size."
  [ctx prompt]
  (let [prompt-token-count (count (llutil/tokenize ctx prompt))]
    (eduction
     (take (- context-size prompt-token-count))
     (llama/decode-token ctx)
     (llama/generate-tokens ctx prompt))))

(defn print-response
  [ctx prompt]
   (transduce
    (take-while (fn [_] (not (Thread/interrupted))))
    (completing (fn [_ s] (print s) (flush)))
    nil
    (my-generate ctx
                 (llama/chat-apply-template
                  ctx
                  [{:role "user"
                    :content prompt}]))))

(print-response ctx "Please tell me a long story.")

It's up to you to decide how to handle responses that exceed the context size. There are many different approaches to either increasing the effective context size or breaking the problem into smaller pieces to work within a smaller context. I've found the LocalLLama subreddit to be a good resource, https://www.reddit.com/r/LocalLLaMA/search/?q=context+size.
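As one illustration of the "smaller pieces" approach, here is a minimal sketch (not part of llama.clj; chunk-token-budget, chunk-by-tokens, and summarize-chunks are hypothetical helpers built on llutil/tokenize and llama/generate-string) that splits a long input into chunks that fit within a token budget and summarizes each chunk in its own call:

(require '[clojure.string :as str]
         '[com.phronemophobic.llama :as llama]
         '[com.phronemophobic.llama.util :as llutil])

;; hypothetical budget, chosen to leave room for the instruction and the response
(def chunk-token-budget 1024)

(defn chunk-by-tokens
  "Greedily packs paragraphs into chunks whose token counts stay under chunk-token-budget.
  A single paragraph larger than the budget still becomes its own (oversized) chunk."
  [ctx text]
  (->> (str/split text #"\n\n+")
       (reduce (fn [chunks para]
                 (let [cur (peek chunks)
                       candidate (if (str/blank? cur)
                                   para
                                   (str cur "\n\n" para))]
                   (if (<= (count (llutil/tokenize ctx candidate)) chunk-token-budget)
                     (conj (pop chunks) candidate)
                     (conj chunks para))))
               [""])
       (remove str/blank?)
       vec))

(defn summarize-chunks
  "Runs one generation per chunk so no single call exceeds the context size."
  [ctx text]
  (mapv (fn [chunk]
          (llama/generate-string ctx (str "Summarize the following text:\n\n" chunk)))
        (chunk-by-tokens ctx text)))

Each chunk is processed independently here; combining the per-chunk summaries (for example with a final summarization pass) is left to the application.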

hellonico commented 1 month ago

Changing the context size did the trick indeed.

(def ctx 
  (llama/create-context model {:n-gpu-layers 1 :n-ctx 2048}))

Maybe changing the default would be a good idea for current models? Or an error message pointing at the context size as the culprit would be great.

Thank you!

phronmophobic commented 1 month ago

While I agree that it would probably make sense to have a higher default context size, I'm also wary of changing the behavior of the wrapped library in subtle ways. Advanced usage of llama.clj may follow documentation or advice from the underlying llama.cpp, and it can be very frustrating when an example doesn't work because of behavior changes introduced by the wrapper itself. As a user, you then have to check the original project and cross-reference the wrapper implementation to figure out how everything works. A design goal is that, when practical, any technique that works for llama.cpp can be applied to llama.clj in a straightforward way.

Or an error message pointing at the context size being the culprit would be great?

It probably makes sense to provide some guidance, somewhere, about context sizes. I'm wary of pointing at the context size as the "culprit" for the exception, since there are trade-offs to increasing it, and it might already be at its maximum.

Maybe a mention in the "Getting Started" docs or in an FAQ?

hellonico commented 1 month ago

Thanks for the details! I see.

Up to now I was using LangChain or LlamaIndex (Python world).

LlamaIndex has a default context window of 3900 (https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/constants.py), and this default is overridden when loading the model, since the code has access to extra metadata on the model.

From this link (which, admittedly, is not the best source of truth): https://www.reddit.com/r/LocalLLaMA/comments/16oae0h/how_do_i_find_out_the_context_size_of_a_model/?share_id=J2IDHafqd6him-e518jwv

The context window of most recent models seems to be at least 4096.

So, yes, I had no idea where to look or why my generation code was suddenly failing.

Maybe a mention in the "Getting Started" docs or in an FAQ?

So yes, making it super obvious for people like me that the default context window is not enough for most current models would be very welcome indeed!

phronmophobic commented 1 month ago

OK, I updated the Getting Started docs and the docstring for create-context with more info about context sizes.

hellonico commented 1 month ago

Great! Thank you!

phronmophobic commented 1 month ago

Thanks for the feedback!