@94bb494nd41f https://github.com/hwchase17/langchain/issues/3751#issuecomment-1528768744 Fairly certain it has to do with the small context window. Add n_ctx=2048 to the args on line 37 and see if that helps.
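For reference, a minimal sketch of what that change might look like, assuming the script constructs the model via llama_cpp.Llama; the variable name and other arguments are placeholders, not the actual contents of local_llama.py line 37:

```python
from llama_cpp import Llama

# Placeholder construction; only n_ctx is the point here.
llm = Llama(
    model_path="D:\\GPT4All-13B-snoozy.ggmlv3.q4_0.bin",
    n_ctx=2048,  # raise the context window (llama-cpp-python defaulted to 512 tokens)
)
```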
It kind of helps, but as the conversation grows (two prompts) it runs out of tokens again.
Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 87374.23 ms
llama_print_timings: sample time = 3.64 ms / 13 runs ( 0.28 ms per token)
llama_print_timings: prompt eval time = 236331.20 ms / 1305 tokens ( 181.10 ms per token)
llama_print_timings: eval time = 9774.42 ms / 12 runs ( 814.54 ms per token)
llama_print_timings: total time = 246713.24 ms
Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_print_timings: load time = 74870.13 ms
llama_print_timings: sample time = 9.08 ms / 34 runs ( 0.27 ms per token)
llama_print_timings: prompt eval time = 267241.56 ms / 1586 tokens ( 168.50 ms per token)
llama_print_timings: eval time = 29863.17 ms / 33 runs ( 904.94 ms per token)
llama_print_timings: total time = 298735.54 ms
Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: mem required = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-05-25 20:33:32.238 Uncaught app exception
Traceback (most recent call last):
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
exec(code, module.__dict__)
File "C:\Users\derdi\local_llama\local_llama.py", line 141, in <module>
query_index(query_u=user_input)
File "C:\Users\derdi\local_llama\local_llama.py", line 87, in query_index
response = query_engine.query(query_u)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\base.py", line 18, in query
return self._query(str_or_query_bundle)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\query_engine\retriever_query_engine.py", line 145, in _query
response = self._response_synthesizer.synthesize(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\response_synthesis.py", line 163, in synthesize
response_str = self._response_builder.get_response(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\compact_and_refine.py", line 57, in get_response
response = super().get_response(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\token_counter\token_counter.py", line 78, in wrapped_llm_predict
f_return_val = f(_self, *args, **kwargs)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 52, in get_response
response = self._give_response_single(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 89, in _give_response_single
) = self._service_context.llm_predictor.predict(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 244, in predict
llm_prediction = self._predict(prompt, **prompt_args)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 212, in _predict
llm_prediction = retry_on_exceptions_with_backoff(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\utils.py", line 177, in retry_on_exceptions_with_backoff
return lambda_fn()
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 213, in <lambda>
lambda: llm_chain.predict(**full_prompt_args),
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 213, in predict
return self(kwargs, callbacks=callbacks)[self.output_key]
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 140, in __call__
raise e
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 134, in __call__
self._call(inputs, run_manager=run_manager)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 69, in _call
response = self.generate([inputs], run_manager=run_manager)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 79, in generate
return self.llm.generate_prompt(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 134, in generate_prompt
return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 191, in generate
raise e
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 185, in generate
self._generate(prompts, stop=stop, run_manager=run_manager)
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 438, in _generate
else self._call(prompt, stop=stop)
File "C:\Users\derdi\local_llama\local_llama.py", line 41, in _call
output = llm(f"Q: {prompt} A: ", max_tokens=256,
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1101, in __call__
return self.create_completion(
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1055, in create_completion
completion: Completion = next(completion_or_chunks) # type: ignore
File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 658, in _create_completion
raise ValueError(
ValueError: Requested tokens exceed context window of 2048
@94bb494nd41f This will be a problem with 99% of models no matter how large you make the context window using n_ctx; the model's own limitation comes into play. Even GPT-4 has a context window of only 8,192 tokens. It seems you are approaching the point where you need to fine-tune on your dataset rather than pass it in as context, so that you stop exceeding the window and hitting this error.
@jlonge4 The common practice in NLP is to remove the start of the prompt when the context length is exceeded, rather than simply raising an error.
@yuvalkirstain as in a single prompt?
@jlonge4 If the context length is 2 and my token sequence is [t1 t2 t3], the common practice is to remove the prefix and continue with [t2 t3]. It does not matter whether it is a single prompt or a batch.
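For illustration, a minimal sketch of that kind of left-truncation around the llama-cpp-python call in local_llama.py; the helper name, n_ctx, and max_tokens values here are assumptions, not part of the actual script:

```python
# Hypothetical helper: drop the oldest tokens so prompt + generation fit in n_ctx.
# A production version would also preserve special tokens such as BOS.
def truncate_prompt(llm, prompt: str, n_ctx: int = 2048, max_tokens: int = 256) -> str:
    tokens = llm.tokenize(prompt.encode("utf-8"))  # llama_cpp.Llama.tokenize takes bytes
    budget = n_ctx - max_tokens                    # leave room for the generated answer
    if len(tokens) > budget:
        tokens = tokens[-budget:]                  # keep only the most recent tokens
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")

# e.g. output = llm(f"Q: {truncate_prompt(llm, prompt)} A: ", max_tokens=256)
```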
I tried using this on a paper (10.1159/000346379), but asking "what is dialysis?" instantly crashes. I am using wizardLM-7B.ggmlv3.q4_0.bin.