jlonge4 / local_llama

This repo showcases how you can run a model locally and offline, free of OpenAI dependencies.

ValueError: Requested tokens exceed context window of 512 #3

Closed · 94bb494nd41f closed this issue 1 year ago

94bb494nd41f commented 1 year ago

I tried using this on a paper (DOI 10.1159/000346379), but asking "what is dialysis?" instantly crashes it. I am using wizardLM-7B.ggmlv3.q4_0.bin.

python -m streamlit run local_llama.py

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://192.168.178.82:8501

Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\wizardLM-7B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama_tokenize: too many tokens
2023-05-24 21:55:06.995 Uncaught app exception
Traceback (most recent call last):
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "C:\Users\derdi\local_llama\local_llama.py", line 139, in <module>
    query_index(query_u=user_input)
  File "C:\Users\derdi\local_llama\local_llama.py", line 85, in query_index
    response = query_engine.query(query_u)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\base.py", line 18, in query
    return self._query(str_or_query_bundle)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\query_engine\retriever_query_engine.py", line 145, in _query
    response = self._response_synthesizer.synthesize(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\response_synthesis.py", line 163, in synthesize
    response_str = self._response_builder.get_response(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\compact_and_refine.py", line 57, in get_response
    response = super().get_response(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\token_counter\token_counter.py", line 78, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 52, in get_response
    response = self._give_response_single(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 89, in _give_response_single
    ) = self._service_context.llm_predictor.predict(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 244, in predict
    llm_prediction = self._predict(prompt, **prompt_args)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 212, in _predict
    llm_prediction = retry_on_exceptions_with_backoff(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\utils.py", line 177, in retry_on_exceptions_with_backoff
    return lambda_fn()
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 213, in <lambda>
    lambda: llm_chain.predict(**full_prompt_args),
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 213, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 69, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 79, in generate
    return self.llm.generate_prompt(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 134, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 191, in generate
    raise e
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 185, in generate
    self._generate(prompts, stop=stop, run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 438, in _generate
    else self._call(prompt, stop=stop)
  File "C:\Users\derdi\local_llama\local_llama.py", line 39, in _call
    output = llm(f"Q: {prompt} A: ", max_tokens=256,
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1101, in __call__
    return self.create_completion(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1055, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 658, in _create_completion
    raise ValueError(
ValueError: Requested tokens exceed context window of 512
jlonge4 commented 1 year ago

@94bb494nd41f https://github.com/hwchase17/langchain/issues/3751#issuecomment-1528768744 I'm fairly certain this has to do with the small default context window. Add n_ctx=2048 to the args on line 37 and see if that helps.
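
For reference, a minimal sketch of that change (the actual line 37 of local_llama.py isn't shown in this thread, so the constructor call below is an assumption): pass n_ctx when building the llama-cpp-python model so the window grows from the default 512 to 2048.

from llama_cpp import Llama

# Assumed shape of the model setup; the key part is the explicit n_ctx.
llm = Llama(
    model_path="D:\\wizardLM-7B.ggmlv3.q4_0.bin",  # model path from the log above
    n_ctx=2048,                                    # context window, default is 512
)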

94bb494nd41f commented 1 year ago

It kind of helps, but as the conversation grows (two prompts in), it runs out of tokens again.

Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings:        load time = 87374.23 ms
llama_print_timings:      sample time =     3.64 ms /    13 runs   (    0.28 ms per token)
llama_print_timings: prompt eval time = 236331.20 ms /  1305 tokens (  181.10 ms per token)
llama_print_timings:        eval time =  9774.42 ms /    12 runs   (  814.54 ms per token)
llama_print_timings:       total time = 246713.24 ms
Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

llama_print_timings:        load time = 74870.13 ms
llama_print_timings:      sample time =     9.08 ms /    34 runs   (    0.27 ms per token)
llama_print_timings: prompt eval time = 267241.56 ms /  1586 tokens (  168.50 ms per token)
llama_print_timings:        eval time = 29863.17 ms /    33 runs   (  904.94 ms per token)
llama_print_timings:       total time = 298735.54 ms
Achieving high convective volumes in online HDF.pdf
llama.cpp: loading model from D:\GPT4All-13B-snoozy.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: mem required  = 9031.70 MB (+ 1608.00 MB per state)
.
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
2023-05-25 20:33:32.238 Uncaught app exception
Traceback (most recent call last):
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 565, in _run_script
    exec(code, module.__dict__)
  File "C:\Users\derdi\local_llama\local_llama.py", line 141, in <module>
    query_index(query_u=user_input)
  File "C:\Users\derdi\local_llama\local_llama.py", line 87, in query_index
    response = query_engine.query(query_u)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\base.py", line 18, in query
    return self._query(str_or_query_bundle)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\query_engine\retriever_query_engine.py", line 145, in _query
    response = self._response_synthesizer.synthesize(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\query\response_synthesis.py", line 163, in synthesize
    response_str = self._response_builder.get_response(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\compact_and_refine.py", line 57, in get_response
    response = super().get_response(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\token_counter\token_counter.py", line 78, in wrapped_llm_predict
    f_return_val = f(_self, *args, **kwargs)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 52, in get_response
    response = self._give_response_single(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\indices\response\refine.py", line 89, in _give_response_single
    ) = self._service_context.llm_predictor.predict(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 244, in predict
    llm_prediction = self._predict(prompt, **prompt_args)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 212, in _predict
    llm_prediction = retry_on_exceptions_with_backoff(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\utils.py", line 177, in retry_on_exceptions_with_backoff
    return lambda_fn()
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_index\llm_predictor\base.py", line 213, in <lambda>
    lambda: llm_chain.predict(**full_prompt_args),
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 213, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 140, in __call__
    raise e
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\base.py", line 134, in __call__
    self._call(inputs, run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 69, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\chains\llm.py", line 79, in generate
    return self.llm.generate_prompt(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 134, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 191, in generate
    raise e
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 185, in generate
    self._generate(prompts, stop=stop, run_manager=run_manager)
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\langchain\llms\base.py", line 438, in _generate
    else self._call(prompt, stop=stop)
  File "C:\Users\derdi\local_llama\local_llama.py", line 41, in _call
    output = llm(f"Q: {prompt} A: ", max_tokens=256,
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1101, in __call__
    return self.create_completion(
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 1055, in create_completion
    completion: Completion = next(completion_or_chunks)  # type: ignore
  File "C:\Users\derdi\.conda\envs\llama_local_pdf_stuff\lib\site-packages\llama_cpp\llama.py", line 658, in _create_completion
    raise ValueError(
ValueError: Requested tokens exceed context window of 2048
jlonge4 commented 1 year ago

@94bb494nd41f This will be a problem with 99% of models, no matter how large you make the context window with n_ctx, because the model's own limit comes into play. Even GPT-4 has a context window of only 8,192 tokens. It seems you are approaching the point of needing to fine-tune on your dataset rather than passing it all in as context, which is what overflows the window and triggers this error.
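
To make the failure mode concrete: llama.cpp raises this error whenever the tokenized prompt plus the requested max_tokens exceeds n_ctx, so a growing conversation will eventually overflow any fixed window. A rough pre-flight check (a hypothetical helper, not part of the repo) might look like this:

from llama_cpp import Llama

llm = Llama(model_path="D:\\GPT4All-13B-snoozy.ggmlv3.q4_0.bin", n_ctx=2048)

def fits_in_window(prompt: str, max_tokens: int = 256, n_ctx: int = 2048) -> bool:
    # The ValueError fires when len(prompt_tokens) + max_tokens > n_ctx.
    prompt_tokens = llm.tokenize(prompt.encode("utf-8"))
    return len(prompt_tokens) + max_tokens <= n_ctx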

yuvalkirstain commented 1 year ago

@jlonge4 The common practice in NLP is to remove the start of the prompt when you exceed the context length, rather than simply raising an error.

jlonge4 commented 1 year ago

@yuvalkirstain as in a single prompt?

yuvalkirstain commented 1 year ago

@jlonge4 If the context length is 2 and my token sequence is [t1 t2 t3], the common practice is to remove the prefix and continue with [t2 t3]. It does not matter whether it is a single prompt or a batch.
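
A minimal sketch of that left-truncation, assuming a llama_cpp.Llama instance named llm and a fixed generation budget (the helper and its parameters are illustrative, not part of the repo):

def truncate_left(llm, prompt: str, n_ctx: int = 2048, max_new_tokens: int = 256) -> str:
    # Keep only the most recent tokens so prompt + generation fit in the window.
    tokens = llm.tokenize(prompt.encode("utf-8"))
    budget = n_ctx - max_new_tokens
    if len(tokens) > budget:
        tokens = tokens[-budget:]  # drop the oldest (prefix) tokens
    return llm.detokenize(tokens).decode("utf-8", errors="ignore")

Calling something like this before llm(...) with max_tokens=max_new_tokens would keep older turns from overflowing the window, at the cost of the model forgetting the dropped prefix.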