thomasantony / llamacpp-python

Python bindings for llama.cpp

Segmentation fault for generations larger than ~512 tokens #13

Open horenbergerb opened 1 year ago

horenbergerb commented 1 year ago

Running on Ubuntu with 32GB RAM. I get a segmentation fault when running the following code:

import sys
import llamacpp

def progress_callback(progress):
    print("Progress: {:.2f}%".format(progress * 100))
    sys.stdout.flush()

params = llamacpp.InferenceParams.default_with_callback(progress_callback)
params.path_model = '/home/captdishwasher/horenbergerb/llama/llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin'
model = llamacpp.LlamaInference(params)

prompt = "1"*500
prompt_tokens = model.tokenize(prompt, True)
print('Prompt tokens: {}'.format(len(prompt_tokens)))
model.add_bos()
model.update_input(prompt_tokens)

model.ingest_all_pending_input()
print(model.system_info())
for i in range(20):
    model.eval()
    token = model.sample()
    text = model.token_to_str(token)
    print(text, end="", flush=True)

# Flush stdout
sys.stdout.flush()

model.print_timings()

Output:

...
Prompt tokens: 501
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
111111101111Segmentation fault (core dumped)

Possibly related to the context size? The number 512 matches the default n_ctx, but raising n_ctx didn't fix the problem... This has been coming up for users of text-generation-webui, which uses this package: https://github.com/oobabooga/text-generation-webui/issues/690

horenbergerb commented 1 year ago

Oh, strange... Updating the context like this:

params = llamacpp.InferenceParams.default_with_callback(progress_callback)
params.path_model = '/home/captdishwasher/horenbergerb/llama/llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin'
params.n_ctx = 2048
model = llamacpp.LlamaInference(params)

did not change the context size reported in the output logs:

(textgen) captdishwasher@captainofthedishwasher-MS-7D43:~/horenbergerb/llamacpp-python$ python crash_example.py 
llama_model_load: loading model from '/home/captdishwasher/horenbergerb/llama/llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 6656
...

So maybe raising n_ctx would fix the problem if it could propagate properly. EDIT: raising n_ctx pushes the problem further out, but the threat of a segfault remains once your prompt gets bigger than n_ctx.

horenbergerb commented 1 year ago

Here's some relevant code in llama.cpp. This seems to be where the trick behind "infinite generation via context swapping" is revealed.
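
For anyone landing here, the swap amounts to roughly the following bookkeeping. This is a self-contained Python sketch of the logic in llama.cpp's main example, with the model calls stubbed out; none of these names come from the bindings:

# Sketch of the "context swap" trick from llama.cpp's main example.
# Only the bookkeeping is real; the model calls are stand-ins.
N_CTX = 512    # model context window
N_KEEP = 48    # prompt tokens that are never evicted

def fake_eval(tokens, n_past):
    # Stand-in for the real eval call: feed `tokens` to the model with
    # `n_past` tokens already in its context.
    pass

def fake_sample():
    # Stand-in for sampling: return a dummy token id.
    return 0

prompt = list(range(500))   # pretend these are prompt token ids
pending = prompt[:]         # tokens waiting to be evaluated
history = []                # every token evaluated so far
n_past = 0

for _ in range(200):        # generate 200 tokens, sailing past N_CTX
    if n_past + len(pending) > N_CTX:
        # Context is full: keep the first N_KEEP tokens and re-feed the
        # last half of everything after them, then keep generating.
        n_left = n_past - N_KEEP
        pending = history[-(n_left // 2):] + pending
        n_past = N_KEEP
    fake_eval(pending, n_past)
    n_past += len(pending)
    history.extend(pending)
    pending = [fake_sample()]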

thomasantony commented 1 year ago

@horenbergerb I didn't add anything for the "infinite generation" behavior in the LlamaInference wrapper. It is possible that there is something in the underlying code that assumes that you won't exceed the context size.
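
Until something like that exists in the bindings, a caller-side workaround is to stop generating before the context fills up. A rough sketch, assuming the repro script above (so model and prompt_tokens already exist) and that each eval()/sample() pair adds one token to the context:

# Cap generation so prompt + BOS + generated tokens never exceed n_ctx.
n_ctx = 512                                  # must match the loaded model's n_ctx
budget = n_ctx - len(prompt_tokens) - 1      # minus 1 for the BOS token
for i in range(min(20, budget)):
    model.eval()
    token = model.sample()
    print(model.token_to_str(token), end="", flush=True)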

thomasantony commented 1 year ago

@horenbergerb Have you tried this recently? Right now the bindings still fail if you exceed the context size. However, you can now set the context size using params.
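
So something like this should work now (same pattern as the snippet above; the path is a placeholder):

params = llamacpp.InferenceParams.default_with_callback(progress_callback)
params.path_model = './models/ggml-model-q4_0-ggjt.bin'  # placeholder path
params.n_ctx = 2048   # should now be respected by llama_model_load
model = llamacpp.LlamaInference(params)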