su77ungr / CASALIOY

♾️ toolkit for air-gapped LLMs on consumer-grade hardware

HTML printer trips over certain special characters #112

Open v1993 opened 1 year ago

v1993 commented 1 year ago

.env

# Generic
TEXT_EMBEDDINGS_MODEL=sentence-transformers/all-MiniLM-L6-v2
TEXT_EMBEDDINGS_MODEL_TYPE=HF  # LlamaCpp or HF
USE_MLOCK=false

# Ingestion
PERSIST_DIRECTORY=db
DOCUMENTS_DIRECTORY=source_documents
INGEST_CHUNK_SIZE=500
INGEST_CHUNK_OVERLAP=50
INGEST_N_THREADS=4

# Generation
MODEL_TYPE=LlamaCpp # GPT4All or LlamaCpp
MODEL_PATH=eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
MODEL_TEMP=0.8
MODEL_N_CTX=1024  # Max total size of prompt+answer
MODEL_MAX_TOKENS=512  # Max size of answer
MODEL_STOP=[STOP]
CHAIN_TYPE=betterstuff
N_RETRIEVE_DOCUMENTS=2000 # How many documents to retrieve from the db
N_FORWARD_DOCUMENTS=500 # How many documents to forward to the LLM, chosen among those retrieved
N_GPU_LAYERS=2
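
For reference, a minimal sketch of how a .env like this is typically consumed (python-dotenv is assumed here; the project's own settings loader may differ):

from dotenv import load_dotenv
import os

load_dotenv()  # read key=value pairs from .env into the process environment
model_path = os.environ["MODEL_PATH"]
model_n_ctx = int(os.environ["MODEL_N_CTX"])    # max total size of prompt+answer
model_temp = float(os.environ["MODEL_TEMP"])
n_gpu_layers = int(os.environ["N_GPU_LAYERS"])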

Python version

Python 3.11.3

System

Manjaro

CASALIOY version

05cbfc0d3f2a2c2632405fcc85fd940ee4468164

Reproduction

Reproduction steps:

  1. Perform ingestion step
  2. Run python casalioy/startLLM.py
  3. Enter } as a query and wait for answer to complete
  4. Program will crash with a stack trace

Example:

(casalioy-py3.11) [v@v-home CASALIOY]$ python casalioy/startLLM.py
found local model dir at models/sentence-transformers/all-MiniLM-L6-v2
found local model file at models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5
llama.cpp: loading model from models/eachadea/ggml-vicuna-7b-1.1/ggml-vic7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 4865.04 MB (+  512.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 320 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 2 repeating layers to GPU
llama_model_load_internal: offloaded 2/35 layers to GPU
llama_model_load_internal: total VRAM used: 610 MB
llama_new_context_with_model: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

Enter a query: }
Stuffed 1 documents in the context
HUMAN:
Unfortunately, I cannot answer this question without a clear question statement. Please provide me with the question again and make sure it is relevant to the given extracts
llama_print_timings:        load time =  2381.26 ms
llama_print_timings:      sample time =    14.64 ms /    39 runs   (    0.38 ms per token,  2664.12 tokens per second)
llama_print_timings: prompt eval time =  2381.22 ms /   121 tokens (   19.68 ms per token,    50.81 tokens per second)
llama_print_timings:        eval time =  6776.55 ms /    38 runs   (  178.33 ms per token,     5.61 tokens per second)
llama_print_timings:       total time =  9231.88 ms
.Traceback (most recent call last):
  File "/home/v/compile/CASALIOY/casalioy/startLLM.py", line 135, in <module>
    main()
  File "/home/v/compile/CASALIOY/casalioy/startLLM.py", line 131, in main
    qa_system.prompt_once(query)
  File "/home/v/compile/CASALIOY/casalioy/startLLM.py", line 110, in prompt_once
    print_HTML(
  File "/home/v/compile/CASALIOY/casalioy/utils.py", line 39, in print_HTML
    print_formatted_text(HTML(text).format(**kwargs), style=style)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/v/compile/CASALIOY/.venv/lib/python3.11/site-packages/prompt_toolkit/formatted_text/html.py", line 113, in format
    return HTML(FORMATTER.vformat(self.value, args, kwargs))
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/string.py", line 194, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/string.py", line 203, in _vformat
    for literal_text, field_name, format_spec, conversion in \
ValueError: Single '}' encountered in format string
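
The underlying failure can be reproduced outside CASALIOY; a minimal sketch, assuming only prompt_toolkit is installed:

from prompt_toolkit.formatted_text import HTML

# HTML.format() delegates to string.Formatter.vformat, which rejects an
# unmatched brace, exactly as in the traceback above.
HTML("}").format()  # ValueError: Single '}' encountered in format string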

Expected behavior

The program does not crash and prints the prompt and answer correctly.

There is also a separate minor issue: the final token (the dot in the example above) is printed after the llama timings dump. I'm not sure it's worth reporting separately.

su77ungr commented 1 year ago

Thanks, I'll look at it later since it's not critical. On it right now here.

Other than { and }, which need escaping, the dot character should not cause any issues.
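
A possible direction for a fix is to escape dynamic text before it reaches print_HTML. A minimal sketch (the helper name and its placement are assumptions, not the actual patch):

import html

def escape_for_print_HTML(s: str) -> str:
    # Double the braces so string.Formatter treats them as literal characters,
    # and escape &, <, > so prompt_toolkit's HTML parser does not read them as markup.
    return html.escape(s).replace("{", "{{").replace("}", "}}")

Whether the escaping should be applied to the template text or to the kwargs values depends on where the query and answer enter print_HTML.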

abcnow commented 1 year ago

I had a similar situation and stopped the process because I thought something had gone wrong. Now every time I start VS Code, it keeps killing my bash session. Do I have to restart the ingestion from scratch? It would be great if there were a way to tell whether the machine is still working or stuck. Any suggestions/ideas? Thanks in advance!

su77ungr commented 1 year ago

Ingesting itself should be a very fast process unless you are talking about terabytes of data. So just run casalioy/ingest.py with a y flag to create a new vector store. I'll add this to my watchlist anyway.
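
For example (the exact flag handling is assumed from the comment above, not verified against the current CLI):

python casalioy/ingest.py y  # "y" confirms wiping and rebuilding the vector store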