Closed: vulcano9 closed this issue 9 months ago.
Hi @vulcano9
The output size can be controlled by the `max_tokens` parameter in the config. You can try changing it in the notebook here - https://github.com/snexus/llm-search/blob/d0f756df9fae8ec8786550b0fdcd94c8306f5589/notebooks/llmsearch_google_colab_demo.ipynb#L103
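For orientation, the relevant part of that config cell looks roughly like the sketch below. This is a minimal sketch, not the authoritative schema - the key names mirror the repo's sample llamacpp configs, so check the linked cell for the exact layout and values:

```yaml
llm:
  type: llamacpp
  params:
    model_path: /content/llm/models/airoboros-l2-13b-gpt4-1.4.1.Q4_K_M.gguf
    model_init_params:      # passed to llama.cpp when the model is loaded
      n_ctx: 1024           # context window; raising it also raises VRAM use
      n_batch: 512
      n_gpu_layers: 43      # layers offloaded to the GPU
    model_kwargs:           # passed at generation time
      max_tokens: 512       # raise this if responses come back truncated
```

Note that the prompt and the answer share the same `n_ctx` window: in your log the prompt already takes 880 of the 1024 tokens, leaving roughly 144 tokens for the answer, which matches the truncation you see. So you may need to raise `n_ctx` together with `max_tokens`.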
The config in the notebook is just an example and can be tweaked to fit your use case and model. One limitation, though, is the amount of GPU memory on the free Google Colab tier: keep an eye on the available GPU memory and only increase the parameter as far as the memory allows.
Another option might be to use a smaller model on the free Google Colab.
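To see how much headroom you have before raising those parameters, you can check the free GPU memory from a Colab cell. A minimal sketch, assuming PyTorch (which the demo runtime already includes):

```python
import torch

# Free and total memory on the default CUDA device, in bytes.
free, total = torch.cuda.mem_get_info()
print(f"GPU memory: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")
```

Running `!nvidia-smi` in a cell shows the same numbers, plus what the loaded model is currently using.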
Hello @snexus,
Thank you very much for creating this project!
I am using Google Colab with the provided template and have ingested 20 PDF documents. The embeddings were generated without any problems and I can query the LLM, but the response is always truncated, or there seems to be an error while generating the response (see the log below).
Thank you very much for your help!
Here is an example (copied from Google Colab):

```
2024-01-08 13:53:35.106 | INFO | llmsearch.config:validate_params:165 - Loading model paramaters in configuration class LlamaModelConfig
2024-01-08 13:53:35.106 | INFO | llmsearch.utils:set_cache_folder:43 - Setting SENTENCE_TRANSFORMERS_HOME folder: /content/llm/cache
2024-01-08 13:53:35.106 | INFO | llmsearch.utils:set_cache_folder:44 - Setting TRANSFORMERS_CACHE folder: /content/llm/cache/transformers
2024-01-08 13:53:35.106 | INFO | llmsearch.utils:set_cache_folder:45 - Setting HF_HOME: /content/llm/cache/hf_home
2024-01-08 13:53:35.106 | INFO | llmsearch.utils:set_cache_folder:46 - Setting MODELS_CACHE_FOLDER: /content/llm/cache
2024-01-08 13:53:35.106 | INFO | llmsearch.models.llama:model:134 - Loading model...
2024-01-08 13:53:35.107 | INFO | llmsearch.models.llama:model:137 - Initializing LLAmaCPP model...
2024-01-08 13:53:35.107 | INFO | llmsearch.models.llama:model:138 - {'n_ctx': 1024, 'n_batch': 512, 'n_gpu_layers': 43}
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5, VMM: yes
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /content/llm/models/airoboros-l2-13b-gpt4-1.4.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q4_K: 241 tensors
llama_model_loader: - type q6_K: 41 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used = 88.03 MiB
llm_load_tensors: VRAM used = 7412.96 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
...................................................................................................
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: KV self size = 800.00 MiB, K (f16): 400.00 MiB, V (f16): 400.00 MiB
llama_build_graph: non-view tensors processed: 844/844
llama_new_context_with_model: compute buffer total size = 115.19 MiB
llama_new_context_with_model: VRAM scratch buffer: 112.00 MiB
llama_new_context_with_model: total VRAM used: 7524.96 MiB (model: 7412.96 MiB, context: 112.00 MiB)
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2024-01-08 13:53:58.464 | INFO | llmsearch.embeddings:get_embedding_model:65 - Embedding model config: type=<EmbeddingModelType.instruct: 'instruct'> model_name='hkunlp/instructor-large' additional_kwargs={}
load INSTRUCTOR_Transformer
2024-01-08 13:53:59.968343: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-08 13:53:59.968395: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-08 13:53:59.975660: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-08 13:54:01.951666: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
max_seq_length 512
2024-01-08 13:54:25.921 | INFO | llmsearch.ranking:init:39 - Initialized BGE-base Reranker
2024-01-08 13:54:29.218 | INFO | llmsearch.splade:init:33 - Setting device to cuda:0
2024-01-08 13:54:35.700 | INFO | llmsearch.splade:load:100 - SPLADE: Got 0 labels.
2024-01-08 13:54:35.700 | INFO | llmsearch.splade:load:104 - Loaded sparse (SPLADE) embeddings from /content/llm/embeddings/splade/splade_embeddings.npz
2024-01-08 13:54:35.700 | INFO | llmsearch.utils:get_hyde_chain:110 - Creating HyDE chain...
2024-01-08 13:54:35.701 | INFO | llmsearch.utils:get_multiquery_chain:117 - Creating MultiQUery chain...
ENTER QUESTION >> Could you provide me with some of the best methods for effectively marketing a product
2024-01-08 13:54:39.989 | DEBUG | llmsearch.ranking:get_relevant_documents:84 - Evaluating query: Could you provide me with some of the best methods for effectively marketing a product
2024-01-08 13:54:39.990 | INFO | llmsearch.splade:query:208 - SPLADE search will search over all documents of chunk size: 1024. Number of docs: 1519
[0.00914291 0.0224314 0.01972341 ... 0.02577811 0.03467241 0.02387246]
2024-01-08 13:54:42.806 | INFO | llmsearch.ranking:get_relevant_documents:92 - Stage 1: Got 15 documents.
2024-01-08 13:54:42.806 | INFO | llmsearch.ranking:get_relevant_documents:104 - Dense embeddings filter: None
2024-01-08 13:54:44.367 | DEBUG | llmsearch.ranking:get_relevant_documents:113 - NUMBER OF NEW DOCS to RETRIEVE: 25
2024-01-08 13:54:44.382 | INFO | llmsearch.ranking:rerank:51 - Reranking documents ...
2024-01-08 13:54:44.382 | INFO | llmsearch.ranking:get_scores:42 - Reranking documents ...
[-4.017726898193359, -5.740996360778809, -3.4490861892700195, -6.219937801361084, -0.5473086833953857, -7.063520431518555, -6.515655994415283, -9.21086311340332, -6.386524677276611, -7.356011390686035, -6.924832820892334, -8.909055709838867, -7.650751113891602, -6.111538410186768, -7.745747089385986, -5.9694342613220215, -7.448235988616943, -6.252921104431152, -6.285423278808594, -6.576879501342773, -7.744513511657715, -8.150556564331055, -6.460150718688965, -7.074395179748535, -4.118349552154541]
2024-01-08 13:55:04.713 | INFO | llmsearch.ranking:rerank:59 - [-0.5473086833953857, -3.4490861892700195, -4.017726898193359, -4.118349552154541, -5.740996360778809, -5.9694342613220215, -6.111538410186768, -6.219937801361084, -6.252921104431152, -6.285423278808594, -6.386524677276611, -6.460150718688965, -6.515655994415283, -6.576879501342773, -6.924832820892334, -7.063520431518555, -7.074395179748535, -7.356011390686035, -7.448235988616943, -7.650751113891602, -7.744513511657715, -7.745747089385986, -8.150556564331055, -8.909055709838867, -9.21086311340332]
2024-01-08 13:55:04.714 | INFO | llmsearch.ranking:get_relevant_documents:131 - New most relevant query: Could you provide me with some of the best methods for effectively marketing a product
2024-01-08 13:55:04.714 | INFO | llmsearch.ranking:get_relevant_documents:138 - Number of documents after stage 2 (dense + sparse): 25
2024-01-08 13:55:04.714 | INFO | llmsearch.ranking:get_relevant_documents:141 - Re-ranker avg. scores for top 5 resuls, chunk size 1024: -3.57
[chain/start] [1:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:StuffDocumentsChain > 2:chain:LLMChain] Entering Chain run with input:
{ "question": "Could you provide me with some of the best methods for effectively marketing a product", "context": "And, this book came out of my studies and experiences, you know, like \nresearching and reading and just living my life in a high pressure, high stakes \nenvironment. And I know that seems weird, but just like the best marketing \ndecision you can make for a product is to have a really good product that people \nwant, the best way to have writing that people want is to live a life and have \nexperienced the world in a way that allows you to communicate something to \npeople that they'd never heard before. \n \nI think it's especially true in fiction because at least in non-fiction, someone can \ngo out and study and objectively find, academics can write good non-fiction \nbooks based on their research. But, non-fiction, you have to be able to \ncommunicate all these intangibles to the reader. \n \nTim Ferriss: \nYou mean in fiction. \n \nRyan Holiday: Yeah, I'm sorry, in fiction. Yeah. You have to communicate all these intangibles \nabout life and relationships and how the world works. And if you haven't gone\n\nbest practices for marketing bestselling books. There are very few consensuses \nabout the best way to write a best-reading book, if that makes sense. \n \nI mean, that's part of the reason why I fell in love with \"Daily Rituals,\" which \nprofiles 170 or so world-famous creatives, whether it's writers, composers, \nscientists, etc. and how their daily schedules are laid out because they're so \ndifferent. It's really fascinating to me. \n \nDo you watch documentaries? If so, what are your favorite documentaries that \ncome to mind? \n \nRyan Holiday: I love documentaries. But I don't watch that much TV. So, I don't get to watch \nas many as I like because... yeah. But some favorites, I like \"Fog of War,\" I \nthink is amazing. \n \nThat Phil Spector documentary from a couple years ago is pretty crazy. I think \nit's called \"The Wall of Sound,\" but I forget what it's called exactly. There's the \nguy who did Fog of War has a new one out about Donald Rumsfeld that I want \nto see called the Unknown Known.\n\nSo, I want to talk about the most effective pair of productivity techniques that I have \ncome across since 2004 that have helped me up until this point test the uncommon \ndespite the fear of ridicule, criticism, failure, and so forth. And both techniques – I \ncheated a bit with the format. Some things we will repeat – are borrowed from stoicism, \nwhich was a school of philosophy from the Hellenistic period used by a lot of the Greco \nroman educated elite, including emperors, and military, and statesmen.\n\nthey wanted to plug by coming on the show? \n \nNeil Strauss: \nOh, yeah. I'll plug for you. I'll always tell somebody, and this is \ntrue: When you're going on, and you're trying to promote your \nbusiness, or your brand, or your book, or movie – whatever you're" }
[llm/start] [1:chain:StuffDocumentsChain > 2:chain:LLMChain > 3:llm:CustomLlamaLangChainModel] Entering LLM run with input:
{ "prompts": [ "### Instruction:\nUse the following pieces of context to provide detailed answer the question at the end. If answer isn't in the context, say that you don't know, don't try to make up an answer.\n\n### Context:\n---------------\nAnd, this book came out of my studies and experiences, you know, like \nresearching and reading and just living my life in a high pressure, high stakes \nenvironment. And I know that seems weird, but just like the best marketing \ndecision you can make for a product is to have a really good product that people \nwant, the best way to have writing that people want is to live a life and have \nexperienced the world in a way that allows you to communicate something to \npeople that they'd never heard before. \n \nI think it's especially true in fiction because at least in non-fiction, someone can \ngo out and study and objectively find, academics can write good non-fiction \nbooks based on their research. But, non-fiction, you have to be able to \ncommunicate all these intangibles to the reader. \n \nTim Ferriss: \nYou mean in fiction. \n \nRyan Holiday: Yeah, I'm sorry, in fiction. Yeah. You have to communicate all these intangibles \nabout life and relationships and how the world works. And if you haven't gone\n\nbest practices for marketing bestselling books. There are very few consensuses \nabout the best way to write a best-reading book, if that makes sense. \n \nI mean, that's part of the reason why I fell in love with \"Daily Rituals,\" which \nprofiles 170 or so world-famous creatives, whether it's writers, composers, \nscientists, etc. and how their daily schedules are laid out because they're so \ndifferent. It's really fascinating to me. \n \nDo you watch documentaries? If so, what are your favorite documentaries that \ncome to mind? \n \nRyan Holiday: I love documentaries. But I don't watch that much TV. So, I don't get to watch \nas many as I like because... yeah. But some favorites, I like \"Fog of War,\" I \nthink is amazing. \n \nThat Phil Spector documentary from a couple years ago is pretty crazy. I think \nit's called \"The Wall of Sound,\" but I forget what it's called exactly. There's the \nguy who did Fog of War has a new one out about Donald Rumsfeld that I want \nto see called the Unknown Known.\n\nSo, I want to talk about the most effective pair of productivity techniques that I have \ncome across since 2004 that have helped me up until this point test the uncommon \ndespite the fear of ridicule, criticism, failure, and so forth. And both techniques – I \ncheated a bit with the format. Some things we will repeat – are borrowed from stoicism, \nwhich was a school of philosophy from the Hellenistic period used by a lot of the Greco \nroman educated elite, including emperors, and military, and statesmen.\n\nthey wanted to plug by coming on the show? \n \nNeil Strauss: \nOh, yeah. I'll plug for you. I'll always tell somebody, and this is \ntrue: When you're going on, and you're trying to promote your \nbusiness, or your brand, or your book, or movie – whatever you're\n---------------\n\n### Question: Could you provide me with some of the best methods for effectively marketing a product\n### Response:" ] }
The most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.
They wanted to plug by coming on the show? Oh, yeah. I'll plug for you. I'll always tell somebody, and
llama_print_timings: load time = 8475.15 ms
llama_print_timings: sample time = 92.82 ms / 144 runs ( 0.64 ms per token, 1551.32 tokens per second)
llama_print_timings: prompt eval time = 16788.27 ms / 880 tokens ( 19.08 ms per token, 52.42 tokens per second)
llama_print_timings: eval time = 29640.72 ms / 143 runs ( 207.28 ms per token, 4.82 tokens per second)
llama_print_timings: total time = 47041.79 ms
[llm/end] [1:chain:StuffDocumentsChain > 2:chain:LLMChain > 3:llm:CustomLlamaLangChainModel] [47.06s] Exiting LLM run with output:
{ "generations": [ [ { "text": "The most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.\n\nThey wanted to plug by coming on the show? \nOh, yeah. I'll plug for you. I'll always tell somebody, and", "generation_info": null } ] ], "llm_output": null, "run": null }
[chain/end] [1:chain:StuffDocumentsChain > 2:chain:LLMChain] [47.06s] Exiting Chain run with output:
{ "text": "The most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.\n\nThey wanted to plug by coming on the show? \nOh, yeah. I'll plug for you. I'll always tell somebody, and" }
[chain/end] [1:chain:StuffDocumentsChain] [47.06s] Exiting Chain run with output:
{ "output_text": "The most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.\n\nThey wanted to plug by coming on the show? \nOh, yeah. I'll plug for you. I'll always tell somebody, and" }
============= SOURCES ==================
sample_docs/15-neil-strauss.pdf {'chunk_size': 1024, 'document_id': 'dcbf9a82-ae20-11ee-b544-0242ac1c000c', 'label': '', 'page': 20, 'score': -7.356011390686035}
** BEING EXTRACT
they wanted to plug by coming on the show?
Neil Strauss: Oh, yeah. I'll plug for you. I'll always tell somebody, and this is true: When you're going on, and you're trying to promote your business, or your brand, or your book, or movie – whatever you're
sample_docs/17-tim-ferriss-the-power-of-negative-visualization.pdf {'chunk_size': 1024, 'document_id': 'dc58d00e-ae20-11ee-b544-0242ac1c000c', 'label': '', 'page': 0, 'score': -4.017726898193359}
** BEING EXTRACT
So, I want to talk about the most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.
sample_docs/04-ryan-holiday.pdf {'chunk_size': 1024, 'document_id': 'dca64352-ae20-11ee-b544-0242ac1c000c', 'label': '', 'page': 19, 'score': -3.4490861892700195}
** BEING EXTRACT
best practices for marketing bestselling books. There are very few consensuses about the best way to write a best-reading book, if that makes sense.
I mean, that's part of the reason why I fell in love with "Daily Rituals," which profiles 170 or so world-famous creatives, whether it's writers, composers, scientists, etc. and how their daily schedules are laid out because they're so different. It's really fascinating to me.
Do you watch documentaries? If so, what are your favorite documentaries that come to mind?
Ryan Holiday: I love documentaries. But I don't watch that much TV. So, I don't get to watch as many as I like because... yeah. But some favorites, I like "Fog of War," I think is amazing.
That Phil Spector documentary from a couple years ago is pretty crazy. I think it's called "The Wall of Sound," but I forget what it's called exactly. There's the guy who did Fog of War has a new one out about Donald Rumsfeld that I want to see called the Unknown Known.
sample_docs/04-ryan-holiday.pdf {'chunk_size': 1024, 'document_id': 'dca64a32-ae20-11ee-b544-0242ac1c000c', 'label': '', 'page': 22, 'score': -0.5473086833953857}
** BEING EXTRACT
And, this book came out of my studies and experiences, you know, like researching and reading and just living my life in a high pressure, high stakes environment. And I know that seems weird, but just like the best marketing decision you can make for a product is to have a really good product that people want, the best way to have writing that people want is to live a life and have experienced the world in a way that allows you to communicate something to people that they'd never heard before.
I think it's especially true in fiction because at least in non-fiction, someone can go out and study and objectively find, academics can write good non-fiction books based on their research. But, non-fiction, you have to be able to communicate all these intangibles to the reader.
Tim Ferriss:
You mean in fiction.
Ryan Holiday: Yeah, I'm sorry, in fiction. Yeah. You have to communicate all these intangibles about life and relationships and how the world works. And if you haven't gone
============= RESPONSE =================
The most effective pair of productivity techniques that I have come across since 2004 that have helped me up until this point test the uncommon despite the fear of ridicule, criticism, failure, and so forth. And both techniques – I cheated a bit with the format. Some things we will repeat – are borrowed from stoicism, which was a school of philosophy from the Hellenistic period used by a lot of the Greco roman educated elite, including emperors, and military, and statesmen.
They wanted to plug by coming on the show? Oh, yeah. I'll plug for you. I'll always tell somebody, and
ENTER QUESTION >>
```