h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

CUDA out of memory error when querying All documents #1387

Closed Blacksuan19 closed 8 months ago

Blacksuan19 commented 8 months ago

I am getting a CUDA out of memory error when querying against All data in a collection. The collection is large and holds more data than can fit in the context. I'd expect h2ogpt to take only enough data to fill the context when querying against the entire collection; is there an option for that?

querying a selection from a collection

The selected files are smaller than the model's 16k context:

![image](https://github.com/h2oai/h2ogpt/assets/10248473/1936ac95-5fe1-4019-ac71-be11b8beccf3)

querying against the entire collection

The entire collection comes to more than 16k tokens:

![image](https://github.com/h2oai/h2ogpt/assets/10248473/62c139ae-241b-4b99-a151-7686aa30a7ff)
Full run command:

```bash
docker run \
    --gpus all \
    --runtime=nvidia \
    --shm-size=2g \
    -p 7860:7860 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u $(id -u):$(id -g) \
    -v "${HOME}"/h2ogpt_mistral/.cache:/workspace/.cache \
    -v "${HOME}"/h2ogpt_mistral/save:/workspace/save \
    -v "${HOME}"/h2ogpt_mistral/user_path:/workspace/user_path \
    -v "${HOME}"/h2ogpt_mistral/db_dir_UserData:/workspace/db_dir_UserData \
    -v "${HOME}"/h2ogpt_mistral/users:/workspace/users \
    -v "${HOME}"/h2ogpt_mistral/db_nonusers:/workspace/db_nonusers \
    -v "${HOME}"/h2ogpt_mistral/auth:/workspace/auth \
    -v "${HOME}"/h2ogpt_mistral/assets:/workspace/assets \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:$IMAGE_TAG /workspace/generate.py \
    --openai_server=False \
    --h2ogpt_api_keys="/workspace/auth/api_keys.json" \
    --use_gpu_id=False \
    --score_model=None \
    --prompt_type=open_chat \
    --base_model=TheBloke/openchat_3.5-16k-AWQ \
    --compile_model=True \
    --use_cache=True \
    --use_flash_attention_2=True \
    --attention_sinks=True \
    --sink_dict="{'num_sink_tokens': 4, 'window_length': $CONTEXT_LENGTH }" \
    --save_dir='/workspace/save/' \
    --user_path='/workspace/user_path/' \
    --langchain_mode="UserData" \
    --langchain_modes="['UserData', 'LLM']" \
    --visible_langchain_actions="['Query']" \
    --visible_langchain_agents="[]" \
    --use_llm_if_no_docs=True \
    --max_seq_len=$CONTEXT_LENGTH \
    --enable_ocr=True \
    --enable_tts=False \
    --enable_stt=False
```
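The command references two environment variables that are assumed to be exported beforehand; a minimal sketch of that setup (the image tag is a placeholder, and 16384 matches the 16k context mentioned above):

```bash
# Sketch of the environment assumed by the run command above.
export IMAGE_TAG=<your-h2ogpt-runtime-tag>   # placeholder, not a real tag
export CONTEXT_LENGTH=16384                  # matches the model's 16k context
```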
pseudotensor commented 8 months ago

FYI, the issue isn't directly related to the size of the collection; that by itself doesn't cause GPU OOM.

It's probably caused by a context that is too large for your GPU to handle. You can try reducing several related settings; a sketch of the corresponding flags follows the list:

- `top_k_docs` (smaller than the default of 10)
- `max_input_tokens` (set to some amount, e.g. 2048, instead of the default, which is based upon `top_k_docs` or the model limit)
- `max_total_input_tokens` (matters for summarization, while `max_input_tokens` applies per context use)
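As a rough sketch (flag names taken from the settings above; the values are illustrative starting points, not recommendations), these could be appended to the `generate.py` invocation in the run command:

```bash
# Sketch: extra flags appended to the existing /workspace/generate.py command.
# Tune the values to whatever the GPU can actually hold.
  --top_k_docs=4 \                 # retrieve fewer chunks than the default of 10
  --max_input_tokens=2048 \        # cap on prompt tokens per LLM call
  --max_total_input_tokens=4096    # cap across calls; mainly matters for summarization
```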

Will close for now, feel free to ask more.

Blacksuan19 commented 8 months ago

The error still occurs after setting `top_k_docs=1`, `max_input_tokens=10024`, and `max_total_input_tokens=10024`, despite being in query mode. For reference, the context size is 16384. Here is the full log:

```python
Traceback (most recent call last):
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/queueing.py", line 495, in call_prediction
    output = await route_utils.call_process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/route_utils.py", line 230, in call_process_api
    output = await app.get_blocks().process_api(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1590, in process_api
    result = await self.call_function(
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/blocks.py", line 1188, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 502, in async_iteration
    return await iterator.__anext__()
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 495, in __anext__
    return await anyio.to_thread.run_sync(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/h2ogpt_conda/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 478, in run_sync_iterator_async
    return next(iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/gradio/utils.py", line 661, in gen_wrapper
    response = next(iterator)
  File "/workspace/src/gradio_runner.py", line 4454, in bot
    for res in get_response(fun1, history, chatbot_role1, speaker1, tts_language1, roles_state1,
  File "/workspace/src/gradio_runner.py", line 4349, in get_response
    for output_fun in fun1():
  File "/workspace/src/gen.py", line 3781, in evaluate
    for r in run_qa_db(
  File "/workspace/src/gpt_langchain.py", line 5489, in _run_qa_db
    answer = yield from run_target_func(query=query,
  File "/workspace/src/gpt_langchain.py", line 5646, in run_target
    raise thread.exc
  File "/workspace/src/utils.py", line 472, in run
    self._return = self._target(*self._args, **self._kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/base.py", line 316, in __call__
    raise e
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/base.py", line 310, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/combine_documents/base.py", line 136, in _call
    output, extra_return_dict = self.combine_docs(
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/combine_documents/stuff.py", line 244, in combine_docs
    return self.llm_chain.predict(callbacks=callbacks, **inputs), {}
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/llm.py", line 293, in predict
    return self(kwargs, callbacks=callbacks)[self.output_key]
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/base.py", line 316, in __call__
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/base.py", line 310, in __call__
    self._call(inputs, run_manager=run_manager)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/llm.py", line 103, in _call
    response = self.generate([inputs], run_manager=run_manager)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain/chains/llm.py", line 115, in generate
    return self.llm.generate_prompt(
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 521, in generate_prompt
    return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 671, in generate
    output = self._generate_helper(
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 558, in _generate_helper
    raise e
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_core/language_models/llms.py", line 545, in _generate_helper
    self._generate(
  File "/workspace/src/gpt_langchain.py", line 1719, in _generate
    rets = super()._generate(prompts, stop=stop, run_manager=run_manager, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/langchain_community/llms/huggingface_pipeline.py", line 203, in _generate
    responses = self.pipeline(batch_prompts)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 219, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1143, in __call__
    outputs = list(final_iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/workspace/src/h2oai_pipeline.py", line 260, in _forward
    return self.__forward(model_inputs, **generate_kwargs)
  File "/workspace/src/h2oai_pipeline.py", line 296, in __forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 1479, in generate
    return self.greedy_search(
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/generation/utils.py", line 2340, in greedy_search
    outputs = self(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1154, in forward
    outputs = self.model(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/awq/modules/fused/model.py", line 101, in forward
    h, _, past_key_value = layer(
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/h2ogpt_conda/lib/python3.10/site-packages/awq/modules/fused/block.py", line 65, in forward
    attn_output, _, past_key_value = self.attn.forward(
  File "/h2ogpt_conda/lib/python3.10/site-packages/awq/modules/fused/attn.py", line 200, in forward
    scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
```
pseudotensor commented 8 months ago

Keep decreasing `max_input_tokens` until it works, although `top_k_docs=1` should also have worked.

You can set `--verbose=True` and inspect what is actually being passed as the prompt, to see what is going on.
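A sketch of how that inspection might look with the docker setup from the issue (the container name is a placeholder):

```bash
# Add verbose logging to the generate.py flags in the run command:
#   --verbose=True
# Then follow the container output to see the actual prompt being sent to the model:
docker logs -f <h2ogpt-container>   # placeholder container name or id
```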