ChiNoel-osu closed this issue 10 months ago.
Commit: https://github.com/h2oai/h2ogpt/commit/635cefdd2634845e743000e09e47f88c9c23056e
Env: Linux, full install
First, this is the prompt structure according to their chat code. Custom prompt used here:
python generate.py --base_model=THUDM/chatglm3-6b --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"
following: https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#adding-prompt-templates
Chat works fast like you said. But Q/A is slow to start, yes; I see the same thing.
I did install their package:
pip install cpm_kernels
So I suspect that their kernels are just bad, nothing to do with h2oGPT etc.
Basically the model gets messed up once the context is partially filled: over-use of CPU, etc.
Thanks for the info. I did notice max CPU usage during the "stuck" phase. But this didn't happen with another RAG app, Langchain-Chatchat. It uses FastChat though; I don't know if that's the difference.
Maybe they limit the use of context even more to avoid the slowness. If that's true, you can do the same here by passing --max_input_tokens with something smaller in the CLI/UI/API, or set --max_seq_len to something smaller, or set --top_k_docs=3 (i.e. smaller).
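For example (these values are only illustrative, not recommendations):
python generate.py --base_model=THUDM/chatglm3-6b --max_input_tokens=1024 --max_seq_len=2048 --top_k_docs=3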
There's nothing h2oGPT does itself; it's just transformers, like the code block here:
https://huggingface.co/THUDM/chatglm3-6b#%E4%BB%A3%E7%A0%81%E8%B0%83%E7%94%A8-code-usage
i.e. you should be able to use the same code without h2oGPT, pass in a large text for context, ask some question like "summarize", and it'll behave the same.
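Something like the below (a rough sketch based on that model card code; the README path and the question are just placeholders), where a long document is stuffed into the prompt:

import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# Paste a long document into the prompt as context, then ask about it
long_context = open("README.md").read()
query = long_context + "\n\nAccording to the context above, what is h2oGPT?"

t0 = time.time()
response, history = model.chat(tokenizer, query, history=[])
print(response)
print("took", round(time.time() - t0, 1), "s")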
If you try it and it's fast, let me know.
All tests below were conducted with h2oGPT and the ChatGLM3-6B LLM.
I submitted a simple query; it gets stuck at the start for a moment and then finally produces the expected result (accuracy aside).
I went to stdout, copied the same final prompt from the last test, and then submitted that directly. It also gives the expected result. There was no "stuck" phase; the answer was instant, just like normal chat mode.
Therefore I think the model is working fine and has no problem with a large context.
Hi, I'm not sure what you are copy-pasting, and it seems like a small context, not a long document.
If you can give a specific reproducible example I can check, but screenshots don't allow that. Thanks.
If we use "RelSources", which I believe only shows the relevant documents and doesn't actually involve LLM processing, there's still the same delay of about 15 seconds at the start.
Also I've tested:
<|user|>Pay attention and remember the information below, which will help to answer the question or imperative after the context ends.
"""
h2oGPT
Turn ★ into ⭐ (top-right corner) if you like the project!
Query and summarize your documents or just chat with local private GPT LLMs using h2oGPT, an Apache V2 open-source project.
Private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.)
Persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.)
Any CLI argument from python generate.py --help with environment variable set as h2ogpt_x, e.g. h2ogpt_h2ocolors to False. Set env h2ogpt_server_name to actual IP address for LAN to see app, e.g. h2ogpt_server_name to 192.168.1.172 and allow access through firewall if have Windows Defender activated. One can tweak installed h2oGPT code at, e.g. C:\Users\pseud\AppData\Local\Programs\h2oGPT. To terminate the app, go to System Tab and click Admin and click Shutdown h2oGPT.
If startup fails, run as console and check for errors, e.g. and kill any old Python processes. Full Windows 10/11 Manual Installation Script Single .bat file for installation (if do not skip any optional packages, takes about 9GB filled on disk). Recommend base Conda env, which allows for DocTR that requires pygobject that has otherwise no support (except mysys2 that cannot be used by h2oGPT). Also allows for TTS package by Coqui, which is otherwise not enabled currently in one-click installer.
For any platform, some packages download models at runtime, like for DocTR, Unstructured, BLIP, Stable Diffusion, etc. that appear to delay operations in the UI. The progress appears in the console logs. Windows 10/11 64-bit with full document Q/A capability One-Click Installer CPU or GPU: Download h2oGPT Windows Installer (1.3GB file)
H2O.ai have built several world-class Machine Learning, Deep Learning and AI platforms:
H2O-3
These one-click installers are experimental. Report any issues with steps to reproduce at https://github.com/h2oai/h2ogpt/issues. Note: The app bundle is unsigned. If you experience any issues with running the app, run the following commands: bash $ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu $ chmod +x {file-path}/h2ogpt-osx-m1-gpu
To create a development environment for training and generation, follow the installation instructions. To fine-tune any LLM models on your data, follow the fine-tuning instructions. To run h2oGPT tests: bash pip install requirements-parser pytest-instafail pytest-random-order pip install playsound==1.3.0 pytest --instafail -s -v tests
make -C client setup make -C client build pytest --instafail -s -v client/tests
Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours Get Started To quickly try out h2oGPT with limited document Q/A capability, create a fresh Python 3.10 environment and run:
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
""" According to only the information in the document sources provided within the context above: What is h2oGPT? <|observation|>
Submitting that directly gets a response in about 3 seconds, essentially instant, but a worse response. I'm not sure why the response is worse, but I think it doesn't matter here.
I also noticed that the larger the database, the longer the delay. A database with 1 readme file has a 15s delay, while a database with 50 Word documents (403 chunks) has about a 70s delay. I tried another "unsupported" LLM, Baichuan2, and got the same result; the delay still exists in "RelSources" mode. Again, using a "supported" LLM eliminates the delay.
@ChiNoel-osu I think that's just because the more you have in the context, the slower it is. When I tested things, I was uploading OpenAI's whisper.pdf from the internet; it's always long. If I leave the defaults, 10 document chunks will be placed into the context if possible, and that is always slow.
I see, it's their tokenizer that's very slow. I use the tokenizer for various things, like checking that the input is not too large. That usually takes a few ms for something like README.md, but here it's taking 10s.
Something must be seriously wrong with their tokenizer in the model.
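You can see it outside h2oGPT too with something like the below (a quick sketch; use any long text file you have handy):

import time
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

# Tokenize a long document and time it; for most tokenizers this takes a few ms
text = open("README.md").read()
t0 = time.time()
ids = tokenizer(text)["input_ids"]
print(len(ids), "tokens in", round(time.time() - t0, 3), "s")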
You can avoid their tokenizer and instead use the below, which uses the llama2 tokenizer. It won't be quite right, but at least it won't be slow for RelSources.
python generate.py --base_model=THUDM/chatglm3-6b --tokenizer_base_model=h2oai/h2ogpt-4096-llama2-7b-chat --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"
So probably there is some bug in their tokenizer, or in HF transformers' use of it.
Hmm, some odd bug in transformers, not sure of the reason. But if I have this in h2oGPT:
if tokenizer:
    pass
then it is very slow just to check whether the tokenizer exists.
But the same thing externally isn't too slow, though still slower than I'd expect. The below takes 0.1s just for the "if" part, which is really too slow.
import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()

# Time just the truth test on the tokenizer object
t0 = time.time()
if tokenizer:
    pass
print(time.time() - t0)
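My guess (not confirmed) is that the truth test "if tokenizer:" ends up calling the tokenizer's __len__, which this remote-code tokenizer may implement slowly; checking for existence explicitly avoids that call:

# Sketch of the kind of work-around applied (an assumption; the real fix in h2oGPT may differ).
# "if tokenizer:" can fall back to tokenizer.__len__(), so only compare against None.
if tokenizer is not None:
    pass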
With the above work-around to some odd transformers bug, it's no longer slow for general use. Thanks for your persistence. I'm not sure what the bug is in transformers that leads to such a simple thing being so slow.
python generate.py --base_model=THUDM/chatglm3-6b --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"
Amazing! Updated to https://github.com/h2oai/h2ogpt/commit/9bcac0eab65ba4201ff8708d725d6ac8e676b5f2 and there's no delay anymore. Thank you!
I'm using ChatGLM3-6b as the LLM. It works normally in pure LLM mode. When used in doc query mode, it takes a long time to search the document (I believe something is blocking h2oGPT from doing so).
Using RelSources as Subset here: stdout when I click Submit:
And it will get stuck here for a moment, like 20 or 30 seconds, and then the stdout continues:
Despite the "not supported" message, it's able to generate the expected results; it just gets stuck in the middle for some time. However, if I use zephyr-7B-beta (or other "supported" LLMs), the "not supported" message is not there and document query is instant.