h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Searching documents gets "stuck" for a moment when using an "unsupported" LLM. #1276

Closed ChiNoel-osu closed 10 months ago

ChiNoel-osu commented 10 months ago

I'm using ChatGLM3-6b as the LLM. It works normally in pure LLM mode. When used in doc query mode, it takes a long time to search the document (I believe something is blocking h2oGPT from doing so).

Using RelSources as the Subset here. stdout when I click Submit:

prompt: <|system|>
Sys</s>
<|user|>
Hello there</s>
<|assistant|>

The model 'ChatGLMForConditionalGeneration' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
prompt: <|system|>
Sys</s>
<|user|>
Hello there</s>
<|assistant|>

sim_search in 0.09814214706420898
prompt: <|system|>
Sys</s>
<|user|>
Pay attention and remember the information below, which will help to answer the question or imperative after the context ends.
"""

"""
According to only the information in the document sources provided within the context above: Hello there</s>
<|assistant|>

And it gets stuck here for a moment, like 20 or 30 seconds, and then the stdout continues:

prompt: <|system|>
Sys</s>
<|user|>
Hello there</s>
<|assistant|>

The model 'ChatGLMForConditionalGeneration' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FuyuForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
Distance: min: 0.62374347448349 max: 0.5229793787002563 mean: 0.6232390850782394 median: 0.6266895830631256
query: Hello there
answer: ****ACTUAL ANSWER WAS GENERATED HERE****

Despite the "not supported" message, it is able to generate the expected results; it just gets stuck in the middle for some time. However, if I use zephyr-7B-beta (or other "supported" LLMs), the "not supported" message is not there and document query is instant.

ChiNoel-osu commented 10 months ago

Commit: https://github.com/h2oai/h2ogpt/commit/635cefdd2634845e743000e09e47f88c9c23056e Env: Linux full install

pseudotensor commented 10 months ago

First, this is the prompt structure according to their chat code. The custom prompt used here:

python generate.py --base_model=THUDM/chatglm3-6b --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"

following: https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#adding-prompt-templates

Chat works fast like you said: image

But Q/A is slow to start, yes. I see the same thing.

image

I did install their package:

pip install cpm_kernels

So I suspect that their kernels are just bad, and it has nothing to do with h2oGPT etc.

pseudotensor commented 10 months ago

Basically the model is messed up once the context is filled somewhat: over-use of the CPU, etc.

ChiNoel-osu commented 10 months ago

Thanks for the info. I did notice max CPU usage during the "stuck" phase. But this didn't happen with another RAG app, Langchain-Chatchat. It uses FastChat though; I don't know if that's the difference.

pseudotensor commented 10 months ago

Maybe they limit the use of context even more to avoid the slowness. If that's true, you can do the same by setting --max_input_tokens to something smaller in the CLI/UI/API. Or just set --max_seq_len to something smaller. Or set --top_k_docs=3 (i.e. smaller).
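
For example, adding those flags to the earlier command (keeping the same --prompt_dict as above; these values are only illustrative, not recommendations):

python generate.py --base_model=THUDM/chatglm3-6b --prompt_type=custom --top_k_docs=3 --max_input_tokens=1024 --max_seq_len=2048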

There's nothing h2oGPT does itself; it's just transformers, like the code block here:

https://huggingface.co/THUDM/chatglm3-6b#%E4%BB%A3%E7%A0%81%E8%B0%83%E7%94%A8-code-usage

i.e. you should be able to use the same code without h2oGPT, pass in a large text for context, ask some question like a summarization, and it'll behave the same.
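
For example, something like this (a sketch following the chat code from that model card; the README.md path and prompt wording are just placeholders):

import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# Fill the context with a long document, then ask for a summary, similar to what h2oGPT builds.
context = open("README.md").read()
query = ('Pay attention and remember the information below.\n'
         '"""\n' + context + '\n"""\n'
         'Summarize the document above.')

t0 = time.time()
response, history = model.chat(tokenizer, query, history=[])
print("chat took %.1fs" % (time.time() - t0))
print(response)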

If you try it and it's fast, let me know.

ChiNoel-osu commented 10 months ago

All tests below were conducted with h2oGPT and the ChatGLM3-6B LLM.

Doc query mode, custom prompt.

I submitted a simple query; it gets stuck at the start for a moment and then finally produces the expected result (accuracy aside). image

Pure LLM mode, plain prompt.

I went to stdout and copied the same final prompt from the last test, then submitted that directly. image It also gives the expected result. There was no "stuck" phase; the answer was instant, just like normal chat mode. image

Therefore I think the model is working fine and has no problem with a large context.

pseudotensor commented 10 months ago

Hi, not sure what you are copy-pasting, and it seems like a small context, not a long document.

If you can give a specific reproducible example I can check, but screenshots don't allow that. Thanks.

ChiNoel-osu commented 10 months ago

If we use "RelSources", which I believe only shows the relevant document sources and doesn't actually involve LLM processing, there's still the same delay of about 15 seconds at the start. image

Also I've tested:

  1. Upload h2oGPT's README.md to the database (instructor-large).
  2. Ask "What is h2oGPT?".
  3. Get a 15s delay but a good response.
  4. Switch to pure LLM mode and change the prompt type to "plain".
  5. Copy the final prompt from the console and paste it directly into the chat box. In this case it looks like this:
    
    <|user|>Pay attention and remember the information below, which will help to answer the question or imperative after the context ends.
    """
    h2oGPT
    Turn ★ into ⭐ (top-right corner) if you like the project!
    Query and summarize your documents or just chat with local private GPT LLMs using h2oGPT, an Apache V2 open-source project.
    Private offline database of any documents (PDFs, Excel, Word, Images, Video Frames, Youtube, Audio, Code, Text, MarkDown, etc.)
    Persistent database (Chroma, Weaviate, or in-memory FAISS) using accurate embeddings (instructor-large, all-MiniLM-L6-v2, etc.)

Any CLI argument from python generate.py --help with environment variable set as h2ogpt_x, e.g. h2ogpt_h2ocolors to False. Set env h2ogpt_server_name to actual IP address for LAN to see app, e.g. h2ogpt_server_name to 192.168.1.172 and allow access through firewall if have Windows Defender activated. One can tweak installed h2oGPT code at, e.g. C:\Users\pseud\AppData\Local\Programs\h2oGPT. To terminate the app, go to System Tab and click Admin and click Shutdown h2oGPT.

If startup fails, run as console and check for errors, e.g. and kill any old Python processes. Full Windows 10/11 Manual Installation Script Single .bat file for installation (if do not skip any optional packages, takes about 9GB filled on disk). Recommend base Conda env, which allows for DocTR that requires pygobject that has otherwise no support (except mysys2 that cannot be used by h2oGPT). Also allows for TTS package by Coqui, which is otherwise not enabled currently in one-click installer.

For any platform, some packages download models at runtime, like for DocTR, Unstructured, BLIP, Stable Diffusion, etc. that appear to delay operations in the UI. The progress appears in the console logs. Windows 10/11 64-bit with full document Q/A capability One-Click Installer CPU or GPU: Download h2oGPT Windows Installer (1.3GB file)

H2O.ai have built several world-class Machine Learning, Deep Learning and AI platforms:

These one-click installers are experimental. Report any issues with steps to reproduce at https://github.com/h2oai/h2ogpt/issues. Note: The app bundle is unsigned. If you experience any issues with running the app, run the following commands: bash $ xattr -dr com.apple.quarantine {file-path}/h2ogpt-osx-m1-gpu $ chmod +x {file-path}/h2ogpt-osx-m1-gpu

To create a development environment for training and generation, follow the installation instructions. To fine-tune any LLM models on your data, follow the fine-tuning instructions. To run h2oGPT tests: bash pip install requirements-parser pytest-instafail pytest-random-order pip install playsound==1.3.0 pytest --instafail -s -v tests

for client tests

make -C client setup make -C client build pytest --instafail -s -v client/tests

for openai server test on already-running local server

Quality maintained with over 1000 unit and integration tests taking over 4 GPU-hours Get Started To quickly try out h2oGPT with limited document Q/A capability, create a fresh Python 3.10 environment and run:

ChiNoel-osu commented 10 months ago

I also noticed that the larger the database is, the longer the delay. A database with 1 README file has a 15s delay, while a database with 50 Word documents (403 chunks) has about a 70s delay. I tried another "unsupported" LLM, Baichuan2, and got the same result; the delay still exists in "RelSources" mode. Again, using a "supported" LLM eliminates the delay.

pseudotensor commented 10 months ago

@ChiNoel-osu I think that's just because the more you have in the context, the slower it is. When I tested things, I was uploading OpenAI's whisper.pdf from the internet, which is always long. If I leave the defaults, 10 document chunks will be placed into the context if possible, and that is always slow.

pseudotensor commented 10 months ago

I see, it's their tokenizer that's very slow. I use the tokenizer to do various things like checking that the input is not too large, etc. This usually takes a few ms for something like README.md, but here it's taking 10s.

Something seriously must be wrong with their tokenizer in the model.

You can avoid their tokenizer and instead do the below, which uses the llama2 tokenizer. It won't be quite right, but at least it won't be slow for RelSources.

python generate.py --base_model=THUDM/chatglm3-6b --tokenizer_base_model=h2oai/h2ogpt-4096-llama2-7b-chat --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"

So probably there is some bug in their tokenizer or in HF transformers' use of it.
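
A rough way to check that hypothesis (a sketch; the README.md path is just a placeholder) is to time token counting for the same text with both tokenizers:

import time
from transformers import AutoTokenizer

text = open("README.md").read()

for name in ["THUDM/chatglm3-6b", "h2oai/h2ogpt-4096-llama2-7b-chat"]:
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    t0 = time.time()
    n = len(tok(text)["input_ids"])
    print("%s: %d tokens in %.3fs" % (name, n, time.time() - t0))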

pseudotensor commented 10 months ago

Hmm, some odd bug in transformers, not sure of the reason. But if I have in h2oGPT:

if tokenizer:
    pass

then it is very slow just to see if the tokenizer exists.

But the same thing externally isn't too slow, though still slower than I'd expect. The below takes 0.1s just for the "if" part, which is really too slow.

import time

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True).half().cuda()

t0 = time.time()
if tokenizer:
    pass
print(time.time() - t0)


pseudotensor commented 10 months ago

With the above work-around to some odd transformers bug, it's no longer slow for general use. Thanks for your persistence. I'm not sure what the bug is in transformers that leads to such a simple thing being so slow.

python generate.py --base_model=THUDM/chatglm3-6b --prompt_type=custom --prompt_dict="{'PreInstruct': '<|user|>', 'PreResponse': '<|observation|>', 'chat_sep': '\n', 'chat_turn_sep': '\n', 'humanstr': '<|user|>', 'botstr': '<|observation|>', 'terminate_response': ['<|user|>']}"
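
A minimal sketch of how such a work-around can sidestep the suspected slow path (an assumption about the mechanism, not necessarily the exact change made in h2oGPT):

# Tokenizers define __len__ but not __bool__, so `if tokenizer:` ends up calling
# len(tokenizer), which appears to be extremely slow for this remote-code tokenizer.
# An explicit identity check avoids that path entirely.
if tokenizer is not None:
    pass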

pseudotensor commented 10 months ago

https://github.com/huggingface/transformers/issues/28456

ChiNoel-osu commented 10 months ago

Amazing! Updated to https://github.com/h2oai/h2ogpt/commit/9bcac0eab65ba4201ff8708d725d6ac8e676b5f2 and there's no delay anymore. Thank you!