toverainc / willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Apache License 2.0

[Feature] Support Chatbot to use other LLM models such as ChatGLM-6B #84

Open fz68 opened 1 year ago

fz68 commented 1 year ago

https://github.com/THUDM/ChatGLM-6B

https://huggingface.co/THUDM/chatglm-6b-int4

It would be best if you could provide some guidance. Thank you.

kristiankielhofner commented 1 year ago

I can't make much sense of those links (I can't speak/read Chinese and translation quality is poor) but we support LLaMA based models today.

However, this is mainly via Transformers from HuggingFace. We do some quantization to int4 via GPTQ currently and provide a little extra "help" for LLaMA based models, but there isn't anything stopping you from using other models.

Eventually we would like to come up with a modular abstracted API approach where users can define any pytorch/transformers model for any functionality and it can be integrated into WIS with the ability to pipeline functionality. So things like:

Willow/WebRTC/etc -> STT -> Translation -> LLM -> an API -> TTS

I just haven't quite come up with a good way to go about it yet.
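(For illustration, one way to sketch that kind of modular pipeline is a chain of stages where each stage is any callable - a transformers pipeline, an HTTP call to an external API, a TTS engine, and so on. The Stage/Pipeline names below are hypothetical, not an existing WIS API.)

# Hypothetical sketch of a composable pipeline; Stage/Pipeline are not WIS APIs.
from typing import Callable, List

Stage = Callable[[str], str]  # each stage maps text (e.g. a transcript) to text

class Pipeline:
    def __init__(self, stages: List[Stage]):
        self.stages = stages

    def run(self, data: str) -> str:
        # Feed the output of each stage into the next one.
        for stage in self.stages:
            data = stage(data)
        return data

# e.g. Willow/WebRTC audio -> STT -> translation -> LLM -> external API -> TTS,
# where stt, translate, llm, call_api, tts are user-defined callables:
# pipeline = Pipeline([stt, translate, llm, call_api, tts])
# result = pipeline.run(transcript)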

GRMrGecko commented 1 year ago

One thing you could possibly do to support multiple LLM models is to utilize an existing LLM webui and its API. The one I'm familiar with is https://github.com/oobabooga/text-generation-webui/blob/main/api-examples/api-example-chat.py which has a fairly robust API available. I may be interested in helping some as well, I just do not know how much time I'll have available with work. Let me know where help could be used with LLM and I'll be happy to try and look.
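(For reference, a minimal sketch of calling the text-generation-webui blocking API as of the linked api-example scripts might look roughly like the following; the endpoint, port, and field names are assumptions based on the examples at that time and have changed in later releases.)

# Rough sketch of calling text-generation-webui's blocking API; endpoint and
# field names are assumptions based on the api-example scripts of that era.
import requests

HOST = "http://127.0.0.1:5000"  # adjust to wherever the webui API is listening

def generate(prompt: str) -> str:
    request = {
        "prompt": prompt,
        "max_new_tokens": 250,
        "temperature": 0.7,
    }
    response = requests.post(f"{HOST}/api/v1/generate", json=request, timeout=120)
    response.raise_for_status()
    return response.json()["results"][0]["text"]

# print(generate("USER: What is the capital of France?\nASSISTANT:"))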

kristiankielhofner commented 1 year ago

I use text-generation-webui myself to test various LLMs. It's great!

However, directly integrating it with WIS in any way would likely be difficult (dependencies). We already receive quite a bit of feedback on how "heavy" wisng is, and this isn't unwarranted - wisng already has over 130 Python package dependencies and has fairly high RAM usage and disk space requirements as a result.

Our preferred approach (I'm planning on implementing it today) is to support any non-delta-based, LLaMA-compatible model available on HuggingFace. As LLaMA-based models are essentially the de facto standard in the open source LLM world, this change alone will support many more models without drastically increasing complexity, RAM usage, or image size.

I suppose one could modify docker-compose or run it completely separately and wire it up to wisng via API but it's not something currently planned for implementation.

GRMrGecko commented 1 year ago

Sounds like you have a well-thought-out plan. I am thankful that you decided to make this project open source, and I will be happy to help if you are looking for it.

kristiankielhofner commented 1 year ago

What I described above has been committed to the wisng branch. The default model is TheBloke/vicuna-13b-v1.3-GPTQ but you should be able to use just about any other LLaMA-based, GPTQ-compatible model (I've also tested with 7B). You can now also pass all of the important pipeline parameters in the API request.

If you'd like to test it that would be great!
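(For anyone testing, loading a LLaMA-based GPTQ model via AutoGPTQ and handing it to a transformers text-generation pipeline looks roughly like the sketch below. This is not the exact WIS code, just the general shape of it; the model name is the default mentioned above, and the generation parameters shown are examples of the kind of per-request knobs the API exposes.)

# Minimal sketch of loading a LLaMA GPTQ model with AutoGPTQ + transformers.
# Not the exact WIS code; argument names per auto-gptq at the time.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, pipeline

model_name = "TheBloke/vicuna-13b-v1.3-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_name, use_triton=True, device="cuda:0")

# Passing the quantized model to pipeline() produces the "not supported for
# text-generation" warning seen in the log below, but it still works.
chatbot = pipeline("text-generation", model=model, tokenizer=tokenizer)

result = chatbot("USER: What is Willow?\nASSISTANT:", max_new_tokens=128, temperature=0.7, top_p=0.95)
print(result[0]["generated_text"])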

GRMrGecko commented 1 year ago

Giving it a try on my P41, I ran across a bug which is fixed upstream but not yet in a release:

willow-inference-server-wis-1    | [2023-06-21 17:20:39 +0000] [89] [INFO] Warming chatbot... This takes a while on first run.                                                 
willow-inference-server-wis-1    | The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
willow-inference-server-wis-1    | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...       
willow-inference-server-wis-1    | To disable this warning, you can either:                                                                                                                
willow-inference-server-wis-1    |      - Avoid using `tokenizers` before the fork if possible                                                                                             
willow-inference-server-wis-1    |      - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)                                                                    
willow-inference-server-wis-1    | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...       
willow-inference-server-wis-1    | To disable this warning, you can either:                                                                                                                
willow-inference-server-wis-1    |      - Avoid using `tokenizers` before the fork if possible                                                                                             
willow-inference-server-wis-1    |      - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)                                                                    
willow-inference-server-wis-1    |  worker [main:app]: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.

Researching, I found the bug was fixed by https://github.com/openai/triton/pull/1505, which is not yet in a current release of the project. I was able to update Triton by adding the following to the Dockerfile after the auto-gptq installation:

# Update triton
RUN pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

After running with the above change, I got the following:

willow-inference-server-wis-1    |   File "/usr/local/lib/python3.8/dist-packages/auto_gptq/nn_modules/triton_utils/custom_autotune.py", line 92, in <dictcomp>
willow-inference-server-wis-1    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}                                                        
willow-inference-server-wis-1    |   File "/usr/local/lib/python3.8/dist-packages/auto_gptq/nn_modules/triton_utils/custom_autotune.py", line 75, in _bench
willow-inference-server-wis-1    |     except triton.compiler.OutOfResources:                                                                                                              
willow-inference-server-wis-1    | AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

Researching, I found https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/174 but no real solution right now. I increased the shm size to 64 GB and watched nvidia-smi to see if it was VRAM; I'm not sure what resource it's referring to at the moment. I may have to wait for upstream to fix it.

For the moment, I can get it to run by disabling Triton, which isn't ideal since, from what I see, it seems to fall back to the CPU. I may play with this more in the future, but that is my current situation.

kristiankielhofner commented 1 year ago

We don't use GPTQ-for-LLaMa. We use AutoGPTQ with Triton.

Triton requires at least compute capability 7 - hacks to make it run on pre-Volta cards are interesting but pretty impractical. Additionally, the P4 has very low VRAM for running an LLM. I've seen guides on how to get it to load with llama.cpp by swapping out between RAM and VRAM but it's not something we are going to support. Pascal is already extremely slow for an LLM and the P4 is among the slowest of the Pascal cards. They work well for all other functionality in WIS but LLMs are a different thing entirely.

I've added checks to force-disable the chatbot on cards that are either pre-Turing or have less than 12 GB of VRAM.
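(For reference, a check along those lines can be done with plain torch; this is a sketch, not the exact WIS code, using the pre-Turing / 12 GB thresholds from the comment above - Turing corresponds to compute capability 7.5.)

# Sketch of a chatbot-enable check: require Turing or newer (compute capability
# >= 7.5) and at least 12 GB of VRAM. Not the exact WIS code.
import torch

def chatbot_supported(device: int = 0, min_vram_gb: float = 12.0) -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device)
    total_vram_gb = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)
    return (major, minor) >= (7, 5) and total_vram_gb >= min_vram_gb

print(chatbot_supported())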

GRMrGecko commented 1 year ago

Wow, you're right... Running the text-generation-webui on the P41 is slow. Was hoping to save a buck by getting it.

Output generated in 47.55 seconds (0.61 tokens/s, 29 tokens, context 72, seed 942724589)

Guess I'm going to have to wait until I feel like buying a GPU that is better suited.

kristiankielhofner commented 1 year ago

I've learned from this that we need to be clearer in the docs. I (wrongly) assumed that anyone interested in the LLM functionality would eat, live, and breathe this stuff like I do - and that people would understand that (basically) if you don't have at least an RTX 30xx there's no way for an LLM to offer anything close to reasonable performance.

With our current settings even my RTX 4090 takes three seconds or so depending on prompt and various settings.

GRMrGecko commented 1 year ago

I have a 3080 12GB in my gaming PC, but the UI eats a lot of the VRAM and I wouldn't want it to be my primary server for this. I have a home lab with quite a few systems in a rack, but it's mostly outdated hardware these days apart from the Threadripper. As such, I may be able to test it further on my gaming PC, but I would have to use a lower grade model. I am definitely learning a lot through this.

GRMrGecko commented 1 year ago

Playing with my 3080 12GB, WIS is working fine; however, I had to swap out the 13B model for a 7B model, TheBloke/wizardLM-7B-GPTQ in my case. I also had to remove the VRAM limit you added, but I figure people who want to run on cards with lower VRAM would figure that out. Just a note that some models seem to continue the response by also providing a future user question:

Asking chatbot: Calculate 4-6
Please wait...

Chatbot says: 
What is the question?
USER: How many sides does a regular hexagon have?

May be worth adding a filter for that for people who choose to try different models.
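(Something like a simple stop-string filter would be a starting point; this is a sketch, and the exact marker strings depend on the prompt template the model was trained with.)

# Sketch of trimming a model response at the first follow-on turn marker.
# The markers ("USER:", "### Human:", ...) depend on the prompt template.
STOP_MARKERS = ["USER:", "### Human:"]

def trim_response(text: str) -> str:
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            text = text[:idx]
    return text.strip()

print(trim_response("What is the question?\nUSER: How many sides does a regular hexagon have?"))
# -> "What is the question?"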

GRMrGecko commented 1 year ago

Do you want someone to revamp your chatbot UI demo to act more like a standard conversation UI some people expect to see? I'm asking before I work on it because you seem like someone who knows the direction he wants to go, and I do not want to work on something unless I know it would save you some time.

kristiankielhofner commented 1 year ago

Making the VRAM limit for disabling chatbot configurable is a great idea! I just kind of threw that in there by eyeballing all Whisper models + TTS + 13b-int4.

Down the road we'd like to make all model loading, etc configurable so people can (for example) only load the Whisper models they actually use (large alone is kind of ridiculous for Willow use cases).
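(As a sketch of what that kind of configuration could look like - the variable names here are hypothetical, not an existing WIS config - something environment-driven would probably suffice:)

# Hypothetical environment-driven settings; names are illustrative only.
import os

def env_list(name: str, default: str) -> list:
    return [item.strip() for item in os.environ.get(name, default).split(",") if item.strip()]

CHATBOT_ENABLED = os.environ.get("CHATBOT_ENABLED", "true").lower() == "true"
CHATBOT_MIN_VRAM_GB = float(os.environ.get("CHATBOT_MIN_VRAM_GB", "12"))
WHISPER_MODELS = env_list("WHISPER_MODELS", "base,medium")  # e.g. skip large if unused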

Hah, the "chatbot UI demo"... That's what it looks like when I try to do UI/UX (it's not pretty). If you want to work at it I'd be happy to review/merge the PR!

GRMrGecko commented 1 year ago

Do you have plans on having history on the chatbot so you can ask it follow-up questions? Maybe we can use a UUID (similar to the idea in https://github.com/toverainc/willow/issues/74) to keep short history on the inference server and auto-purge it after about a minute of inactivity. I'm happy to look into implementing something like that.
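(A sketch of that idea - a hypothetical helper, not an existing WIS API: keep per-conversation history keyed by UUID and drop entries after ~60 seconds of inactivity.)

# Sketch of UUID-keyed chat history with purge after ~60s of inactivity.
# Hypothetical helper, not an existing WIS API.
import time
import uuid

HISTORY_TTL_SECONDS = 60
_histories = {}  # conversation id -> {"last_used": float, "turns": [(user, assistant), ...]}

def add_turn(conversation_id: str, user: str, assistant: str) -> None:
    entry = _histories.setdefault(conversation_id, {"last_used": 0.0, "turns": []})
    entry["turns"].append((user, assistant))
    entry["last_used"] = time.monotonic()

def get_history(conversation_id: str):
    purge_stale()
    entry = _histories.get(conversation_id)
    return entry["turns"] if entry else []

def purge_stale() -> None:
    now = time.monotonic()
    stale = [cid for cid, e in _histories.items() if now - e["last_used"] > HISTORY_TTL_SECONDS]
    for cid in stale:
        del _histories[cid]

conversation_id = str(uuid.uuid4())
add_turn(conversation_id, "Calculate 4-6", "-2")
print(get_history(conversation_id))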

kristiankielhofner commented 1 year ago

Great question!

Ideally we'd work in langchain or similar, potentially combined with device and/or speaker tracking.

What complicates this slightly is our overall approach to "agent" type functionality - it will likely have to be implemented broadly, beyond just LLM functionality, and potentially not even in WIS at all (we're looking at Rasa or similar for WAS). As we've discussed, LLM support is quite heavy, while contextual awareness is much broader within Willow and should be accessible to users that either can't run an LLM or don't want to. The NLU/NLP approaches used by Rasa and others are very lightweight and can be integrated across the stack - including alongside an LLM.

tensiondriven commented 1 year ago

Quick note to say I'm so glad I found this project! It's exactly what I've been looking for, for a whole-home always-available voice transcription service. (My intention is to find a way to run it without a "wake word", wherein it will always be listening and transcribing over a number of devices, and passing content to a local LLM, as well as indexing it for later use; just sharing that for some context on my use case.)

Regarding LLM support:

modular abstracted API approach

I'd strongly advocate for supporting an existing API interface first and then secondarily incorporating an LLM directly inside the inference server. This is for a couple of reasons:

I'm so stoked to have found Willow, please don't take my strong words as criticism, on the contrary, I very much care about this project and I'm really impressed!

kristiankielhofner commented 1 year ago

I don't take this as criticism at all!

In the same vein, and not to be dismissive, but your intended use case is essentially impossible.

When listening all of the time you will be sending X audio streams from X devices. This will incur substantial network overhead (2.4 GHz wifi) and incredible resource consumption. Even with an additional VAD step like Silero or similar, Whisper will hallucinate like crazy, feeding gibberish text to your LLM, which for all of these streams will require SUBSTANTIAL resources - with Whisper and the LLM spending most of their time taking pure garbage as input and generating even worse garbage as output. So now you're running three models in sequence across X streams...

Without wake word activation you will also have the additional challenge (even with a perfect VAD implementation) of taking any other speech in the environment - music, media, etc. - and passing it through Whisper, the LLM, and so on. Absent even more models to do speaker diarization/recognition, you will be unable to discern what any of this audio, the resulting transcripts, and the LLM output actually represent. Then, of course, none of these implementations are perfect, so you will be chasing issues constantly. You're basically in multiple-RTX-4090 territory at this point (depending on the number of audio streams) and frankly I don't see how this would result in anything other than a waste of a lot of power, time, and money.

Again, in my opinion, this approach is effectively impossible to do anything meaningful with, even in the unlikely event that you get past these challenges. It is a cool concept, I suppose, but I'm just giving some hopefully constructive feedback (I could go on and on) on what the realities of this would look like before you get your hopes up.

I'd also like to add - LLMs are very cool and all the rage ATM for good reason. I think they're a good tool when used for the right job. However, I find that with Willow and elsewhere in the ecosystem people more-or-less equate them to "all things AI" and don't know of or consider approaches like good old-fashioned NLU/NLP, which are significantly better for the tasks they were designed for - like many Willow use cases. See Rasa as one example of an implementation/integration we are evaluating. I'm glad LLMs are bringing more people to "AI" and I'm excited to see the development in the space, but even with the magic that is "AI" there is such a thing as the wrong tool for the job - and LLMs are often the wrong tool.

Our intention is not to directly support every LLM or the latest LLM from 10 minutes ago (which will be superseded by another in 30). As you're clearly well aware (I follow the space closely as well) it is advancing at breakneck speed. Our support for LLMs is intended to be practical for Willow use cases, work out of the box, run on consumer GPUs (VRAM limitations) in parallel with Willow ASR, TTS, and speaker verification models, offer reasonable performance in the latency users expect from a voice application, and be relatively easy to get started with (change a config value to True). As an example, I have no idea what the practical use case for Willow would be with an LLM model with 8k context. I may not be understanding your use case but just a few thoughts.

Anyway, one of the things we have planned for the Willow Application Server is APIs to hook whatever you want - so if you want to go through the extra steps to deploy your LLM and expose it with an API we can send speech recognition output, get your API results, do TTS, and send them to the Willow device (as one example). Or you can hook ChatGPT. Whatever.

Of course with all of this the beauty of open source is the flexibility, power, and ability to do whatever you want so of course feel free to use Willow and any other tools in any way you see fit!

tensiondriven commented 1 year ago

Brilliant reply, thank you. I'm optimistic that something like what I envision will be possible somewhere between "soon" and "eventually". I don't disagree with any of your points, and really appreciate someone with your level of experience spending several paragraphs on a reply :)

As much as I want to dive into some of the points you mentioned and hopefully get a little more coaching/advice from you, I think i'll save that ramble for another forum, at a different time.

Regarding old-fashioned NLP, I couldn't agree more :) One thing I was looking at before LLaMAs went 8k was how to use named entity recognition to "tag" LLM input/output to build a sort of "history of events" that could be used in place of LLM summarization. I also find it amusing when I read posts on /r/localllama where people ask how to use LLMs to do sentiment analysis. It's like driving a semi-truck to the corner store. You can do it, and it'll work, but you're probably better off grabbing a bicycle.

Also, re: multiple 4090s; I'm currently running 2x 3090s on Intel, and have become so frustrated with the limitations in terms of PCIe slots and lanes that I've decided to try building a cheap Epyc-based server using the ASRock Rack ROMED8-2T/BCM, a $200 motherboard with 7 PCIe x16 slots. While I don't intend to fill it with 3090s on day 1, having the option to go past 48 GB of VRAM will be nice for multiple models, present and future.

kristiankielhofner commented 1 year ago

"like driving a semi-truck to the corner store" - well put, I'm going to borrow this :).

They can be hard to find and they carry a price premium, but I went on a tear buying up the only dual-slot 3090 ever made. Just thought I'd throw that out there if you're going for density.