toverainc / willow-inference-server

Open source, local, and self-hosted highly optimized language inference server supporting ASR/STT, TTS, and LLM across WebRTC, REST, and WS
Apache License 2.0

current versions of grpcio/protobuf crash chatbot/utils.sh install 13B #99

Closed: hamishcunningham closed this issue 1 year ago

hamishcunningham commented 1 year ago

In both main and wisng there's a version conflict in requirements.txt that triggers this error on cd chatbot && ./utils.sh install 13B:

Saving a LlamaTokenizerFast to llama-13B-hf.
TypeError: Descriptors cannot not be created directly.

A workaround is to do this from within the container shell:

pip install grpcio-status==1.33.2 protobuf==3.19.6
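
A quick way to confirm the pins took effect inside the container is a version check like this (a minimal sketch):

    from importlib.metadata import version

    # Verify that the pinned versions are what actually got installed.
    print(version("protobuf"))       # expect 3.19.6
    print(version("grpcio-status"))  # expect 1.33.2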

See also https://github.com/huggingface/transformers/issues/21128#issuecomment-1384031722

kristiankielhofner commented 1 year ago

In wisng nothing in chatbot/ is required any longer and it should be removed (I just did) - it hasn't been required or updated in a long time, so there's no telling where this is coming from. The issue you referenced seems (from a quick read) to be related to TensorFlow.

If you want to use LLM support, you can copy settings.py to custom_settings.py and set support_chatbot: True. You can change the chatbot model path and basename to a local path or to anything from HuggingFace that's a GPTQ LLaMA model. As you can tell, we have a LOT of documentation updates to do :).
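
For example (a sketch; start from a copy of the real settings.py, since its exact structure may differ):

    # custom_settings.py - minimal sketch of the fields described above
    support_chatbot: bool = True

    # Either a HuggingFace repo id (downloaded and cached at runtime)
    # or a local path to a GPTQ LLaMA model:
    chatbot_model_path: str = 'TheBloke/vicuna-13b-v1.3-GPTQ'
    chatbot_model_basename: str = 'vicuna-13b-v1.3-GPTQ-4bit-128g.no-act.order'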

Generally I'm not surprised by this - protobuf and grpcio have long been the source of my biggest dependency problems with Python in this space - but let us know if you continue to run into this anywhere.

hamishcunningham commented 1 year ago

Tnx @kristiankielhofner! I tried this in custom_settings.py:

    # Path to chatbot model - download from HuggingFace at runtime by default (gets cached)
#   chatbot_model_path: str = 'TheBloke/vicuna-13b-v1.3-GPTQ'
    chatbot_model_path: str = 'models/vicuna'

    # Chatbot model basename
#   chatbot_model_basename: str = 'vicuna-13b-v1.3-GPTQ-4bit-128g.no-act.order'
    chatbot_model_basename: str = 'gptq_model-4bit-128g' 

and got these warnings:

willow-inference-server-wis-1    | [2023-07-04 07:41:00 +0000] [90] [INFO] CHATBOT: Using model models/vicuna and CUDA, attempting load (this takes a while)...
willow-inference-server-wis-1    | The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
willow-inference-server-wis-1    | The safetensors archive passed at models/vicuna/gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
willow-inference-server-wis-1    | [2023-07-04 07:41:29 +0000] [90] [INFO] Warming models...
willow-inference-server-wis-1    | [2023-07-04 07:41:34 +0000] [90] [INFO] Warming chatbot... This takes a while on first run.
willow-inference-server-wis-1    | Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
willow-inference-server-wis-1    | pip install xformers.
willow-inference-server-wis-1    | The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
willow-inference-server-wis-1    | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
willow-inference-server-wis-1    | To disable this warning, you can either:
willow-inference-server-wis-1    |  - Avoid using `tokenizers` before the fork if possible
willow-inference-server-wis-1    |  - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
willow-inference-server-wis-1    | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
willow-inference-server-wis-1    | To disable this warning, you can either:
willow-inference-server-wis-1    |  - Avoid using `tokenizers` before the fork if possible
willow-inference-server-wis-1    |  - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
willow-inference-server-wis-1    | huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...

Anything I should try and fix? And what would I expect it to do if it is working? Tnx!
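
The repeated tokenizers warning at least looks easy to silence - the message itself says to set TOKENIZERS_PARALLELISM before the fork, e.g. (a sketch):

    import os

    # Must be set before HuggingFace tokenizers are first used in the process.
    os.environ["TOKENIZERS_PARALLELISM"] = "false"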

hamishcunningham commented 1 year ago

I found https://localhost:19000/chatbot/ :) Closing this as solved, tnx!