opea-project / GenAIExamples

Generative AI Examples is a collection of GenAI examples, such as ChatQnA and Copilot, that illustrate the pipeline capabilities of the Open Platform for Enterprise AI (OPEA) project.
https://opea.dev
Apache License 2.0

[Feature] Deploy OPEA on a desktop with limited memory #888

Open RuijingGuo opened 1 week ago

RuijingGuo commented 1 week ago

Priority

P3-Medium

OS type

Ubuntu

Hardware type

AI-PC (Please let us know in description)

Running nodes

Single Node

Description

As an AI-PC or OPEA developer, I want to deploy OPEA on a desktop with limited memory (for example, a 32 GB desktop).

In my case, the OPEA deployment failed on 8 GB and 16 GB machines with the following TGI errors (the "signal 9" suggests the download/conversion process was killed, most likely by the kernel OOM killer):

```
tgi: 2024-09-29T02:10:45.234182Z WARN text_generation_launcher: 🚨🚨BREAKING CHANGE in 2.0🚨🚨: Safetensors conversion is disabled without --trust-remote-code because Pickle files are unsafe and can essentially contain remote code execution! Please check for more information here: https://huggingface.co/docs/text-generation-inference/basic_tutorials/safety
tgi: 2024-09-29T02:10:45.234250Z WARN text_generation_launcher: No safetensors weights found for model Intel/neural-chat-7b-v3-3 at revision None. Converting PyTorch weights to safetensors.
tgi: Error: DownloadError
tgi: 2024-09-29T02:10:58.798995Z ERROR download: text_generation_launcher: Download process was signaled to shutdown with signal 9
tgi: 2024-09-29 02:08:04.162 | INFO | text_generation_server.utils.import_utils::75 - Detected system cpu
tgi: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
```

Possible solutions:

1) Use ollama/llama.cpp, and provide an example of how to use ollama/llama.cpp (see the first sketch below).
2) Reduce the memory footprint of TGI (see the second sketch below).
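As a rough illustration of option 1, here is a minimal, hedged sketch of running the same model family under Ollama on a CPU-only desktop. The container name, port, and `neural-chat` model tag are illustrative assumptions, not an OPEA-blessed configuration; a 4-bit-quantized 7B model typically needs on the order of 4-5 GB of RAM, which should fit an 8-16 GB desktop.

```sh
# Start the Ollama server in a container (CPU-only, no GPU flags needed).
# Container name, volume, and port are example values.
docker run -d --name ollama \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

# Pull a quantized model; "neural-chat" is Ollama's packaging of the
# Intel neural-chat 7B family that the TGI log above tried to load.
docker exec ollama ollama pull neural-chat

# Smoke-test Ollama's native generate endpoint.
curl http://localhost:11434/api/generate \
  -d '{"model": "neural-chat", "prompt": "Hello", "stream": false}'
```

An OPEA pipeline would then need its LLM endpoint pointed at `http://localhost:11434` instead of the TGI service.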
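For option 2, a hedged sketch of shrinking TGI's footprint: pick a smaller model that already ships safetensors weights (avoiding the PyTorch-to-safetensors conversion that was killed above) and cap the token budgets. The image tag, model, and limit values below are assumptions to adapt, not tested recommendations; note that `--quantize bitsandbytes` will not help on this machine, since the log shows bitsandbytes was built without GPU support.

```sh
# Illustrative only: a ~1B-parameter model plus tight token limits to
# keep TGI's memory use low on a CPU-only host. All values are examples.
docker run --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --max-input-tokens 1024 \
  --max-total-tokens 2048 \
  --max-batch-prefill-tokens 1024
```

Even then, whether an 8 GB machine is workable depends on the model size; 16-32 GB leaves much more headroom.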

eero-t commented 1 week ago

Is this for a Kubernetes or a plain Docker OPEA deployment?

I.e., for which one are you expecting documentation on selecting a model and TGI parameters that fit within a given amount of memory?