https://huggingface.co/docs/transformers/en/model_doc/phimoe
Also, Phi-3.5 was released; probably worth adopting that as well.
https://ollama.com/library/zephyr
While Phi-3.5 was trained with flash attention, that requires CUDA to be configured.
As I don't have an Nvidia GPU (I don't even have an ARM MacBook), escaping from that and using something less restrictive is probably a good idea.
After a couple of rounds of research, I should go with either llama.cpp, ollama, or vLLM, as they have promising performance (gpt4all is actually built on top of llama.cpp).
ollama is preferred as it needs less configuration and is ready to go.
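For reference, this is roughly all it takes once ollama is running locally (a minimal sketch, assuming `ollama serve` is up and a model has been pulled first, e.g. `ollama pull phi3.5`; the model name is my assumption here):

```python
# Minimal sketch: query a locally running ollama server.
# Assumes `ollama serve` is running and the model was pulled first
# (e.g. `ollama pull phi3.5`); the model name is an assumption.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",  # ollama's default local endpoint
    json={
        "model": "phi3.5",
        "prompt": "Explain what a GGUF file is in one sentence.",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])
```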
On the other hand, with llama.cpp I have to get a GGUF version of the model from somewhere and configure the server myself. The complexity of finding or converting a model to GGUF is considerable.
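For contrast, a rough sketch of what the llama.cpp path involves (assumptions: the `huggingface_hub` package is installed, someone has already published a GGUF build of the model, and `llama-server` is available from a local llama.cpp build; the repo and file names below are hypothetical):

```python
# Sketch of the llama.cpp path: fetch a GGUF build of the model and
# launch llama.cpp's HTTP server against it. Repo/file names below are
# hypothetical; if nobody has published a GGUF, it would have to be
# converted manually with llama.cpp's convert_hf_to_gguf.py first.
import subprocess

from huggingface_hub import hf_hub_download

# 1. Find (or convert) a GGUF quantisation of the model somewhere.
model_path = hf_hub_download(
    repo_id="someone/Phi-3.5-mini-instruct-GGUF",  # hypothetical repo
    filename="phi-3.5-mini-instruct.Q4_K_M.gguf",  # hypothetical file
)

# 2. Configure and start the server yourself. This call blocks while
#    serving; the flags are the commonly used basics (model, port,
#    context size).
subprocess.run(
    ["llama-server", "-m", model_path, "--port", "8080", "-c", "4096"],
    check=True,
)
```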
https://medium.com/@naman1011/ollama-vs-vllm-which-tool-handles-ai-models-better-a93345b911e6
However, ollama is not designed to handle concurrent requests, and I should pick something intended to be exposed in production.
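A quick way to see that limitation for yourself (a sketch against a local ollama instance, reusing the assumed `phi3.5` model from above): if requests are queued rather than batched, total wall time grows roughly linearly with the number of parallel callers.

```python
# Sketch: fire a few parallel requests at a local ollama server and
# time them, to check whether they are served concurrently or queued.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def generate(prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "phi3.5", "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]

prompts = [f"Count to {n}." for n in range(1, 5)]
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate, prompts))
print(f"{len(results)} parallel requests took {time.perf_counter() - start:.1f}s")
```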
https://github.com/ggerganov/llama.cpp/discussions/6730
The options I can pick from pop up there: llama.cpp, vLLM, and TGI.
Check here for my llama.cpp experiments: https://github.com/neviaumi/experimental-llm-agent
I couldn't get vLLM or TGI working on my computer.
An ADR should be written for this decision, as it has already involved a lot of research.
gpt4all wasn't that popular a package, indeed.