ocdevel / gnothi

Gnothi is an open-source AI journal and toolkit for self-discovery. If you're interested in getting involved, we'd love to hear from you.
https://gnothiai.com
GNU Affero General Public License v3.0

Local LLM! #160

Open lefnire opened 1 year ago

lefnire commented 1 year ago

[Edit 7/20/23]: Let's use Llama 2. AWS / Azure might have hosted versions too, so no local needed.

If there's any ticket where I need engagement from the community, it's this one: adding the ability for users to use their own locally-hosted Large Language Model (LLM) like Llama, Vicuna, etc.

Current state: OpenAI

Going Premium unlocks OpenAI for (1) summaries & themes; (2) Prompt. Summaries & themes without OpenAI are lower accuracy, based on less sophisticated pre-trained Hugging Face models. But they shine for users who prefer isolated compute over OpenAI. So it's a sort of trade: quality vs transparency. Prompt, on the other hand, is binary - you only get it if using Premium.

One reason we charge for OpenAI usage is API costs, obviously. But another reason is the manual step someone goes through to put the site into OpenAI mode, with messaging around what's about to happen, OpenAI's T&S, etc. So it's a gate.

FWIW, we're looking into AWS's Titan and Google's LLM API. AWS takes no interest in user data, so we'd prefer them - but we haven't heard a peep on Titan for a while. Google, on the other hand, while they have a history with privacy, also have a history of fixing said issues; e.g. with GDPR. It's all new territory, and we're keeping on top of it.

I want to note: @BirdieLady and I use and trust OpenAI (Prompt) with our journals. If you want my opinion, trust it. But also follow your heart. We find this feature almost as valuable as the whole site itself; it veritably doubled Gnothi's value for us, with one single feature. Using it has brought incredible exploration of personal depth, new insights, and even actionable take-aways that have bettered our lives. So it would be a travesty for it to go unused / unexplored over the choice of backend API. So:

Ideal state: OpenAI or BYO LLM

For users who don't want to use OpenAI, but do want Prompt and the added quality of LLM summaries/themes, add the option to use one's own LLM. This would require:

  1. Gnothi-side: add a Webhook field in Settings somewhere, where a user can add their own URL / IP for an LLM API. This will still likely be behind Premium for consistency.
  2. User-side: you'll need to set up a local LLM hosted as an API; we as a community need to figure out an API spec here (see the sketch after this list).
    1. The user will need a powerful machine. Not necessarily a GPU though - llama.cpp was built for CPU inference. Just lots of RAM.
    2. The user exposes their machine's IP via Ngrok or port-forwarding on their router.
    3. Or instead of 1 & 2, the user uses a cloud GPU like RunPod, and we can collaborate on a manifest file / Dockerfile for orchestrating this service; the user just chooses their favorite model.
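
To make item 2 concrete, here's a minimal sketch of what the user-side service could look like - purely illustrative, since the API spec (route name, fields) is exactly what this ticket needs to settle. It assumes FastAPI plus llama-cpp-python; the `/generate` route and its fields are placeholders:

```python
# Hypothetical user-side service exposing a local model as an HTTP API.
# Nothing here is a settled spec -- route and field names are placeholders for discussion.
from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama  # llama-cpp-python bindings for llama.cpp

app = FastAPI()
llm = Llama(model_path="models/ggml-model-q5_0.bin")  # path to the user's quantized model

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")
def generate(req: GenerateRequest):
    # llama-cpp-python returns an OpenAI-style completion dict
    out = llm(req.prompt, max_tokens=req.max_tokens, temperature=req.temperature)
    return {"text": out["choices"][0]["text"]}
```

The user would then expose that port via ngrok or port-forwarding and paste the resulting URL into the Settings webhook field.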

We'll need to create a Wiki for setting all this up: a list of recommended models, which models to avoid for wellness purposes, etc.

Hosting tools

Most of the action is happening here https://www.reddit.com/r/LocalLLaMA/. This user recommended exploring Kobold.cpp as a consistent/simple API setup on the user's machine:

> Just use the KoboldAI API's generate endpoint, and let people configure their prompt format for the model for the various things you use the AI for, via settings. The KoboldAI generate endpoint is super simple, supports lots of models, and text-generation-webui and kobold.cpp also implement the same API endpoint (though perhaps on a different port), giving users lots of freedom in what software and model they run. Between them, there are tons of hardware acceleration options, too. This endpoint (or an encrypted and authenticated version of it) should just become the standard for local AI as a service, imho.
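
For reference, a call to that generate endpoint would look roughly like the following (port and field names are from my reading of the KoboldAI / koboldcpp API, so treat them as assumptions and verify against whichever host the user runs):

```python
import requests

# koboldcpp defaults to port 5001; KoboldAI United and text-generation-webui may differ.
KOBOLD_URL = "http://localhost:5001/api/v1/generate"

payload = {
    "prompt": "### User: Summarize the following journal entry...\n\n### Assistant:",
    "max_length": 300,     # tokens to generate
    "temperature": 0.7,
}
resp = requests.post(KOBOLD_URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```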

Alternatives include llama.cpp and text-generation-webui. I've personally played with llama.cpp; it wasn't at the level of simplicity I'd like to expect of our BYO-model users. There's also a project, web-llm, which would run the model in the user's browser, but that would require a very strong machine, and I fear its resource requirements could be more easily misunderstood (since it's so easy to "turn on") than if someone set up a BYO box at home. If anyone has experience / opinions here, please chime in!

Models

TL;DR: pick something from the HF Leaderboard. Keep an eye on Meta's Llama v2.

As for the models themselves: firstly, our best bet is anything quantized by TheBloke. His various GGML models take the original LLM and "minify" it via quantization, to the point where a 7B or 13B parameter model, previously requiring cloud GPUs, can be run on local hardware. We should really be targeting 7B models, as that's the sweet spot at around 16-32GB RAM requirements; 13B models get very taxing.

When I left off my exploration (April 2023), I was poking around the list below. However, ignore this list and instead go to the HF Leaderboard - it's constantly changing with the SOTA. Also, Meta, whose first model Llama sparked the revolution but which has licensing constraints, will release a v2 which will likely clobber the leaderboard (and without the licensing issues). My old list:

Background info:

Tasks

lefnire commented 1 year ago

Some notes on tech.

Quantization. You'll see things like ggml-q5_0. GGML is the quantization technique or technology; there are others like GPTQ, GGJT, etc. GGML was the preferred one when I researched; I hear there are newer/better techniques. Q means "quantized": a 32-bit floating-point weight in the model is rounded down to its 5- or 4-bit equivalent, which takes up that much less RAM & CPU. The last part (_0, _1, etc) I don't know; I just think of it as like the fractional part - higher is better quality. Our target should be q5_0 if using GGML, as q5 is evidently much better quality (it's actually newer tech; they call q4 "legacy").

Parameters. 3b, 7b, 13b, etc. You only see good results at 13b and up. Only in the 65b range are we competing with GPT-3.5. No local models yet push GPT-4, so GPT-3.5 is the current benchmark. Take note of that! If you're OK with the OpenAI privacy stuff, you should just use it - it's cheaper, easier, and significantly higher quality. OK, so if 65b is ideal quality, why go as low as 7b? Because 3b-13b is the range that can run on consumer hardware - or even cloud hardware, if not parallelized across multiple machines. 13b is something of a consumer-hardware upper limit, 3b is too strong a quality sacrifice, so 7b is the sweet spot.

So: unless someone knows better with more recent models, `ggml-7b-q5_0`.
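
As a back-of-the-envelope check on why 7b quantized is the sweet spot, here's a rough weights-only RAM estimate (real usage needs headroom for the OS, context/KV cache, etc., hence the 16-32GB machine above):

```python
def approx_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Very rough size of the model weights in GB (ignores activations, KV cache, etc.)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params in (3, 7, 13, 65):
    print(f"{params}b: ~{approx_weights_gb(params, 5):.1f} GB at q5, "
          f"~{approx_weights_gb(params, 32):.1f} GB at fp32")

# 7b at q5 is ~4.4 GB of weights vs ~28 GB at fp32 -- comfortably within consumer RAM
# (with headroom for everything else), while 65b stays cloud-only even when quantized.
```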

lefnire commented 1 year ago

I gave koboldcpp a spin; it seems pretty role-play centric. It might be worth looking into KoboldAI (not sure how it's different from koboldcpp, except maybe it's more of a one-size solution?). I tried text-generation-webui - it looks really promising. You download a model (eg GGML models from TheBloke) into a folder, choose the model, and select the prompt/instruct template from a dropdown (each model has a different template, like `### User: <prompt>\n\n### Assistant:` or `<|user|><prompt><|bot|><response><|end|>`), so this takes care of it for you.

I tried orca-mini-7b.ggmlv3.q2_K.bin, but it didn't deliver. I have a master prompt I always test on these models, based on a dream of mine: https://gist.github.com/lefnire/6b0537787175b33d86e9e1a2962af132. It replied in a properly-structured style, but unfortunately it just re-worded the first part (the Singularity template) rather than the second part (the content). I think due to a token-size limitation? I want to try falcon-7b-ggml, but it's incompatible with text-generation-webui, so I'd need to try something like lollms-webui. Falcon is high in the leaderboards presently. I also want to try RWKV since it has infinite context length (it's an RNN).

Anyway, one thing I love about text-generation-webui is that it has options for exposing the service as an API, including via ngrok - all through the UI. It also has installers (eg a Windows installer), so it's dirt simple for users to set up for their own usage. If anyone has more ideas on models or tools (to host as an API), LMK.
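
If we go the route the reddit comment above suggests - letting users configure their model's prompt format in Settings - the Gnothi-side logic could be a simple template fill. The template names and `{prompt}` placeholder below are made up for illustration:

```python
# Hypothetical: the user stores their model's instruct template in Settings,
# and we substitute the task-specific prompt before hitting their endpoint.
TEMPLATES = {
    "alpaca-style": "### User: {prompt}\n\n### Assistant:",
    "chatml-style": "<|user|>{prompt}<|bot|>",
}

def build_prompt(template_name: str, prompt: str) -> str:
    return TEMPLATES[template_name].format(prompt=prompt)

print(build_prompt("alpaca-style", "Find themes across these journal entries: ..."))
```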

lefnire commented 11 months ago

It looks like Microsoft has partnered with Meta for Llama 2 deployment on Windows & Azure. My thinking is we can get Llama 2 70b on an Azure deployment for most users (premium), so they can choose between Llama vs OpenAI. And free users can proxy to localhost for a WSL2-running local Llama 2 instance (or however Windows enables Llama, I haven't looked into it). This could be a very awesome answer to this ticket, if I'm understanding it correctly.

As for OpenAI models themselves: I just did some digging, and it looks like OpenAI only keeps submitted data for 30 days, and does not use the data to fine-tune. So if 30 days of retention is comfortable as long as the data isn't otherwise used, then the concerns here are less than with their prior data policy. I should express this more clearly somewhere on the site. HOWEVER, even better - Azure's GPT-4 policies indicate NO sort of retention, usage, or anything around submitted data. This could be a significant improvement if I'm understanding it correctly. Would love someone's thoughts if I haven't updated here before then.

lefnire commented 9 months ago

https://github.com/ocdevel/gnothi/tree/llama2 has Llama 2 7b quantized running on Lambda. But ctransformers, which runs TheBloke's quantized version, depends on GLIBC_2.29, which isn't available in Amazon Linux 2 (https://github.com/marella/ctransformers/issues?q=glibc). I tried a custom Dockerfile for the Lambda function extending Amazon Linux 2023, but it's too hard to get a custom Dockerfile to behave like Lambda (passing event & context to the handler), so I just gave up on ctransformers for Lambda.
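
For context, loading TheBloke's quantized Llama 2 through ctransformers looks roughly like this (simplified sketch; the model file name is illustrative) - it's the native library behind this load that needs GLIBC_2.29:

```python
# Roughly what the llama2 branch's Lambda handler attempts (simplified; file name illustrative).
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML",
    model_file="llama-2-7b-chat.ggmlv3.q5_0.bin",
    model_type="llama",
)

def handler(event, context):
    # Lambda entrypoint: generate a completion for the incoming prompt
    return {"text": llm(event["prompt"], max_new_tokens=300)}
```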

Incidentally this also rules out ctransformers for SageMaker, since SM uses Amazon Linux 2 as well! So our best bet is AWS Batch with an Alpine / Debian / AL2023 container. Which is fine, I've been needing to get out of Lambda-land for ML inference anyway. I'm gonna table this for now though, and I'm really hoping AWS launches Titan soon enough.