Let's collaborate - GitHub Issues

[apologies for early send, accidentally hit enter]

Hey there! Turns out we think on extremely similar wavelengths - I did the exact same thing as you, for the exact same reasons (libraryification), and through the use of similar abstractions: https://github.com/philpax/ggllama

Couple of differences I spotted on my quick perusal:

My version builds on both Windows and Linux, but fails to infer correctly past the first round. Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.
I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.
I vendored llama.cpp in so that I could track it more directly and use its ggml.c/h, and to make it obvious which version I was porting.

Given yours actually works, I think that it's more promising :p

What are your immediate plans, and what do you want people to help you out with? My plan was to get it working, then librarify it, make a standalone Discord bot with it as a showcase, and then investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency.

Hi! Thanks for your post and for all the help so far :smile:

Turns out we think on extremely similar wavelengths

Glad to hear I'm not the only one who saw the potential in this project! I think having the potential to build something this huge and making it a CLI app is not aiming high enough, heh.

Windows performance is also pretty crappy because ggml doesn't support multithreading on Windows.

To be fair, I haven't even tested this on Windows. Maybe it builds just fine. But I didn't know what flags to set to compile with AVX and inference times without AVX are pretty bad (if you're telling me there's no multithreading on top of that, it's probably gonna be unusably slow anyway, unfortunately). I don't have a Windows machine to test this, so I didn't want to promise support for untested systems :sweat_smile:

I use PhantomData with the Tensors to prevent them outliving the Context they're spawned from.

I thought about this! Even started with this design. But didn't want to force all the code to use a 'ctx lifetime annotation, since those typically become infectious and are a bit unergonomic. So I did the Arc / Weak thing instead, since the performance of cloning or operating with tensor pointers doesn't really matter. Still, seeing how it turned out in the end, I think the lifetime annotation wouldn't have been so bad :thinking:.
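For readers following along, here is a minimal sketch of the Arc / Weak idea, assuming a much simpler setup than the real ggml.rs. The names `Context`, `ContextInner`, `Tensor` and `context_name` are hypothetical and exist only for illustration. The tensor holds a weak handle instead of a borrow, so no `'ctx` annotation is needed, and a dead context is detected at runtime rather than at compile time.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for the state owned by a ggml context.
struct ContextInner {
    name: String,
}

/// Owning handle; dropping the last `Context` frees the inner state.
struct Context {
    inner: Arc<ContextInner>,
}

/// A tensor keeps only a weak reference, so it carries no lifetime parameter.
struct Tensor {
    ctx: Weak<ContextInner>,
}

impl Context {
    fn new(name: &str) -> Self {
        Context {
            inner: Arc::new(ContextInner { name: name.to_string() }),
        }
    }

    fn new_tensor(&self) -> Tensor {
        Tensor { ctx: Arc::downgrade(&self.inner) }
    }
}

impl Tensor {
    /// Returns the owning context's name, or `None` if the context is gone.
    fn context_name(&self) -> Option<String> {
        self.ctx.upgrade().map(|ctx| ctx.name.clone())
    }
}
```

The trade-off is that every access has to handle the "context already dropped" case, which the lifetime-based design rules out before the program runs.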

That said, the "bindings" in ggml.rs do not aim to be a safe abstraction, at least not in their current state! We would need to put a bit more thought into it, because the ownership model is a bit weird there. A tensor is tied to a context, but you can then use tensors from one context to operate with tensors from another context, which makes a graph computation on one context access data from other contexts. So not even lifetime annotations would help here.

What are your immediate plans, and what do you want people to help you out with?

Very good question! So far I was most concerned with whether I could do it :rofl: But now that the library exists, I'm thinking it would be pretty good to start improving this. A few things off the top of my head:

Library-fication: This needs to happen ASAP. Split the current llama-rs crate into two crates, llama-rs would be a library, and llama-rs-cli would be the simple example CLI app we have now. I don't have much interest in making the CLI experience better (porting things like the interactive mode or terminal colors from llama.cpp), but they're welcome in case someone wants to contribute.
Add a server mode, perhaps as an addition to llama-rs-cli that would allow spawning a long-running process that can serve multiple queries. The current usage model doesn't make any sense. You spend a lot of time loading the models from disk (especially if you're using the larger ones) only to throw all that away after a single prompt generation.
Prompt caching: Another thing that doesn't make sense with the current model is when you have a huge prompt, because you need to feed it through the network every time you want to do inference with that "pre-prompt" + some user input. This is described in an issue in llama.cpp (currently unimplemented). If I understood correctly, the gist of it is that we need to dump the contents of the memory_k and memory_v tensors to disk, and load them back, and that would be the same as feeding the model the same prompt again. Choosing a fast compression algorithm would be a good way to mitigate the cost of storing massive tensors on disk.
Another thing I have on my radar is this famous "GPTQ 4bit" quantization. It is well known (I mean, you just need to run a trivial example) that the 4-bit quantization in llama.cpp affects the results of the network. If this GPTQ quantization is capable of keeping the same quality as the f16 version as some sources claim, this would be huge. But I would need to investigate more to be sure.
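To make the prompt-caching item above a bit more concrete, here is a rough sketch of dumping two tensors to disk and loading them back. It assumes the contents of memory_k and memory_v can be treated as flat f32 buffers. `PromptCache`, the length-prefixed layout and the helper functions are made up for illustration and are not an existing llama-rs or llama.cpp format. A real cache would presumably also need to record how many tokens it covers, and compression could be layered on top of the writer.

```rust
use std::io::{self, Read, Write};

/// Hypothetical snapshot of the two cache tensors as flat f32 data.
#[derive(Debug, PartialEq)]
struct PromptCache {
    memory_k: Vec<f32>,
    memory_v: Vec<f32>,
}

/// Writes an element count followed by little-endian f32 values.
fn write_f32s<W: Write>(w: &mut W, data: &[f32]) -> io::Result<()> {
    w.write_all(&(data.len() as u64).to_le_bytes())?;
    for x in data {
        w.write_all(&x.to_le_bytes())?;
    }
    Ok(())
}

/// Reads back one buffer written by `write_f32s`.
fn read_f32s<R: Read>(r: &mut R) -> io::Result<Vec<f32>> {
    let mut len_buf = [0u8; 8];
    r.read_exact(&mut len_buf)?;
    let len = u64::from_le_bytes(len_buf);
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    for _ in 0..len {
        r.read_exact(&mut buf)?;
        out.push(f32::from_le_bytes(buf));
    }
    Ok(out)
}

impl PromptCache {
    fn save<W: Write>(&self, w: &mut W) -> io::Result<()> {
        write_f32s(w, &self.memory_k)?;
        write_f32s(w, &self.memory_v)
    }

    fn load<R: Read>(r: &mut R) -> io::Result<Self> {
        let memory_k = read_f32s(r)?;
        let memory_v = read_f32s(r)?;
        Ok(PromptCache { memory_k, memory_v })
    }
}
```

Because `save` and `load` are generic over `Write` and `Read`, the same code works with a `std::fs::File`, an in-memory `Vec<u8>`, or a compressing wrapper.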

investigate using a Rust-native solution for the tensor manipulation (burn, ndarray, arrayfire, etc) to free it from the ggml dependency

Yup, also considered that, I'd love to do this if possible :) Not sure how long it would take, but none of the tensor operations I ported seem too complicated. The code should be pretty straightforward to port to a different library, and I'm sure some of the Rust options achieve a more ergonomic (and safe!) API. We just have to keep an eye on performance, but having an already working ggml version means we can just benchmark.

If that change also helps us support GPU inference, that'd be pretty cool. But I don't want to add GPU support if that means people having to mess with CUDA or rocm drivers and report all sorts of issues. Unless it's something portable that works out of the box for people, I'm not interested.

I'm not sure how crazy it would be to build a tensor library on top of wgpu compute shaders. Just throwing that out there for anyone who feels crazy and/or adventurous enough :thinking: But that would eventually mean tensors on wasm, which is pretty darn cool I guess?

Anyway, I'm happy to have someone else on board :smile: If there's anything I mentioned above you'd like to take the lead on, please say so! I'm not going to be making any big changes to the code in a few days.

Just something to follow w.r.t. GPTQ quantization :eyes: https://github.com/ggerganov/llama.cpp/issues/9

Also count me in for any future work! I've been obsessed with llama for the past few weeks and getting a solid Rust implementation of a modern machine learning model like this is really impressive. I might try tackling breaking the app into a library in the next few days (unless someone else beats me to it :smile:)

To be fair, I haven't even tested this on Windows. Maybe it builds just fine. But I didn't know what flags to set to compile with AVX and inference times without AVX are pretty bad (if you're telling me there's no multithreading on top of that, it's probably gonna be unusably slow anyway, unfortunately). I don't have a Windows machine to test this, so I didn't want to promise support for untested systems 😅

I've been testing on Windows with my patches applied and it seems to work fine. It's probably not as fast as it could be, but it's plenty fast enough!

I thought about this! Even started with this design. But didn't want to force all the code to use a 'ctx lifetime annotation, since those typically become infectious and are a bit unergonomic. So I did the Arc / Weak thing instead, since the performance of cloning or operating with tensor pointers doesn't really matter. Still, seeing how it turned out in the end, I think the lifetime annotation wouldn't have been so bad 🤔.

Yeah, I actually went back and forth on this. I started without any kind of checking, promptly got owned by accessing freed memory, fixed that, and then bolted on the PhantomData afterwards. Turns out it's not too bad, as you've noticed, because the only place where the actual borrows come up is LlamaModel with reference to the context that's created during the loading process. That's a little annoying, but I worked around it by splitting the load into two so that the model could borrow from the separately-stored context:

https://github.com/philpax/ggllama/blob/7b69eb984dc32f8bcd199eb75484c33f24f9ec1f/src/llama.rs#L157-L169

Ideally, we'd still maintain the same LlamaModel interface to the outside world - but it might get annoyingly self-referential. Will play around with it sometime!
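A stripped-down sketch of the PhantomData approach and the two-step load, for anyone who has not seen the linked code. This is not the ggllama implementation: `Context`, `Tensor`, `Model`, `load_model` and the offset-based storage are hypothetical simplifications (real ggml tensors are raw pointers into the context's memory).

```rust
use std::marker::PhantomData;

/// Hypothetical context that owns the backing buffer for its tensors.
struct Context {
    buffer: Vec<f32>,
}

/// The `'ctx` marker makes the borrow checker reject any tensor that would
/// outlive the context it was created from.
struct Tensor<'ctx> {
    offset: usize,
    len: usize,
    _ctx: PhantomData<&'ctx Context>,
}

impl Context {
    fn new(size: usize) -> Self {
        Context { buffer: vec![0.0; size] }
    }

    /// Hands out a tensor that refers to a region of this context's buffer.
    fn tensor(&self, offset: usize, len: usize) -> Tensor<'_> {
        assert!(offset + len <= self.buffer.len());
        Tensor { offset, len, _ctx: PhantomData }
    }

    /// Resolves a tensor back to its data.
    fn data<'ctx>(&'ctx self, t: &Tensor<'ctx>) -> &'ctx [f32] {
        &self.buffer[t.offset..t.offset + t.len]
    }
}

/// The model borrows from a context that is stored separately.
struct Model<'ctx> {
    weights: Tensor<'ctx>,
}

/// Step two of the load: the caller created `ctx` first and keeps it alive.
fn load_model(ctx: &Context) -> Model<'_> {
    Model { weights: ctx.tensor(0, 4) }
}
```

The caller creates the `Context`, then calls `load_model(&ctx)`. Trying to return the `Model` from a function that also owns the `Context` is what runs into the self-referential problem mentioned above.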

That said, the "bindings" in ggml.rs do not aim to be a safe abstraction, at least not in their current state! We would need to put a bit more thought into it, because the ownership model is a bit weird there. A tensor is tied to a context, but you can then use tensors from one context to operate with tensors from another context, which makes a graph computation on one context access data from other contexts. So not even lifetime annotations would help here.

Huh, you're right - hadn't even thought about that. That's... pretty gnarly. I wonder if it's possible for the operands in a binary operation A + B = C to each have their own lifetimes, such that 'a A + 'b B = 'c C where 'a and 'b outlive 'c? I must admit I've never really delved into that kind of lifetime trickery!
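For what it's worth, that bound can be written down in Rust. The sketch below only shows the signature-level idea with hypothetical `Context` and `Tensor` types. It says nothing about whether it would be enough to make real ggml bindings sound, since ggml also ties the result to the context that allocates it.

```rust
use std::marker::PhantomData;

/// Hypothetical context; a unit struct is enough to show the lifetimes.
struct Context;

struct Tensor<'ctx> {
    value: f32,
    _ctx: PhantomData<&'ctx Context>,
}

impl Context {
    fn tensor(&self, value: f32) -> Tensor<'_> {
        Tensor { value, _ctx: PhantomData }
    }
}

/// `'a: 'c` reads "'a outlives 'c", so the result can only be used while
/// both operand contexts are still alive.
fn add<'a: 'c, 'b: 'c, 'c>(a: Tensor<'a>, b: Tensor<'b>) -> Tensor<'c> {
    Tensor { value: a.value + b.value, _ctx: PhantomData }
}
```

Because `Tensor<'ctx>` is covariant in `'ctx` here, a signature with a single shared lifetime for both operands and the result would accept the same callers. The three-lifetime form just spells the relationship out.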

Library-fication: This needs to happen ASAP. Split the current llama-rs crate into two crates, llama-rs would be a library, and llama-rs-cli would be the simple example CLI app we have now. I don't have much interest in making the CLI experience better (porting things like the interactive mode or terminal colors from llama.cpp), but they're welcome in case someone wants to contribute.

Well... I'd actually started on this before you replied 😂 Here's the PR. I wrote a Discord bot to prove that it works, too. Hell of a thing to run a binary and be able to run a LLM with friends with an evening's work!

Add a server mode, perhaps as an addition to llama-rs-cli that would allow spawning a long-running process that can serve multiple queries. The current usage model doesn't make any sense. You spend a lot of time loading the models from disk (especially if you're using the larger ones) only to throw all that away after a single prompt generation.

Yeah, I've also thought about this. Seems easy enough to do; I'd do it as a separate application just to keep the concerns separate, and to offer up a simple "API server" that anyone can run. My closest point of comparison is the API for the Automatic1111 Stable Diffusion web UI - it's not the best API, but it does prove that all you need to do is offer up a HTTP interface and They Will Come:tm:.
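A toy sketch of the "load once, serve many prompts" shape, leaving HTTP out entirely. `Model`, `Request` and `spawn_server` are hypothetical, and `infer` just echoes the prompt. The point is only that the expensive load happens once on a long-lived worker while requests arrive over a channel.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a loaded model; real loading is the slow part.
struct Model {
    name: String,
}

impl Model {
    fn load(name: &str) -> Self {
        Model { name: name.to_string() }
    }

    /// Placeholder for inference: echoes the prompt instead of generating text.
    fn infer(&self, prompt: &str) -> String {
        format!("[{}] {}", self.name, prompt)
    }
}

/// A request pairs a prompt with a channel on which to send the reply.
struct Request {
    prompt: String,
    reply: mpsc::Sender<String>,
}

/// Loads the model once on a worker thread and serves requests until every
/// sender has been dropped. The thread returns how many requests it served.
fn spawn_server(model_name: &str) -> (mpsc::Sender<Request>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let name = model_name.to_string();
    let handle = thread::spawn(move || {
        let model = Model::load(&name); // paid once, not per prompt
        let mut served = 0;
        for req in rx {
            // Ignore send errors: the client may have gone away.
            let _ = req.reply.send(model.infer(&req.prompt));
            served += 1;
        }
        served
    });
    (tx, handle)
}
```

An HTTP front end would sit in front of the `Sender<Request>` and translate incoming requests into messages on the channel.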

Prompt caching: Another thing that doesn't make sense with the current model is when you have a huge prompt, because you need to feed it through the network every time you want to do inference with that "pre-prompt" + some user input. This is described in an issue in llama.cpp (currently unimplemented). If I understood correctly, the gist of it is that we need to dump the contents of the memory_k and memory_v tensors to disk, and load them back, and that would be the same as feeding the model the same prompt again. Choosing a fast compression algorithm would be a good way to mitigate the cost of storing massive tensors on disk.

I think this could be exposed through the API, but it's not necessarily something that should be part of the API by default. I'd break apart the inference_with_prompt function into easy-to-manipulate steps, so that users could save the state of the LLM at any given moment, and make that easy to do.

That being said, that sounds pretty reasonable to do for both the CLI and/or the API. Either/or could serve as a "batteries-included" example of how to ship something that's consistently fast with the library.

Another thing I have on my radar is this famous "GPTQ 4bit" quantization. It is well known (I mean, you just need to run a trivial example) that the 4-bit quantization in llama.cpp affects the results of the network. If this GPTQ quantization is capable of keeping the same quality as the f16 version as some sources claim, this would be huge. But I would need to investigate more to be sure.

Oh yeah, it's pretty cool. I haven't played around with it much myself, but the folks over at the text-generation-webui have used it to get LLaMA 30B into 24GB VRAM without much quality loss. Seems like it's something that upstream is looking at, though, so I'm content to wait and see what they do first.

Yup, also considered that, I'd love to do this if possible :) Not sure how long it would take, but none of the tensor operations I ported seem too complicated. The code should be pretty straightforward to port to a different library, and I'm sure some of the Rust options achieve a more ergonomic (and safe!) API. We just have to keep an eye on performance, but having an already working ggml version means we can just benchmark.

Yeah, I think most of the existing Rust ML libraries should be able to handle this. I was surprised at how few operations it used while porting it myself! It's certainly much simpler than Stable Diffusion.

If that change also helps us support GPU inference, that'd be pretty cool. But I don't want to add GPU support if that means people having to mess with CUDA or rocm drivers and report all sorts of issues. Unless it's something portable that works out of the box for people, I'm not interested.

Agreed. I have lost far too much of my time trying to set up CUDA + Torch.

I'm not sure how crazy it would be to build a tensor library on top of wgpu compute shaders. Just throwing that out there for anyone who feels crazy and/or adventurous enough 🤔 But that would eventually mean tensors on wasm, which is pretty darn cool I guess?

Check out wonnx! I'm not sure if it can be used independently from ONNX, but it would be super cool to figure out. Worth having a chat with them at some point.

You could also just run the existing CPU inference in WASM, I think - you might have to get a little clever with how you deal with memory, given the 32-bit memory space, but I think it should be totally feasible to run 7B on the web. The only reason I haven't looked into it is because the weights would have to be hosted somewhere 😅

Anyway, I'm happy to have someone else on board 😄 If there's anything I mentioned above you'd like to take the lead on, please say so! I'm not going to be making any big changes to the code in a few days.

I think we're in a pretty good place! I think it's just figuring out what the best "library API" would look like, and building some applications around it to test it. From there, we can figure out next steps / see what other people need.

Actually, I did a bit of porting too lol. But I'm not familiar with C bindings, and I'm not brave enough to port the ggml library, so I just rewrote some code and left it there. It's utils.cpp:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=46f060c6953d228fcb46ea67dea8e8b8

~~and I am also not sure whether it works properly. But it compiles.~~

@noeda of https://github.com/Noeda/rllama might wanna tag along here ☺️

Also, a Tauri-app equivalent to https://github.com/lencx/ChatGPT would pair very well with this. Good task for anyone who wants to be involved but doesn’t quite feel comfortable with the low level internals.

I've been testing on Windows with my patches applied and it seems to work fine. It's probably not as fast as it could be, but it's plenty fast enough!

Glad to hear it! Then I said nothing :) If you're able to test things there, we can aim for good Windows support too then. This is probably going to be something where Rust makes it a lot more simple to get going than the C++ version.

I've never really delved into that kind of lifetime trickery!

Me neither, I'm not even sure it is possible :thinking: But definitely interesting! Still, I'd rather go down the route of replacing ggml with a pure Rust solution than to spend a lot of time building safe bindings to ggml.

Well... I'd actually started on this before you replied :joy:

:heart:

it's not the best API, but it does prove that all you need to do is offer up a HTTP interface and They Will Come :tm:

I think that's a very good point! As for the HTTP interface, one interesting requirement the image generation APIs don't have is that with text, you generally want to stream the outputs.

A good way to do this without complicating the protocol is to use something called "chunked transfer encoding", where the server sends the response one piece at a time, and a compatible client can fetch the results as they come without waiting for the end of the HTTP response. Chunked transfer is a pretty old thing and should be well supported in every HTTP client. I know @darthdeus already did a little proof of concept and this works well.
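To illustrate what goes over the wire, here is a small sketch of HTTP/1.1 chunked framing for a stream of generated tokens. The function names are made up, and this covers only the body encoding. A real server would also send the `Transfer-Encoding: chunked` response header and would normally let an HTTP library do the framing.

```rust
/// Frames one piece of output as an HTTP/1.1 chunk:
/// hexadecimal length, CRLF, payload, CRLF.
fn encode_chunk(data: &[u8]) -> Vec<u8> {
    let mut out = format!("{:x}\r\n", data.len()).into_bytes();
    out.extend_from_slice(data);
    out.extend_from_slice(b"\r\n");
    out
}

/// The zero-length chunk that ends a chunked body (no trailers).
const LAST_CHUNK: &[u8] = b"0\r\n\r\n";

/// Builds a full chunked body from a sequence of generated tokens.
fn encode_stream<'a>(tokens: impl IntoIterator<Item = &'a str>) -> Vec<u8> {
    let mut body = Vec::new();
    for token in tokens {
        // An empty chunk would be read as end-of-body, so skip empty tokens.
        if !token.is_empty() {
            body.extend(encode_chunk(token.as_bytes()));
        }
    }
    body.extend_from_slice(LAST_CHUNK);
    body
}
```

In a streaming server each chunk would be written to the socket as soon as its token is sampled, rather than collected into one buffer as `encode_stream` does here.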

That being said, that sounds pretty reasonable to do for both the CLI and/or the API. Either/or could serve as a "batteries-included" example of how to ship something that's consistently fast with the library.

Yes :) I'm really interested in making this as simple as possible. What we could do on the library side is to have the main inference_with_prompt return some MemoryOut<'model> struct containing several byte slices for the context memory (the lifetime would make sure the refs never outlive the Model). We could also make that same function take an Option<MemoryIn>, which is the same struct but with owned vecs instead of slices, and that would replace the working memory from a pre-computed cache.

By default, callers just pass in None and ignore the result, and that gives them the original experience. So it's up to the caller to manage / store / serialize / whatever this cache if they want to.
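A rough sketch of how that signature could look. Only the names inference_with_prompt, MemoryOut and MemoryIn come from the description above. The field layout, the `to_owned_cache` helper and the fake "inference" that just appends prompt bytes are assumptions made so the flow can be shown end to end.

```rust
/// Owned cache a caller can store and later pass back in (hypothetical layout).
struct MemoryIn {
    memory_k: Vec<u8>,
    memory_v: Vec<u8>,
}

/// Borrowed view of the model's working memory after inference.
struct MemoryOut<'model> {
    memory_k: &'model [u8],
    memory_v: &'model [u8],
}

/// Hypothetical model that only tracks its working memory.
struct Model {
    memory_k: Vec<u8>,
    memory_v: Vec<u8>,
}

impl Model {
    fn new() -> Self {
        Model { memory_k: Vec::new(), memory_v: Vec::new() }
    }

    /// Placeholder inference: "feeding" the prompt appends its bytes to the
    /// working memory. A provided cache replaces the working memory first.
    fn inference_with_prompt(&mut self, prompt: &str, cache: Option<MemoryIn>) -> MemoryOut<'_> {
        if let Some(cache) = cache {
            self.memory_k = cache.memory_k;
            self.memory_v = cache.memory_v;
        }
        self.memory_k.extend_from_slice(prompt.as_bytes());
        self.memory_v.extend_from_slice(prompt.as_bytes());
        MemoryOut { memory_k: &self.memory_k, memory_v: &self.memory_v }
    }
}

impl MemoryOut<'_> {
    /// Copies the borrowed slices so the cache can outlive the model borrow.
    fn to_owned_cache(&self) -> MemoryIn {
        MemoryIn {
            memory_k: self.memory_k.to_vec(),
            memory_v: self.memory_v.to_vec(),
        }
    }
}
```

A caller that does not care about caching passes `None` and drops the returned `MemoryOut`. A caller that does care converts it to an owned `MemoryIn` and decides for itself how to store or serialize it.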

Check out wonnx! I'm not sure if it can be used independently from ONNX, but it would be super cool to figure out. Worth having a chat with them at some point.

Will do!! :eyes: The idea of wgpu tensors is just so appealing in that it basically works anywhere with no driver issues and on any GPU.

I think we're in a pretty good place! I think it's just figuring out what the best "library API" would look like, and building some applications around it to test it. From there, we can figure out next steps / see what other people need.

Sounds good :)

@Noeda of https://github.com/Noeda/rllama might wanna tag along here :relaxed:

Also, a Tauri-app equivalent to https://github.com/lencx/ChatGPT would pair very well with this. Good task for anyone who wants to be involved but doesn’t quite feel comfortable with the low level internals.

Indeed! The more of us working on this, the better :smile:

As a first contact, some benchmarks comparing the ggml here and rllama's OpenCL implementations on CPU would be a good first step to evaluate whether other Rust tensor libraries would fit the bill :)

Howdy :) I am very happy to see LLM stuff picking up in Rust.

rllama is currently a chimera hybrid of 16-bit and 32-bit floats, where 16-bit floats are used in OpenCL and 32-bit floats in operations not involving OpenCL.

As a first contact, some benchmarks comparing the ggml here and rllama's OpenCL implementations on CPU would be a good first step to evaluate whether other Rust tensor libraries would fit the bill :)

Currently, in terms of performance or memory use, rllama is not competitive with any of the ggml stuff. I have no quantization whatsoever.

I just checked my latest commit and on CPU only OpenCL I got 678ms per token. (with GPU, ~230ms). The llama.cpp project mentions in README.md that they are at around 60ms per token which is 4x faster than even my GPU version.

I have two ideas how to collaborate in near future:

Verification of results. I get reasonable text in my implementation but I don't know if it's really done all correctly, especially tokenization. Would our projects get same output if we set top_k=1, and use the same prompt?
Apples-to-apples benchmarking scripts. I currently run a shell script that tests each configuration of rllama (GPU on/off, LLaMA-7B vs LLaMA-13B). It's very ad-hoc.
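One possible shape for the verification half, sketched under the assumption that both projects can emit their generated token ids as plain integer sequences. `first_divergence` is a hypothetical helper, not part of either codebase.

```rust
/// Returns the index of the first position where two token sequences differ,
/// or `None` if they are identical (including in length).
fn first_divergence(a: &[u32], b: &[u32]) -> Option<usize> {
    let mismatch = a.iter().zip(b).position(|(x, y)| x != y);
    match mismatch {
        Some(i) => Some(i),
        // One sequence is a strict prefix of the other.
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}
```

Reporting the index rather than a plain pass/fail makes it easier to tell a tokenization difference (divergence at position 0 of the prompt) from numerical drift that only shows up many tokens in.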

I am currently working on removing more performance bottlenecks, which might improve rllama's performance and memory use, but after that I can offer to make a simple verification + benchmark suite that knows how to run our projects and verify they get the same results. I also wanted to make pretty graphs showing memory or CPU utilization over time etc. Maybe this would go into a new repository. If you have any ideas here, I'm all ears.
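For the timing side, a minimal sketch of the ms-per-token measurement quoted in this thread. The helper names are hypothetical, and `generate` stands in for whatever call actually runs a model and reports how many tokens it produced.

```rust
use std::time::{Duration, Instant};

/// Average milliseconds per generated token, or `None` if nothing was generated.
fn ms_per_token(elapsed: Duration, tokens: usize) -> Option<f64> {
    if tokens == 0 {
        return None;
    }
    Some(elapsed.as_nanos() as f64 / 1e6 / tokens as f64)
}

/// Runs `generate` once and returns the token count it reports together
/// with the wall-clock time the call took.
fn time_generation<F: FnMut() -> usize>(mut generate: F) -> (usize, Duration) {
    let start = Instant::now();
    let tokens = generate();
    (tokens, start.elapsed())
}
```

Model loading and prompt processing should probably be timed separately from generation so the per-token numbers stay comparable between projects.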

Excited for all of us :) 👍

I just checked my latest commit and on CPU only OpenCL I got 678ms per token. (with GPU, ~230ms). The llama.cpp project mentions in README.md that they are at around 60ms per token which is 4x faster than even my GPU version.

I guess it depends on the CPU, but my times for the f16 models are closer to 230ms, so I'd be inclined to say GPU and CPU speed is comparable. This also matches my results from when I tried another gpu implementation. On the quantized models, I do get ~100ms/token.

Verification of results. I get reasonable text in my implementation but I don't know if it's really done all correctly, especially tokenization. Would our projects get same output if we set top_k=1, and use the same prompt?

That's a very good idea :) Other than setting top_k and the same prompt, we would need to make sure rng happens in the exact same way. We're currently using whatever rand's thread_rng gives by default, which is bad for reproducibility. Sampling is done using a rand WeightedIndex, and that's the only time the rng is invoked for each of the sampled tokens:

let dist = WeightedIndex::new(&probs).expect("WeightedIndex error");
let idx = dist.sample(rng);

So my guess is that as long as we're both using the rand crate, results should be comparable.
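One observation, sketched with std only and not taken from either sampler: if top-k filtering runs before the weighted draw, then with top_k=1 a single candidate with weight 1 is left, so the draw can only return that token regardless of the RNG. The `top_k` function below is a hypothetical illustration of that filtering step. What could still differ between implementations is tie-breaking and floating-point differences in the probabilities themselves.

```rust
/// Keeps the `k` most probable tokens and returns `(token_id, probability)`
/// pairs renormalised to sum to 1.
fn top_k(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by descending probability; ties keep the lower token id first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

For any larger k, both projects would need the same seeded RNG and the same sampling algorithm to produce identical token sequences.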

I can offer to make a simple verification + benchmark suite that knows how to run our projects and verify they get the same results

That would be amazing! :)

It's worth noting that quantisation affects both speed and quality, so any benchmarks should be done with the original weights (which will probably limit the maximum size that can be used). Additionally, llama.cpp seems to have some existing bugs around tokenisation and inference at f16.

That is to say - let's get this benchmark on the road, but I think we'll be returning slightly incorrect results until we can address those issues.

I want to share an idea for a project that I want to start implementing after I understand a little about the topic of AI and how everything works. Perhaps something similar already exists in some way, it would be interesting to learn about it.

Project Idea:

Introducing an entertaining and engaging podcast platform where AI hosts converse not only with each other but also with human guests for fun and profit. This concept allows AI to take the lead in creating captivating and lighthearted discussions, asking questions, and maintaining conversations with guests, whether they are AI or human.

Features to Implement:

AI host: The AI will assume the role of a host, responsible for initiating and sustaining interesting and enjoyable conversations with guests (both AI and human).
Text-to-Speech (TTS) integration: AI-generated text will be converted into speech using a TTS program, providing a more natural and engaging listening experience.
Speech-to-Text (STT) integration: Human speech will be converted into text using an STT program, allowing the AI host to interpret and respond to the guest's input.
Infinite conversation loop: The AI-hosted conversations will continue in a potentially endless loop until manually stopped by a behind-the-scenes operator.
AI host and guest models: Utilize AI models based on LLaMA, which can be run locally or accessed via API to servers (e.g., GPT-4, PaLM).
Offline functionality: Ideally, the podcast platform can be run locally on any PC without an internet connection, enabling users to participate without relying on external servers.
Live streaming integrations: Integrate with popular streaming platforms like YouTube and Twitch, allowing users to broadcast their AI-hosted podcasts live to a wider audience.
User-generated content: Enable users to submit their own topics, questions, or themes for AI-hosted conversations, providing a more interactive and personalized experience.
Voice customization: Offer a range of AI-generated voices and accents for the AI host and guests, allowing users to select their preferred voice styles.
Language support: Incorporate multilingual capabilities, enabling the AI host and guests to engage in conversations in various languages and catering to a diverse audience.

This project brings to life my childhood dream of listening to how machines converse with each other and with humans. The time has come to make this vision a reality, at least to some extent!

Project name and short description

Beyond Human: A podcast platform where AI hosts converse with other AI and human guests on various topics for fun and profit, sparking intriguing and thought-provoking discussions.

So far, I have vague ideas of how this can be implemented, but I feel that it must be done :)

Repository: https://github.com/ModPhoenix/beyond-human

rustformers / llm

Let's collaborate #4