Open brittlewis12 opened 1 month ago
On our end, we've stopped using the llama-cpp-rs sampling API; I only kept it around because removing it was more work.
The closer we can get to safe, direct bindings to upstream, the happier I am; ergonomics is secondary. I like the direction the PR is going (wrapping the llama.cpp structs and their calls).
If you want to get it into a minimal state that you would be happy to use, I can do a more detailed review (it looks pretty close). No need to cover 100% of the API - just whatever you find useful.
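To make "safe direct bindings" concrete, here is a minimal sketch of the wrapping pattern: a Rust struct owns the raw pointer, exposes the C calls as safe methods, and frees the handle on drop. Everything in `mock_sys` (`RawSampler`, `raw_sampler_init`, `raw_sampler_top_k`, `raw_sampler_free`) is a hypothetical stand-in written in Rust so the example compiles on its own; none of it is the real llama.cpp or llama-cpp-rs API.

```rust
/// Hypothetical stand-in for generated bindings. A real crate would declare
/// these in an `extern "C"` block; they are implemented in Rust here only so
/// the sketch compiles without linking a C library.
mod mock_sys {
    pub struct RawSampler {
        pub top_k: i32,
    }

    /// Allocate a sampler and hand ownership to the caller as a raw pointer.
    pub unsafe fn raw_sampler_init(top_k: i32) -> *mut RawSampler {
        Box::into_raw(Box::new(RawSampler { top_k }))
    }

    /// Read a field through the raw pointer.
    pub unsafe fn raw_sampler_top_k(s: *const RawSampler) -> i32 {
        unsafe { (*s).top_k }
    }

    /// Release a sampler previously returned by `raw_sampler_init`.
    pub unsafe fn raw_sampler_free(s: *mut RawSampler) {
        unsafe { drop(Box::from_raw(s)) }
    }
}

use std::ptr::NonNull;

/// Safe owner of one raw sampler handle (hypothetical name).
pub struct SafeSampler {
    raw: NonNull<mock_sys::RawSampler>,
}

impl SafeSampler {
    /// Returns `None` if the underlying init call hands back a null pointer.
    pub fn new(top_k: i32) -> Option<Self> {
        let ptr = unsafe { mock_sys::raw_sampler_init(top_k) };
        NonNull::new(ptr).map(|raw| Self { raw })
    }

    /// Thin one-to-one wrapper over the underlying call.
    pub fn top_k(&self) -> i32 {
        // Sound because `raw` stays valid for as long as `self` exists.
        unsafe { mock_sys::raw_sampler_top_k(self.raw.as_ptr()) }
    }
}

impl Drop for SafeSampler {
    fn drop(&mut self) {
        // The handle is freed exactly once, when the wrapper goes out of scope.
        unsafe { mock_sys::raw_sampler_free(self.raw.as_ptr()) }
    }
}
```

The wrapper adds no behaviour of its own, which keeps it close to upstream: each method maps to one underlying call, and the only thing Rust contributes is ownership and a null check.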
llama.cpp made a breaking refactor of its sampling API, moving from stateless sampling functions called inside the generation loop to a stateful sampler chain that is set up up front and can be manipulated from there.
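For readers who haven't looked at the new design, the shape of a stateful chain can be sketched in plain Rust roughly as follows. This is a toy model, not the llama.cpp API or the code on my branch: `Candidate`, `Sampler`, `TopK`, `Temperature`, `RepeatPenalty` and `SamplerChain` are all hypothetical names, the penalty is a simple subtraction, and the final pick is greedy to keep it deterministic.

```rust
/// Hypothetical stand-in for one candidate token and its logit.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub id: u32,
    pub logit: f32,
}

/// One stage of the chain: it may rescale, reorder, or truncate candidates.
pub trait Sampler {
    fn apply(&mut self, candidates: &mut Vec<Candidate>);
    /// Called after a token is chosen so stateful stages can update.
    fn accept(&mut self, _token: u32) {}
}

/// Keep only the `k` highest-logit candidates.
pub struct TopK(pub usize);

impl Sampler for TopK {
    fn apply(&mut self, candidates: &mut Vec<Candidate>) {
        candidates.sort_by(|a, b| b.logit.total_cmp(&a.logit));
        candidates.truncate(self.0);
    }
}

/// Divide every logit by a fixed temperature.
pub struct Temperature(pub f32);

impl Sampler for Temperature {
    fn apply(&mut self, candidates: &mut Vec<Candidate>) {
        for c in candidates.iter_mut() {
            c.logit /= self.0;
        }
    }
}

/// Subtract a fixed penalty from tokens that were already generated.
/// This stage carries state between calls, which a stateless API cannot do.
pub struct RepeatPenalty {
    pub penalty: f32,
    pub seen: Vec<u32>,
}

impl Sampler for RepeatPenalty {
    fn apply(&mut self, candidates: &mut Vec<Candidate>) {
        for c in candidates.iter_mut() {
            if self.seen.contains(&c.id) {
                c.logit -= self.penalty;
            }
        }
    }

    fn accept(&mut self, token: u32) {
        self.seen.push(token);
    }
}

/// The chain is configured once, then reused for every generated token.
pub struct SamplerChain {
    stages: Vec<Box<dyn Sampler>>,
}

impl SamplerChain {
    pub fn new() -> Self {
        Self { stages: Vec::new() }
    }

    /// Append a stage; stages run in the order they were added.
    pub fn add(mut self, stage: impl Sampler + 'static) -> Self {
        self.stages.push(Box::new(stage));
        self
    }

    /// Run every stage, pick the highest-logit survivor, and notify stages.
    pub fn sample(&mut self, mut candidates: Vec<Candidate>) -> Option<u32> {
        for stage in self.stages.iter_mut() {
            stage.apply(&mut candidates);
        }
        let token = candidates
            .iter()
            .max_by(|a, b| a.logit.total_cmp(&b.logit))
            .map(|c| c.id)?;
        for stage in self.stages.iter_mut() {
            stage.accept(token);
        }
        Some(token)
    }
}
```

In this model the generation loop only calls `chain.sample(...)` once per token. The configuration, and any history such as the penalty's `seen` list, lives in the chain rather than being passed to a free function on every iteration.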
I took a stab at spiking out how integrating the new API into llama-cpp-rs could look, but it's far from releasable. I stopped once I got the simple example running successfully.
Would love your perspective on approaches to unifying the existing sampling interfaces exposed by the crate, what to do about the llama token data array sampling functions, and how to handle grammar sampling, as well as feedback on the Rust seams over llama.cpp's latest sampling APIs.
Once we're on the same page, I can flesh out the last missing pieces, like sampler timings and any higher-level abstractions over sampling parameters. It seems the library doesn't expose any sampling parameter structs directly any more :/
I have these changes up as a branch on my fork, but if you'd prefer, I'm more than happy to collaborate on a proper draft pull request.