marcom / Llama.jl

Julia interface to llama.cpp, a C/C++ library for running language models

Simplify Initial Setup for LLM Newcomers Using llama.jl Package #2

Open svilupp opened 7 months ago

svilupp commented 7 months ago

Thank you for creating this wrapper! I was about to do it myself -- glad I noticed the link in the past jll PRs :)

I wanted to test an idea with you.

People can use llama.cpp directly if they want low-level control, but that requires knowledge of prompt templates, compiling the source code, etc.

What if this package served as a Julia-only entry point for running an LLM on your laptop? I.e., no need to install anything else; we'd provide a turnkey solution to get started.

Ollama is awesome and super user-friendly, but it's a separate application to download and has its own limitations (e.g., some performance issues, ngl defaults, etc.). There are many others (ooba, ..), but they are all separate tools to install...

What do you think? I'm happy to draft a PR.

Objective: Enhance the onboarding experience for first-time LLM users with Llama.jl by simplifying the initial setup process. Just: using Llama; run_server()
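
For illustration, this is roughly the whole user journey I have in mind (the function names and the model URL below are just a sketch of the proposal, not a final API):

```julia
using Llama

# One-off: fetch a small GGUF model
# (helper name and URL are illustrative placeholders)
model = download_model(
    "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf")

# Start a local llama.cpp server with sensible defaults and point
# your favourite client (e.g. PromptingTools.jl) at it
run_server(; model)
```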

Proposal:

Benefits:

Feedback and suggestions are welcome.

Disclaimer: I'm an author of https://github.com/svilupp/PromptingTools.jl, so I'd leverage the API from there and deepen the integration.

EDIT: I have other goals/aspirations for the llama.cpp JLL (e.g., data extraction with the grammar), but I think we should first simplify the setup for users.

marcom commented 7 months ago

Hi @svilupp , many thanks for your PR. I am super excited that someone else is interested in this package!

This package is in a bit of a neglected state, and I have been busy with other things (thesis writing...). It is a bit late here tonight, so I'll read your comments and give you a longer answer tomorrow.

It looks like you have made lots of interesting packages and real progress toward making LLMs available in Julia. Maybe you want to be a co-maintainer of this package as well? If so, I'll try to figure out how to enable this tomorrow.

I have bumped the version of Llama.jl to 0.2.0 with your PR included. Unfortunately there were some errors when regenerating the bindings; the unfinished PR is in #3 for now. I'll have a look at it tomorrow.

svilupp commented 7 months ago

Thank you. Yeah, happy to join forces!

My thoughts for llama.jl would be:

In terms of llama.cpp features that I’m keen on, it’s

I can see that the open-source models tend to struggle with Julia and every once in a while try to squeeze some Python in, so it would be nice to add some Julia-specific fine-tuning. I'm naively hoping that within the next year we will manage to jointly build a dataset to do that (that's why I started the LLM benchmark).

In general, I’d love to build some tooling in Julia to boost documentation search with RAG, have agents for writing your unit tests and docstrings, etc.

The smaller the community, the more we should look for ways to automate the non-core activities.

Any thoughts? (In particular on the immediate steps with llama.jl)

marcom commented 7 months ago

Thanks for all the PRs!

I think this sounds like a good plan. Here are some of my thoughts:

Server interface vs C interface, exports

As you said, the server interface is probably the easiest way to interact with llama.cpp now, and it hopefully won't undergo as many changes as the C interface has. The low-level C interface here in Llama.jl currently doesn't work anymore, as too many things have changed. I don't have the bandwidth to fix this right now, so it might stay that way for a while longer; perhaps llama.cpp's C interface will eventually stabilize. So I agree that it's probably best not to export this low-level functionality for the time being.
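
For completeness, here is roughly what talking to the server from Julia looks like, using HTTP.jl and JSON3.jl (this assumes a llama.cpp server is already running locally on its default port 8080 and exposes the /completion endpoint):

```julia
using HTTP, JSON3

# Assumes a llama.cpp server is already running on localhost:8080.
body = JSON3.write((prompt = "Write a hello-world function in Julia.",
                    n_predict = 128))

resp = HTTP.post("http://localhost:8080/completion",
                 ["Content-Type" => "application/json"],
                 body)

# The server returns JSON with the generated text in the `content` field.
println(JSON3.read(resp.body).content)
```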

Model downloads

Making the initial model download as easy as possible is a great idea. AFAIK it's unfortunately not possible at the moment to declare single-file artifacts in Julia, see https://github.com/JuliaLang/Pkg.jl/issues/2764. I had a longish discussion on Slack with the Pkg.jl devs, but there seemed to be little enthusiasm for single-file artifact support. The alternative of creating a new tarball for each model is unrealistic, as you said. I would try to avoid shipping models via artifacts as much as possible and simply rely on HuggingFace; the models tend to become outdated very fast at the moment.

Maybe these packages can be helpful, though I haven't tried them:

- https://github.com/FluxML/HuggingFaceApi.jl
- https://github.com/cjdoris/HuggingFaceHub.jl
- https://github.com/oxinabox/DataDeps.jl

My feeling is that I would avoid hardcoding any default models in run_server, as the models change quite quickly and the default would then become part of the interface. Running download_model once shouldn't be too bad, and the README or docs can suggest some good models that the user can simply copy and paste.
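
For what it's worth, a plain Downloads.jl-based helper might be all that's needed under the hood. A rough sketch (the cache directory and the model URL below are placeholders, not an agreed design):

```julia
using Downloads

# Rough sketch of what a download_model helper could do internally.
# The cache location and URL are placeholders, not an agreed design.
function download_model(url::AbstractString;
                        dir = joinpath(homedir(), ".cache", "llama_jl"))
    mkpath(dir)
    path = joinpath(dir, basename(url))
    # Download once, reuse on subsequent calls.
    isfile(path) || Downloads.download(url, path)
    return path
end

model_path = download_model(
    "https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf")
```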

Applications

Integration of LLMs into the REPL, VS Code, grammar-guided sampling, etc. sound like great ideas. A Julia-specific LoRA or RLHF (or similar) model would be a great step forward for using these models effectively with Julia. Maybe there could be a feedback mechanism to rate answers, such as in https://chat.lmsys.org/, in your Julia LLM leaderboard, or the REPL/VS Code integration could have a thumbs-up/thumbs-down feedback mechanism to gather data for fine-tuning.