pinokiocomputer / llamanet

Replace OpenAI with Llama.cpp Automagically.
https://llamanet.netlify.app
MIT License

Move to `LLAMA_CACHE` 🤗 #1

Open · Vaibhavs10 opened this issue 1 month ago

Vaibhavs10 commented 1 month ago

Hi @cocktailpeanut,

Big fan of your work, and I love all that you're doing to democratise ML. Congratulations on llamanet; it looks rad!

I saw that you are creating your own cache directory, llamanet, and persisting models there (correct me if I'm wrong). We recently upstreamed changes to llama.cpp that let you download and cache models directly from the Hugging Face Hub (note: for this you'd need to compile the server with LLAMA_CURL=1).

With curl support, all you'd need to do is pass --hf-repo & --hf-file, and the model checkpoint would automatically be downloaded and cached in LLAMA_CACHE (ref).

This would make it easier for people to reuse already-cached model checkpoints, and you'd also benefit automatically from any improvements we make to the overall caching system.

AFAICT, you should be able to benefit from this directly by changing this line: https://github.com/pinokiocomputer/llamanet/blob/16fc9521f97549c657f80ce51c5c7a787eac4e8d/llamacpp.js#L20
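For illustration, something roughly like this (a sketch only; the wrapper function, paths, and example model names are made up, and it assumes a server binary built with LLAMA_CURL=1):

```javascript
// Hypothetical sketch: spawn the llama.cpp server and let it fetch + cache the
// model itself via --hf-repo / --hf-file (requires a build with LLAMA_CURL=1).
const { spawn } = require("child_process");
const path = require("path");
const os = require("os");

function startServer({ hfRepo, hfFile, port = 8080 }) {
  return spawn(
    "llama-server",             // the binary may be named "server" on older releases
    [
      "--hf-repo", hfRepo,      // e.g. "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
      "--hf-file", hfFile,      // e.g. "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
      "--port", String(port),
    ],
    {
      env: {
        ...process.env,
        // Downloads land in LLAMA_CACHE instead of a llamanet-specific folder.
        LLAMA_CACHE: path.join(os.homedir(), ".cache", "llama.cpp"),
      },
      stdio: "inherit",
    }
  );
}
```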

Let me know what you think! VB

cocktailpeanut commented 1 month ago

@Vaibhavs10 Yes, I actually loved that --hf-repo feature, and it was the approach I was using early on, but I eventually had to comment it out (see: https://github.com/pinokiocomputer/llamanet/blob/main/llamacpp.js#L63-L65) and download the models manually instead.

The reason is exactly what you mentioned: the prebuilt binaries on the releases page are not compiled with the LLAMA_CURL=1 option, so to get this working with --hf-repo I would have to build the binaries myself on the fly and make that work on every platform.

I did try this approach with my past project Dalai (https://github.com/cocktailpeanut/dalai), where I did everything programmatically, from running quantization to running cmake. It worked in most cases, but the problem is always the edge cases, where the cmake commands would fail for some reason. It was too messy to do all of this through the library, so this time around I wanted to avoid it as much as possible, which is why I'm downloading the prebuilt binaries from the releases page instead.
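Just to illustrate, "building on the fly" would mean the library shelling out to something roughly like this on every user's machine (a sketch only; the exact cmake flags and target names depend on the llama.cpp version):

```javascript
// Rough sketch of the build-on-the-fly approach I'm trying to avoid:
// clone llama.cpp and compile the server with curl support enabled.
const { execSync } = require("child_process");

function buildServerWithCurl(dir) {
  execSync(`git clone https://github.com/ggerganov/llama.cpp ${dir}`, { stdio: "inherit" });
  // Needs a compiler toolchain plus curl dev headers on every platform,
  // which is exactly the kind of edge-case minefield Dalai ran into.
  execSync("cmake -B build -DLLAMA_CURL=ON", { cwd: dir, stdio: "inherit" });
  // The target is "llama-server" on recent versions ("server" on older tags).
  execSync("cmake --build build --config Release --target llama-server", { cwd: dir, stdio: "inherit" });
}
```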

So at the moment I don't really have a way to use --hf-repo, because I'd rather avoid running cmake from within the library.

That said, I was made aware of huggingface.js yesterday and I'm looking into it. Using that would give us the same benefits, right?

EDIT: I just looked through the huggingface.js docs and source, and it doesn't seem to do what I thought it did; it appears to be designed to work in the browser, with no access to the file system.

Vaibhavs10 commented 1 month ago

Aha! That makes sense, and it's good feedback too! Let me see what can be done about this! 🤗

Perhaps we can upstream this change to llama.cpp.