tjake / Jlama

Jlama is a modern LLM inference engine for Java
Apache License 2.0

GGUF Support #36

Open vaiju1981 opened 4 months ago

vaiju1981 commented 4 months ago

Is there any plan to support the GGUF format directly, apart from SafeTensors? That would allow Jlama to load other GGUF models. If support already exists, can we add it to the README file?

tjake commented 4 months ago

I could support some of the quantization types. Is that the main reason vs safetensors?

vaiju1981 commented 4 months ago

The main reason is that GGUF files are small (compared to SafeTensors), which makes our testing/usage easier. Apart from that, the different quantizations.

Currently we are using a different library (deepjavalibrary) to load GGUF models via LlamaEngine. Native Jlama support would make this easier.

tjake commented 4 months ago

Hmm, can you give me an example? The GGUF and SafeTensors versions of the same model with the same quantization are pretty much the same size. Maybe they changed GGUF since I last looked.

vaiju1981 commented 4 months ago

By small size, I mean the download from HuggingFace repos: downloading a quantized GGUF model versus downloading the SafeTensors files.

With GGUF, the vocabulary and other things such as prompt templates are part of the same file and don't need to be downloaded separately.
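One practical consequence of GGUF's single-file design is that the header and embedded metadata can be inspected without any sidecar files. As a rough illustration (a sketch, not Jlama or llama.cpp code), the fixed GGUF v3 header fields (magic "GGUF", version, tensor count, metadata key/value count, all little-endian) can be parsed like this in Java:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class GgufHeader {
    // Parses the fixed-size GGUF header: magic "GGUF", uint32 version,
    // uint64 tensor count, uint64 metadata key/value count (little-endian).
    static long[] parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        buf.get(magic);
        if (!new String(magic, StandardCharsets.US_ASCII).equals("GGUF")) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        long version = Integer.toUnsignedLong(buf.getInt());
        long tensorCount = buf.getLong();
        long metadataKvCount = buf.getLong();
        return new long[] {version, tensorCount, metadataKvCount};
    }

    public static void main(String[] args) {
        // Build a synthetic header in memory for demonstration.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.put("GGUF".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(3);     // version 3
        buf.putLong(291L); // tensor count
        buf.putLong(25L);  // metadata key/value count
        buf.flip();
        long[] h = parse(buf);
        System.out.println("version=" + h[0] + " tensors=" + h[1] + " kvs=" + h[2]);
    }
}
```

The metadata key/value section that follows the header is where the vocabulary and chat template live, which is why no separate tokenizer files are needed.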

tjake commented 4 months ago

Jlama does the downloading for you. It only needs 4 of the files.

tjake commented 4 months ago

If there are models you would like me to quantize and upload please request here https://github.com/tjake/Jlama/discussions/37

mbaudier commented 2 months ago

Since GGUF is the format used by llama.cpp (the most widely used native tool for running models locally), one tends to download the GGUF files first in order to quickly test a model with llama-cli.

Having two formats therefore leads to duplication, for example:

$ du -sh Meta-Llama-3.1-8B-Instruct-*
6.1G    Meta-Llama-3.1-8B-Instruct-Jlama-Q4
4.3G    Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

Apparently someone has already written a GGUF interpreter in Java: llama3.java. Since it supports only Llama 3.x it is probably not complete, but maybe it could be a starting point?

tjake commented 2 months ago

Hi,

Yeah, I saw that and will consider adding it, but there are a couple of issues.

Since this is a solo project I need to weigh the burden of supporting both formats. It may make the most sense to switch from SafeTensors over to GGUF, but tool support and better distributed inference are higher priorities for me.

Some issues with GGUF vs SafeTensors:

  1. GGUF uses a columnar layout while SafeTensors uses a row layout; supporting both layouts is not great.
  2. GGUF often uses the Q_K quantizations, which means many more types to support across different CPUs and possibly GPUs.
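Item 1 can be made concrete: the same matrix stored row-major versus column-major needs different index arithmetic, and an engine supporting both layouts has to carry that dual bookkeeping through every kernel. A minimal sketch (illustrative only, not Jlama code):

```java
public class LayoutDemo {
    // Row-major: elements of a row are contiguous (SafeTensors-style).
    static float rowMajorGet(float[] data, int cols, int r, int c) {
        return data[r * cols + c];
    }

    // Column-major: elements of a column are contiguous.
    static float colMajorGet(float[] data, int rows, int r, int c) {
        return data[c * rows + r];
    }

    public static void main(String[] args) {
        // The 2x3 matrix [[1,2,3],[4,5,6]] in both storage orders.
        float[] rowMajor = {1, 2, 3, 4, 5, 6};
        float[] colMajor = {1, 4, 2, 5, 3, 6};
        // Same logical element (row 1, col 2), different index math.
        System.out.println(rowMajorGet(rowMajor, 3, 1, 2)); // 6.0
        System.out.println(colMajorGet(colMajor, 2, 1, 2)); // 6.0
    }
}
```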

So overall I concede GGUF support would be cool, but just not ATM.
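For a feel of what supporting additional quantization types entails, here is a sketch (assumptions: the llama.cpp nibble convention, and a plain float scale standing in for the format's fp16 scale) of dequantizing one Q4_0 block, the simplest GGUF quant format: a per-block scale plus 16 bytes of packed 4-bit values, with weight = scale * (q - 8). The Q_K formats layer superblocks and per-group scales on top of this, one reason each extra type is real work per CPU/GPU backend:

```java
public class Q40Demo {
    static final int BLOCK = 32; // Q4_0 quantizes weights in blocks of 32

    // Dequantizes one Q4_0 block. In the llama.cpp convention the low
    // nibble of byte j holds weight j and the high nibble holds weight
    // j + 16; each 4-bit value q maps to scale * (q - 8).
    static float[] dequantize(float scale, byte[] qs) {
        float[] out = new float[BLOCK];
        for (int j = 0; j < BLOCK / 2; j++) {
            out[j] = scale * ((qs[j] & 0x0F) - 8);
            out[j + BLOCK / 2] = scale * (((qs[j] >> 4) & 0x0F) - 8);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] qs = new byte[16];
        qs[0] = (byte) 0xF9; // low nibble 9 -> weight 0; high nibble 15 -> weight 16
        float[] w = dequantize(0.5f, qs);
        System.out.println(w[0]);  // 0.5 * (9 - 8)  = 0.5
        System.out.println(w[16]); // 0.5 * (15 - 8) = 3.5
        System.out.println(w[1]);  // 0.5 * (0 - 8)  = -4.0
    }
}
```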

mbaudier commented 2 months ago

Thanks for the details! It sounds perfectly reasonable not to prioritise it. Moreover, SafeTensors support in Java has the added benefit that the original models tend to be in this format.