NOTE: I used the model I generated with llama.cpp, and there the model was working.
I'm keeping llama2.java 100% compatible with the original llama2.c to preserve the educational value. This means that it only supports the simple .bin format for the weights.
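For reference, a minimal sketch of what parsing that .bin header could look like, assuming the llama2.c v0 layout of seven little-endian int32 config fields followed by float32 weights (the `Config` record below is illustrative, not the actual llama2.java code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the llama2.c checkpoint header: seven little-endian int32 fields.
record Config(int dim, int hiddenDim, int nLayers, int nHeads,
              int nKvHeads, int vocabSize, int seqLen) {

    static Config read(Path checkpoint) throws IOException {
        try (FileChannel ch = FileChannel.open(checkpoint, StandardOpenOption.READ)) {
            ByteBuffer header = ByteBuffer.allocate(7 * Integer.BYTES)
                                          .order(ByteOrder.LITTLE_ENDIAN);
            while (header.hasRemaining()) {
                if (ch.read(header) < 0) throw new IOException("truncated header");
            }
            header.flip();
            // In llama2.c, a negative vocab_size signals an unshared classifier matrix.
            return new Config(header.getInt(), header.getInt(), header.getInt(),
                              header.getInt(), header.getInt(), header.getInt(),
                              header.getInt());
        }
    }
}
```

A .ggml/.gguf file prepared by llama.cpp starts with a magic number instead, so a reader expecting this header will misinterpret the weights.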
I was so curious that, secretly, I wrote an augmented version of llama2.java with additional features. It can read the old .ggml and the new .gguf formats, but also the huggingface (llama) models.
I also implemented Q4_0, Q4_1 and Q8_0 quantization (no k-quant support yet, since I couldn't make the matmuls fast enough). It also supports the CodeLlama models. With Q4_0 it runs at ~9 tokens/s, but quality-wise I prefer Q8_0.
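For context on what Q8_0 means here: a minimal sketch of ggml-style Q8_0 block quantization, assuming the usual block size of 32 (the class and method names below are hypothetical, not taken from the augmented version):

```java
// Q8_0: each block of 32 floats is stored as one float32 scale plus 32 signed bytes.
final class Q8_0 {
    static final int BLOCK_SIZE = 32;

    // Quantize: per block, scale = max(|x|) / 127, q[i] = round(x[i] / scale).
    static void quantize(float[] x, float[] scales, byte[] q) {
        int nBlocks = x.length / BLOCK_SIZE;
        for (int b = 0; b < nBlocks; b++) {
            float amax = 0f;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                amax = Math.max(amax, Math.abs(x[b * BLOCK_SIZE + i]));
            }
            float scale = amax / 127f;
            scales[b] = scale;
            float inv = scale == 0f ? 0f : 1f / scale;
            for (int i = 0; i < BLOCK_SIZE; i++) {
                q[b * BLOCK_SIZE + i] = (byte) Math.round(x[b * BLOCK_SIZE + i] * inv);
            }
        }
    }

    // Dequantize a single weight back to float inside the matmul inner loop.
    static float dequantize(float scale, byte qi) {
        return scale * qi;
    }
}
```

The speed/quality tradeoff above follows from the storage: Q4_0 packs two weights per byte, roughly halving the memory traffic of Q8_0's one byte per weight, at the cost of coarser quantization.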
You can convert the original huggingface Llama models to the .bin format with this script.
Could you share the augmented version? Curious how it works.
I have downloaded the llama 7B version of the model and prepared it as described in llama.cpp.
I then attempted to use the model by executing:
but the execution failed with: