mukel / llama2.java

Inference Llama 2 in one file of pure Java
MIT License

Exception in thread "main" java.lang.ArithmeticException: / by zero #4

Closed msche closed 1 year ago

msche commented 1 year ago

I downloaded the Llama 7B model and prepared it as described in llama.cpp:

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/

  # [Optional] for models using BPE tokenizers
  python convert.py models/7B/ --vocabtype bpe

# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0

# update the gguf filetype to current if older version is unsupported by another application
./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY

I then attempted to use the model by executing:

java --enable-preview --add-modules=jdk.incubator.vector Llama2 ./models/7B/ggml-model-q4_0-v2.gguf -n 128

but the execution fails with:

WARNING: Using incubator modules: jdk.incubator.vector
Exception in thread "main" java.lang.ArithmeticException: / by zero
        at Config.<init>(Llama2.java:52)
        at Transformer.<init>(Llama2.java:185)
        at Llama2.main(Llama2.java:1000)
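For diagnosis: llama2.java expects the llama2.c .bin header, which begins with seven little-endian int32 config fields, while a GGUF file begins with the ASCII magic "GGUF", a version, and two 64-bit counters. The parser therefore reads those bytes as config values, and a later division (e.g. dim / n_heads) can hit zero. Below is a hedged sketch that dumps what the header parser would see; the field names follow llama2.c's Config struct, not necessarily the exact llama2.java source.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Path;

// Hedged sketch: dump the first seven little-endian int32 values of a
// checkpoint file, i.e. what a llama2.c-style .bin header parser sees.
// On a GGUF file the first int is 0x46554747 (ASCII "GGUF"), and the
// high half of the 64-bit tensor count lands in a later slot as 0,
// which is consistent with the "/ by zero" above.
public class PeekHeader {
    public static void main(String[] args) throws Exception {
        try (var in = Files.newInputStream(Path.of(args[0]))) {
            ByteBuffer header = ByteBuffer.wrap(in.readNBytes(28)).order(ByteOrder.LITTLE_ENDIAN);
            String[] fields = {"dim", "hidden_dim", "n_layers", "n_heads", "n_kv_heads", "vocab_size", "seq_len"};
            for (String field : fields) {
                System.out.printf("%-10s = %d%n", field, header.getInt());
            }
        }
    }
}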
msche commented 1 year ago

NOTE: I used the model I generated with llama.cpp, and the model works there.

mukel commented 1 year ago

I'm keeping llama2.java 100% compatible with the original llama2.c to preserve the educational value. This means that it only supports the simple .bin format for the weights.
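A cheap way to make this failure clearer would be to check the magic bytes before parsing the header; here is a hypothetical guard, not code from llama2.java:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard (not in llama2.java): GGUF files start with the
// ASCII magic "GGUF", so reject them up front instead of letting the
// .bin header parse fail later with an ArithmeticException.
final class CheckpointGuard {
    static void requireLlama2Bin(Path checkpoint) throws IOException {
        try (InputStream in = Files.newInputStream(checkpoint)) {
            String magic = new String(in.readNBytes(4), StandardCharsets.US_ASCII);
            if (magic.equals("GGUF")) {
                throw new IllegalArgumentException(
                        checkpoint + " is a GGUF file; llama2.java only reads the llama2.c .bin format");
            }
        }
    }
}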

I was so curious that, secretly, I wrote an augmented version of llama2.java with additional features. It can read the old .ggml and the new .gguf formats, as well as the Hugging Face (Llama) models. I also implemented Q4_0, Q4_1 and Q8_0 quantization (no k-quant support yet, since I couldn't make the matmuls fast enough). It also supports the CodeLlama models. With Q4_0 it runs at ~9 tokens/s, but quality-wise I prefer Q8_0.
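For readers unfamiliar with the formats named above: Q4_0 stores weights in blocks of 32 with a single fp16 scale and packed 4-bit quants. A hedged sketch of dequantizing one block, following ggml's published Q4_0 layout rather than the augmented version's actual code:

// Hedged sketch of ggml-style Q4_0 dequantization: each 18-byte block
// holds one little-endian fp16 scale followed by 16 bytes, each byte
// packing two 4-bit quants. Dequantized weight = (quant - 8) * scale.
// Not code from llama2.java or its augmented version.
public class Q4_0Block {
    static final int BLOCK_SIZE = 32;

    static void dequantize(byte[] block, float[] out) {
        // bytes 0..1: little-endian fp16 scale
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float scale = Float.float16ToFloat(bits); // Java 20+; older JDKs need a manual fp16 conversion
        for (int i = 0; i < 16; i++) {
            int b = block[2 + i] & 0xFF;
            out[i]      = ((b & 0x0F) - 8) * scale; // low nibbles fill the first half of the block
            out[i + 16] = ((b >>> 4)  - 8) * scale; // high nibbles fill the second half
        }
    }
}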

You can convert the original Hugging Face Llama models to the .bin format with this script.

msche commented 1 year ago

Could you share the augmented version? I'm curious how it works.