jrudolph / llama2.scala

Inference Llama 2 in Scala with AVX2 kernels in C (A port of llama2.c from Andrej Karpathy)

q8 quantization (SIMD / AVX2) #10

Closed by jrudolph 1 year ago

jrudolph commented 1 year ago

Combination of q8 quantization (#7) and AVX2 optimizations (#9).

Runs llama-7b at ~2 tokens per second (q8 alone: 0.45 tokens per second), roughly a 4x speedup.
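For readers unfamiliar with the technique, here is a minimal scalar sketch of a q8 dot product in C. It is not the code from this PR. The block size `QK`, the `BlockQ8` layout, and the function names `quantize_q8` and `dot_q8` are assumptions made for illustration, following a common symmetric per-block scheme (one float scale plus 32 signed bytes). An AVX2 kernel would typically vectorize the inner integer multiply-accumulate loop of `dot_q8`.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* values per block: an assumption for this sketch */

/* Hypothetical block layout: one scale shared by QK signed bytes. */
typedef struct {
    float scale;  /* dequantized value = scale * q[i] */
    int8_t q[QK]; /* quantized values in [-127, 127] */
} BlockQ8;

/* Quantize n floats into n / QK blocks; n must be a multiple of QK. */
void quantize_q8(const float *x, BlockQ8 *out, int n) {
    assert(n % QK == 0);
    for (int b = 0; b < n / QK; b++) {
        const float *src = x + b * QK;

        /* Largest magnitude in the block determines the scale. */
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            float a = src[i] < 0.0f ? -src[i] : src[i];
            if (a > amax) amax = a;
        }
        float scale = amax / 127.0f;
        float inv = scale != 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;

        /* Round to nearest, ties away from zero. */
        for (int i = 0; i < QK; i++) {
            float v = src[i] * inv;
            out[b].q[i] = (int8_t)(v + (v >= 0.0f ? 0.5f : -0.5f));
        }
    }
}

/* Dot product of two quantized vectors holding n values each. */
float dot_q8(const BlockQ8 *a, const BlockQ8 *b, int n) {
    assert(n % QK == 0);
    float sum = 0.0f;
    for (int blk = 0; blk < n / QK; blk++) {
        /* Integer accumulation: the part a SIMD kernel would vectorize. */
        int32_t acc = 0;
        for (int i = 0; i < QK; i++)
            acc += (int32_t)a[blk].q[i] * (int32_t)b[blk].q[i];
        /* Apply both block scales once per block. */
        sum += a[blk].scale * b[blk].scale * (float)acc;
    }
    return sum;
}
```

In this sketch the weights are quantized once up front and the activations are quantized per matrix-vector product, so the hot loop works on 8-bit integers and touches the float scales only once per block.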