jrudolph / llama2.scala

Inference Llama 2 in Scala with AVX2 kernels in C (A port of llama2.c from Andrej Karpathy)

Figure out how to run official llama2 models #1

Closed · jrudolph closed this issue 1 year ago

jrudolph commented 1 year ago

Even the smallest model, llama2-7b, has a size of ~26 GB (because we only support float32: 7B parameters × 4 bytes). This means the model cannot be memory-mapped as a single buffer, because Java's ByteBuffer API uses int indices and therefore supports at most 2^31 - 1 bytes.

Potential solutions:

* map the file in multiple chunks of less than 2 GB each and dispatch reads to the right ByteBuffer (sketched below)
* use a library with 64-bit offsets, e.g. Chronicle-Bytes
* get a raw address to the mapped memory and read it via sun.misc.Unsafe
* mmap the file via JNA/JNI
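A minimal sketch of the chunked approach (the names and the 1 GiB chunk size are illustrative, not the code from the branch below):

```scala
import java.io.RandomAccessFile
import java.nio.channels.FileChannel
import java.nio.{ByteOrder, MappedByteBuffer}

object ChunkedMmap {
  // 1 GiB per chunk: safely below the 2^31 - 1 limit and a multiple of 4,
  // so 4-byte-aligned float reads never straddle a chunk boundary.
  val ChunkSize: Long = 1L << 30

  def mapFile(path: String): Array[MappedByteBuffer] = {
    val ch = new RandomAccessFile(path, "r").getChannel
    val numChunks = ((ch.size + ChunkSize - 1) / ChunkSize).toInt
    Array.tabulate(numChunks) { i =>
      val off = i.toLong * ChunkSize
      val buf = ch.map(FileChannel.MapMode.READ_ONLY, off, math.min(ChunkSize, ch.size - off))
      buf.order(ByteOrder.LITTLE_ENDIAN) // llama2.c checkpoints are little-endian
      buf
    }
  }

  // Read a float32 at a global byte offset by dispatching to the right chunk.
  def floatAt(chunks: Array[MappedByteBuffer], byteOffset: Long): Float =
    chunks((byteOffset / ChunkSize).toInt).getFloat((byteOffset % ChunkSize).toInt)
}
```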

jrudolph commented 1 year ago

https://github.com/jrudolph/llama2.scala/tree/load-official-llama.bin works for running the llama2 models, but it is quite slow (~10-15 s per token)

plokhotnyuk commented 1 year ago

Please check whether the 63-bit-sized byte buffers and memory-mapped files from https://github.com/OpenHFT/Chronicle-Bytes are suitable for your project
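For reference, a minimal sketch of what that might look like, assuming Chronicle-Bytes' MappedBytes API (the file name and offset are illustrative):

```scala
import java.io.File
import net.openhft.chronicle.bytes.MappedBytes

object ChronicleSketch {
  def main(args: Array[String]): Unit = {
    // MappedBytes addresses the file with long offsets, so files beyond
    // 2 GB are not a problem; chunks are mapped lazily under the hood.
    val bytes = MappedBytes.mappedBytes(new File("llama2-7b.bin"), 1L << 30)
    try {
      // Random-access read of a float32 at an arbitrary 64-bit byte offset.
      val w = bytes.readFloat(8L * 1024 * 1024 * 1024)
      println(w)
    } finally bytes.releaseLast()
  }
}
```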

jrudolph commented 1 year ago

Thanks for the suggestion. In the branch above I already found a good-enough solution for now. Chronicle-Bytes seems interesting but is a bit too heavyweight for this purpose.

jrudolph commented 1 year ago

A shortcut to using the same technique as Chronicle-Core would be to use the same hack to get a raw memory address for the mapped memory, or to use something like JNA to mmap the file. To access the data, Unsafe.getFloat can be used (it is an intrinsic in the JVM).
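A minimal sketch of the Unsafe route, assuming a region that fits in a single MappedByteBuffer (for the full ~26 GB file one would map several regions, or call the JDK-internal mapping code directly, which is what Chronicle-Core does). The file name and offset are illustrative, and on JDK 9+ the reflective access to Buffer.address needs `--add-opens java.base/java.nio=ALL-UNNAMED`:

```scala
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

object UnsafeMmapSketch {
  // Grab sun.misc.Unsafe through its private static field.
  private val unsafe: sun.misc.Unsafe = {
    val f = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  def main(args: Array[String]): Unit = {
    val ch = new RandomAccessFile("llama2-7b.bin", "r").getChannel
    // Map one region (the classic API caps a single mapping at ~2 GB).
    val len = math.min(ch.size, Int.MaxValue.toLong)
    val buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len)
    // The private Buffer.address field holds the raw pointer to the mapping.
    val addrField = classOf[java.nio.Buffer].getDeclaredField("address")
    addrField.setAccessible(true)
    val base = addrField.getLong(buf)
    // Unsafe.getFloat(long) is intrinsified to a plain load from that address.
    val w = unsafe.getFloat(base + 4L * 123)
    println(w)
  }
}
```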

jrudolph commented 1 year ago

Works now in all configurations.