Closed jrudolph closed 1 year ago
https://github.com/jrudolph/llama2.scala/tree/load-official-llama.bin works for running the llama2 models, but it is quite slow (~10-15s per token).
Please check whether the 63-bit-sized byte buffers and memory-mapped files from https://github.com/OpenHFT/Chronicle-Bytes are suitable for your project.
Thanks for the suggestion. In the branch above I already found a good-enough solution for now. Chronicle-Bytes seems interesting but is a bit too heavyweight for this purpose.
A shortcut to using the same technique as Chronicle-Core would be to use the same hack to get a raw memory address for the mapped memory, or to use something like JNA to mmap the file. To access the data, Unsafe.getFloat can be used (which is an intrinsic in the JVM).
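A minimal sketch of that approach for illustration (not the code in the branch): it maps the file with a regular `FileChannel`, pulls the native base address out of `java.nio.Buffer`'s private `address` field via reflection, and reads floats with `Unsafe.getFloat`. The file name and `floatAt` helper are made up, and on JDK 9+ the reflective access needs `--add-opens java.base/java.nio=ALL-UNNAMED`.

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}
import sun.misc.Unsafe

object UnsafeMmapSketch {
  // Grab the Unsafe singleton via reflection (the usual hack).
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // Read the raw native address out of java.nio.Buffer's private `address` field.
  private def rawAddress(buffer: java.nio.Buffer): Long = {
    val f = classOf[java.nio.Buffer].getDeclaredField("address")
    f.setAccessible(true)
    f.getLong(buffer)
  }

  def main(args: Array[String]): Unit = {
    val channel = FileChannel.open(Paths.get("model.bin"), StandardOpenOption.READ) // hypothetical file
    // A single map is still capped at 2^31-1 bytes; this only shows the access pattern.
    val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, math.min(channel.size(), Int.MaxValue.toLong))
    val base   = rawAddress(mapped)

    // Unsafe.getFloat(address) is intrinsified by the JVM, so per-element reads stay cheap.
    def floatAt(i: Long): Float = unsafe.getFloat(base + i * 4L)

    println(floatAt(0))
    channel.close()
  }
}
```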
Works now in all configurations.
Even the smallest model, llama2-7b, has a size of ~26GB (because we only support float32). This means memory mapping the model is not possible because Java's ByteBuffer only supports up to 2^31 elements in its API.

Potential solutions:
Even a single weights set has dimensions (4096, 11008, 32), so that it alone needs a map of size ~2^32. We would probably have to split by layers into e.g. 4 chunks (which would probably add acceptable complexity: one extra indirection to figure out the right map per layer and weights set), as in the sketch below.
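A rough sketch of that chunked-mapping idea (hypothetical class name, file name, and chunk size; not the code in the branch): map the file as several regions of at most 1 GiB each, and pay one extra indirection per lookup to pick the right map.

```scala
import java.nio.{ByteOrder, FloatBuffer}
import java.nio.channels.FileChannel
import java.nio.file.{Path, Paths, StandardOpenOption}

// Expose a file of float32 values as one logical Long-indexed array,
// backed by multiple <=1 GiB memory-mapped chunks.
final class ChunkedFloatFile(path: Path) {
  private val ChunkBytes = 1L << 30 // 1 GiB per mapping, a multiple of 4
  private val channel    = FileChannel.open(path, StandardOpenOption.READ)
  private val fileSize   = channel.size()

  private val chunks: Array[FloatBuffer] =
    (0L until fileSize by ChunkBytes).map { start =>
      val len = math.min(ChunkBytes, fileSize - start)
      channel.map(FileChannel.MapMode.READ_ONLY, start, len)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
    }.toArray

  private val floatsPerChunk = (ChunkBytes / 4).toInt

  // One extra indirection: pick the chunk, then index within it.
  def apply(index: Long): Float =
    chunks((index / floatsPerChunk).toInt).get((index % floatsPerChunk).toInt)
}

object ChunkedFloatFileExample extends App {
  val weights = new ChunkedFloatFile(Paths.get("llama2-7b.bin")) // hypothetical file name
  println(weights(0L))
}
```

In practice the model file starts with a header before the weights, so real lookups would add a fixed byte offset (or map only the weights region); the sketch leaves that out to keep the indirection visible.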