Closed jrudolph closed 1 year ago
https://github.com/jrudolph/llama2.scala/tree/load-official-llama.bin works for running the llama2 models, but it is quite slow (~10-15s per token).
Please check whether the 63-bit-sized byte buffers and memory-mapped files from https://github.com/OpenHFT/Chronicle-Bytes are suitable for your project.
Thanks for the suggestion. In the branch above I already found a good-enough solution for now. Chronicle-Bytes seems interesting but is a bit too heavyweight for this purpose.
A shortcut to using the same technique as Chronicle-Core would be to use the same hack to get a raw memory address for the mapped memory, or to use something like JNA to mmap the file. To access the data, Unsafe.getFloat can be used (which is an intrinsic in the JVM).
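A minimal sketch of that approach for illustration (not the code in the branch): it maps the file with a regular `FileChannel`, pulls the native base address out of `java.nio.Buffer`'s private `address` field via reflection, and reads floats with `Unsafe.getFloat`. The file name and `floatAt` helper are made up, and on JDK 9+ the reflective access needs `--add-opens java.base/java.nio=ALL-UNNAMED`.

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}
import sun.misc.Unsafe

object UnsafeMmapSketch {
  // Grab the Unsafe singleton via reflection (the usual hack).
  private val unsafe: Unsafe = {
    val f = classOf[Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)
    f.get(null).asInstanceOf[Unsafe]
  }

  // Read the raw native address out of java.nio.Buffer's private `address` field.
  private def rawAddress(buffer: java.nio.Buffer): Long = {
    val f = classOf[java.nio.Buffer].getDeclaredField("address")
    f.setAccessible(true)
    f.getLong(buffer)
  }

  def main(args: Array[String]): Unit = {
    val channel = FileChannel.open(Paths.get("model.bin"), StandardOpenOption.READ) // hypothetical file
    // A single map is still capped at 2^31-1 bytes; this only shows the access pattern.
    val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, math.min(channel.size(), Int.MaxValue.toLong))
    val base   = rawAddress(mapped)

    // Unsafe.getFloat(address) is intrinsified by the JVM, so per-element reads stay cheap.
    def floatAt(i: Long): Float = unsafe.getFloat(base + i * 4L)

    println(floatAt(0))
    channel.close()
  }
}
```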
Works now in all configurations.
Even the smallest model, llama2-7b, has a size of ~26GB (because we only support float32). This means memory mapping the model is not possible because Java's ByteBuffer only supports up to 2^31 elements in its API.

Potential solutions:
Even a single weights set has dimensions (4096, 11008, 32), so that it alone needs a map of size ~2^32. We would probably have to split by layers into e.g. 4 chunks (which would probably add acceptable complexity: one extra indirection to figure out the right map per layer and weights set), as in the sketch below.
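A rough sketch of that chunked-mapping idea (hypothetical class name, file name, and chunk size; not the code in the branch): map the file as several regions of at most 1 GiB each, and pay one extra indirection per lookup to pick the right map.

```scala
import java.nio.{ByteOrder, FloatBuffer}
import java.nio.channels.FileChannel
import java.nio.file.{Path, Paths, StandardOpenOption}

// Expose a file of float32 values as one logical Long-indexed array,
// backed by multiple <=1 GiB memory-mapped chunks.
final class ChunkedFloatFile(path: Path) {
  private val ChunkBytes = 1L << 30 // 1 GiB per mapping, a multiple of 4
  private val channel    = FileChannel.open(path, StandardOpenOption.READ)
  private val fileSize   = channel.size()

  private val chunks: Array[FloatBuffer] =
    (0L until fileSize by ChunkBytes).map { start =>
      val len = math.min(ChunkBytes, fileSize - start)
      channel.map(FileChannel.MapMode.READ_ONLY, start, len)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
    }.toArray

  private val floatsPerChunk = (ChunkBytes / 4).toInt

  // One extra indirection: pick the chunk, then index within it.
  def apply(index: Long): Float =
    chunks((index / floatsPerChunk).toInt).get((index % floatsPerChunk).toInt)
}

object ChunkedFloatFileExample extends App {
  val weights = new ChunkedFloatFile(Paths.get("llama2-7b.bin")) // hypothetical file name
  println(weights(0L))
}
```

In practice the model file starts with a header before the weights, so real lookups would add a fixed byte offset (or map only the weights region); the sketch leaves that out to keep the indirection visible.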