tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

use mojo file reader, remove read/ folder #50

Closed · rd4com closed this 10 months ago

rd4com commented 10 months ago

Had to start from 0 to keep it clean; I struggle with git, sorry.

rd4com commented 10 months ago

Yes, sorry about that, fixed. Also, the file descriptor was not being closed; that is fixed now. Just make sure the weights are the same and that it all works fine.

The commit history is messed up.

Might wait for the next release (see ticket 1062).

tairov commented 10 months ago

Commit history is not an issue; we have this wonderful feature when merging:

(screenshot of the merge options)
mikowals commented 10 months ago

FWIW, it might not be that important here to avoid holding 2 copies of the weights. I understand the motivation not to, but in the next steps, when the weight tensors are created, new pointers are created by explicitly copying the data. So that is the bottleneck for memory usage. Any optimisation that doesn't remove those copies when making the tensor weights won't matter.
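For context, the copy in question looks roughly like this -- a hedged sketch with illustrative names, not the exact code in the repo: each weight is duplicated out of the file buffer into a freshly allocated Tensor, so the checkpoint briefly exists twice.

```mojo
from memory import memcpy
from memory.unsafe import DTypePointer
from tensor import Tensor, TensorShape

# buf_data points into the FileBuf that holds the whole checkpoint;
# every weight gets its own freshly allocated Tensor plus a memcpy,
# which is the copy that dominates memory usage.
fn copy_weight(buf_data: DTypePointer[DType.float32], offset: Int, n: Int) -> Tensor[DType.float32]:
    var t = Tensor[DType.float32](TensorShape(n))
    memcpy(t.data(), buf_data.offset(offset), n)  # second copy of this weight
    return t
```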

Also, from watching the memory consumption on my M1 MacBook as it runs TinyLlama, the memory usage peaks at ~4.5 GB. So I think there is some memory optimisation (possibly macOS-specific) already happening somewhere. It could also be that it only briefly hits the double memory and my monitoring tools somehow miss the spike.

tairov commented 10 months ago

Thanks @mikowals for sharing your insights and the results of your observations.

I should use the mem_usage feature I implemented in a custom fork of hyperfine -- https://github.com/tairov/hypertune -- which can grab the RSS of executed commands to collect details about actual memory consumption.

rd4com commented 10 months ago

Based on @mikowals' observations, I tried to make some modifications in order to do no copies:

It uses offsets instead of allocating for the weights; the program does end with a free() error.

But does it consume less memory? (It should.) Does it smooth out the bottleneck issue, and if so, in the right way?

https://github.com/rd4com/llama2.mojo/tree/no_copy

The error is because on deletion the Tensor calls free(), but in this case it does so on the offset pointers.

mikowals commented 10 months ago

@rd4com that branch reintroduces the error that the copying was put in to avoid. For better or worse, Tensor owns its pointer and will free it. I probably should have linked to the bug report (which turned out not to be a bug) in my comment above the copy.

Another approach to eliminating the copy would be to fuse read_file, TransformerWeights.__init__, and config_init so that a FileBuf does not need to be created and passed around. To get a memory benefit from this, I think it still requires reading the binary file in portions or as a stream. This is kind of a workaround for the memory ownership issue, but the binary file format already tightly couples these functions, so fusing them is also justified on that basis.

I also made a branch using NDBuffer, which is very similar to Tensor but doesn't own the weights. It was probably a better drop-in replacement for Matrix, but I didn't learn about it until later. It runs slightly slower than the Tensor version, I think because NDBuffer's getters and setters use multi-dimensional indices, which pulls the pointer offset calculation into the innermost loops. But this would be an option if the copy required by Tensor is a problem. Still, I expect Tensor will become the standard type for LLMs and will evolve the features and behaviours that make writing efficient code simple.
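As an illustration of that indexing cost (illustrative code, not the NDBuffer branch itself; it uses Tensor and assumes its multi-index getters), the difference is per-element offset arithmetic in the hot loop versus a row pointer hoisted out of it:

```mojo
from tensor import Tensor

# One row of a matvec with per-element 2-D indexing: the offset i * cols + j
# is recomputed inside the innermost loop on every iteration.
fn row_dot_indexed(w: Tensor[DType.float32], x: Tensor[DType.float32], i: Int, cols: Int) -> Float32:
    var acc: Float32 = 0
    for j in range(cols):
        acc += w[i, j] * x[j]
    return acc

# The same row with the pointer arithmetic hoisted: one offset() call,
# then plain loads inside the loop.
fn row_dot_hoisted(w: Tensor[DType.float32], x: Tensor[DType.float32], i: Int, cols: Int) -> Float32:
    var acc: Float32 = 0
    var row = w.data().offset(i * cols)
    var xp = x.data()
    for j in range(cols):
        acc += row.load(j) * xp.load(j)
    return acc
```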

rd4com commented 10 months ago

It is possible to get the pointer out of the buffer and give each tensor an offset; this allows zero copy and avoids the Tensor free():
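(A rough sketch of the idea with illustrative names, not the branch's actual code: each weight becomes a non-owning view -- shared pointer plus offset -- so nothing is copied and no destructor tries to free the buffer.)

```mojo
from memory.unsafe import DTypePointer

# A non-owning view into the FileBuf: the buffer keeps sole ownership of the
# allocation, each weight only stores the shared pointer and its offset.
struct WeightView:
    var data: DTypePointer[DType.float32]
    var offset: Int
    var size: Int

    fn __init__(inout self, data: DTypePointer[DType.float32], offset: Int, size: Int):
        self.data = data
        self.offset = offset
        self.size = size

    fn load(self, i: Int) -> Float32:
        return self.data.load(self.offset + i)
```

The trade-off noted below still applies: such a view is only valid while the FileBuf stays alive.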

But maybe it is not good practice.

It might be better to wait for your feature requests to be implemented in order to read in batches:

A branch is a possibility for those who want to load huge models.

What do you think?

tairov commented 10 months ago

Got an error on Mojo 0.5.0:

```
/private/tmp/llama2.mojo/llama2.mojo:434:13: error: invalid mutation of immutable value 'fd.handle'
    fd.close()
            ^
/private/tmp/llama2.mojo/llama2.mojo:432:5: note: 'fd' declared here
    let fd = open(file_name, "r")
    ^
mojo: error: failed to run the pass manager
```
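Presumably the immutable binding is the culprit: the handle is bound with `let`, so `close()` cannot mutate it. A minimal sketch of the likely fix on Mojo 0.5.0 (function name illustrative):

```mojo
fn read_file(file_name: String) raises:
    # was `let fd = ...`; a mutable binding lets close() modify fd.handle
    var fd = open(file_name, "r")
    fd.close()
```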