Closed: ericphanson closed this 1 year ago
Thanks! Looks great. I have a few minor comments and questions.
Super simple benchmark (the tests) post-0dc057dd38b579433e763c40fd750a02cb51a700:
Testing Running tests...
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 700.3423744068633
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 732.5528663587523
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 723.0568641150903
The first two runs have mmap=false, and the third has mmap=true (it is the same test as the second one).
LGTM; happy to merge once you're done cleaning up the branch. Would be good to rebase if it's not too messy.
Ok, I gave rebasing a go, squashing down to 2 commits (that are orthogonal). I'm mostly used to squash-and-merging PRs, so hopefully I've done it right 😅
I tried to load the Llama 7B model on my 16 GB RAM M1 MacBook Pro, and it ran out of RAM very quickly. So here I've used the Mmap stdlib to load the weights lazily. This seems to work, in that the tests pass, and I can start inference with the 7B model, although it's super slow:
(that's as far as it's gotten in the last couple of minutes, although an earlier buggy run got further). Additionally, the RAM usage seems stable at ~2.27 GB according to Activity Monitor.
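For context, here's a minimal sketch of the eager vs. mmap-backed loading strategies (this is not the PR's actual code; the function names, flat-Float32 file layout, and offset handling are illustrative assumptions):

```julia
using Mmap

# Eager loading: read the whole weight blob into RAM up front.
function load_weights_eager(path::AbstractString, n::Integer; offset::Integer=0)
    open(path, "r") do io
        seek(io, offset)
        w = Vector{Float32}(undef, n)
        read!(io, w)   # allocates and fills n * 4 bytes of RAM immediately
        return w
    end
end

# Lazy loading: mmap the same byte range, so pages are only faulted in as
# they're touched during inference. This keeps resident memory low for a
# 7B-sized checkpoint, at the cost of paging from disk on first access.
function load_weights_mmap(path::AbstractString, n::Integer; offset::Integer=0)
    open(path, "r") do io
        return Mmap.mmap(io, Vector{Float32}, n, offset)
    end
end
```

The mmapped result behaves like a normal `Vector{Float32}`, so the rest of the inference code shouldn't need to care which path produced it.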
I made mmap=false the default since mmapping seems slower for the smaller default model (700 tok/s without mmapping vs 125 tok/s with mmapping for me).
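So for the 7B model you'd opt in explicitly, e.g. (hypothetical call sites; `load_model` and the exact keyword spelling are placeholders, not necessarily the package's real API):

```julia
model = load_model("small_model.bin")             # default mmap=false: faster for small models
model = load_model("llama2_7b.bin"; mmap = true)  # opt in to mmap to keep RAM usage low
```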