Closed: ericphanson closed this 1 year ago
Thanks! Looks great. I have a few minor comments and questions.
Super simple benchmark (the tests) post-0dc057dd38b579433e763c40fd750a02cb51a700:
Testing Running tests...
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 700.3423744068633
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 732.5528663587523
[ Info: dim = 288
[ Info: hidden_dim = 768
[ Info: n_layers = 6
[ Info: n_heads = 6
[ Info: n_kv_heads = 6
[ Info: seq_len = 256
[ Info: shared_weights = true
[ Info: vocab_size = 32000
[ Info: achieved tok/s: 723.0568641150903
The first two runs have mmap=false, and the third has mmap=true (it is the same test as the second one).
LGTM; happy to merge once you're done cleaning up the branch. Would be good to rebase if it's not too messy.
Ok, I gave rebasing a go, squashing down to 2 commits (that are orthogonal). I'm mostly used to squash-and-merging PRs, so hopefully I've done it right 😅
I tried to load the Llama 7B model on my 16 GB RAM M1 MacBook Pro, and it ran out of RAM very quickly. So here I've used the Mmap stdlib to load the weights lazily. This seems to work, in that the tests pass, and I can start inference with the 7B model, although it's super slow:
(that's as far as it's gotten in the last couple of minutes, although an earlier buggy run got further). Additionally, the RAM usage seems stable at ~2.27 GB according to Activity Monitor.
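For context, here's a minimal sketch of the eager vs. mmap-backed loading strategies (this is not the PR's actual code; the function names, flat-Float32 file layout, and offset handling are illustrative assumptions):

```julia
using Mmap

# Eager loading: read the whole weight blob into RAM up front.
function load_weights_eager(path::AbstractString, n::Integer; offset::Integer=0)
    open(path, "r") do io
        seek(io, offset)
        w = Vector{Float32}(undef, n)
        read!(io, w)   # allocates and fills n * 4 bytes of RAM immediately
        return w
    end
end

# Lazy loading: mmap the same byte range, so pages are only faulted in as
# they're touched during inference. This keeps resident memory low for a
# 7B-sized checkpoint, at the cost of paging from disk on first access.
function load_weights_mmap(path::AbstractString, n::Integer; offset::Integer=0)
    open(path, "r") do io
        return Mmap.mmap(io, Vector{Float32}, n, offset)
    end
end
```

The mmapped result behaves like a normal `Vector{Float32}`, so the rest of the inference code shouldn't need to care which path produced it.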
I made mmap=false the default since mmapping seems slower for the smaller default model (700 tok/s without mmapping vs 125 tok/s with mmapping for me).
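So for the 7B model you'd opt in explicitly, e.g. (hypothetical call sites; `load_model` and the exact keyword spelling are placeholders, not necessarily the package's real API):

```julia
model = load_model("small_model.bin")             # default mmap=false: faster for small models
model = load_model("llama2_7b.bin"; mmap = true)  # opt in to mmap to keep RAM usage low
```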