Closed — LLukas22 closed this 1 year ago
Sample command for an 8k context with Llama 2:
cargo run --release --features cublas -- infer -a llama -m "C:\Users\lkreu\Downloads\llama-2-13b-chat.ggmlv3.q5_K_M.bin" -p "A llama riding a crab" --use-gpu --rope-scaling 0.5 --num-ctx-tokens 8192 --ignore-eos --stats
Sit back and get some coffee ☕ (8192 tokens is a lot of tokens to generate.)
A 16k context is also possible by setting rope-scaling to 0.25, but then I don't have enough VRAM to infer on my GPU.
The generated text gets repetitive after some time, but I guess that's a sampler/settings issue. lama_story.txt
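The context extension discussed above is linear RoPE scaling (position interpolation): multiplying each position index by the scaling factor compresses a longer sequence into the angle range the model saw during training, so a factor of 0.5 stretches a 4k-trained model to 8k and 0.25 to 16k. A minimal sketch of the idea (function names are illustrative, not the crate's API):

```python
import math

def rope_frequencies(dim, base=10000.0):
    # One inverse frequency per pair of embedding dimensions,
    # following the standard RoPE formulation.
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

def rope_angles(position, dim, base=10000.0, scaling=1.0):
    # Linear scaling: shrink the position index before computing
    # rotation angles, so positions beyond the trained context
    # still land inside the trained angle range.
    scaled = position * scaling
    return [scaled * f for f in rope_frequencies(dim, base)]

# With scaling 0.5, position 8191 produces exactly the rotation
# angles that position 4095.5 would produce unscaled.
a = rope_angles(8191, 128, scaling=0.5)
b = rope_angles(4095.5, 128, scaling=1.0)
assert a == b
```

This also explains the repetition the comment mentions: interpolated positions are denser than anything seen in training, so sampling settings often need retuning at long contexts.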
Great work! I just tested it with LLongMa-2; it's a bit finicky, but that shouldn't be a problem for us. I've revised the names a little to match llama.cpp / refer to frequency, but the rest is the same. Will merge once CI passes 🚀
Closes https://github.com/rustformers/llm/issues/378.

- Adds custom context scaling to llama, falcon, gpt-j and gpt-neox.
- Adds an Option<ggml::CustomRoPEArguments> parameter to the ModelParameters.
- Adds the optional --rope-base and --rope-scaling CLI parameters.
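A rough sketch of how such optional RoPE overrides could be threaded through the model parameters; the field names and defaults here are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical mirror of ggml::CustomRoPEArguments; real field
// names in the crate may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CustomRoPEArguments {
    base: f32,    // --rope-base (10000.0 is the usual RoPE default)
    scaling: f32, // --rope-scaling (1.0 = no context extension)
}

// Trimmed-down stand-in for ModelParameters.
struct ModelParameters {
    rope: Option<CustomRoPEArguments>,
}

fn main() {
    // None keeps the model's trained defaults.
    let default_params = ModelParameters { rope: None };
    assert!(default_params.rope.is_none());

    // --rope-scaling 0.5 roughly doubles the usable context window.
    let extended = ModelParameters {
        rope: Some(CustomRoPEArguments { base: 10000.0, scaling: 0.5 }),
    };
    assert_eq!(extended.rope.unwrap().scaling, 0.5);
}
```

Using an Option means models that never opt in see no behavior change, which is why the CLI flags can stay optional.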