Sequence of events observed through `RUST_LOG=trace`:
...
...
Loaded tensor 360/363
...
Model size = 13152.46 MB / num tensors = 363
TRACE llm_base::inference_session > Starting inference request with max_token_count: 1844674407370955161
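As an aside, the odd `max_token_count` in that trace line looks like a sentinel rather than a real limit; on a 64-bit target it is exactly `usize::MAX / 10` (this is my own observation from the number, not something confirmed in the `llm` source):

```rust
fn main() {
    // The trace shows max_token_count: 1844674407370955161.
    // On a 64-bit target that is exactly usize::MAX / 10, which
    // suggests an "effectively unlimited" guard value rather than
    // a deliberately chosen token budget.
    let logged = 1_844_674_407_370_955_161usize;
    assert_eq!(usize::MAX / 10, logged);
    println!("{}", usize::MAX / 10); // prints 1844674407370955161
}
```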
// in the logging callback I do see the tokenised strings for llm::InferenceResponse::PromptToken(t)
TRACE llm_base::inference_session > Finished feed prompt
SamplerFailure(NoToken) // the sampler (from `llm_samplers`) errors here
If I test with `llm::InferenceParameters::default()` (i.e. without specifying a sampler), it just stalls at this stage; after the feed prompt, no new tokens are generated.
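For anyone debugging alongside me: this is not `llm`'s actual API, just a hypothetical standalone check, but a `NoToken` failure from `llm_samplers` usually means the sampler found no valid candidate, so the first thing I'd inspect is whether the logits coming back (e.g. from the Metal path) are even finite:

```rust
// Hypothetical helper for debugging SamplerFailure(NoToken): if the
// logits the model hands to the sampler contain NaN/inf (a common
// symptom of a broken accelerated backend or a mis-converted ggml
// file), every sampler in the chain will reject all candidates.
fn logits_look_sane(logits: &[f32]) -> bool {
    !logits.is_empty() && logits.iter().all(|l| l.is_finite())
}

fn main() {
    assert!(logits_look_sane(&[0.1, -2.3, 5.0]));
    assert!(!logits_look_sane(&[f32::NAN, 1.0]));
    assert!(!logits_look_sane(&[])); // empty logits are also a failure mode
    println!("ok");
}
```

In practice this would mean dumping the logits slice inside the inference loop (or behind a trace log) right before sampling, and comparing the Metal run against a CPU-only run.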
Looking at the current repo, I realised the llama.cpp submodule (`crates/ggml/sys/llama-cpp`) is at commit `8183159`. So I checked out that commit in llama.cpp and generated a fresh ggml file from Meta's LLaMA2-13b-chat weights for `q8_0` (the same codebase had worked with `q8_0` for a 7B model a couple of weeks back). The same behaviour can be observed.
I know that because of the upstream changes to GGUF this effort is somewhat broken right now and might take a few days/weeks to stabilise. Is there any way I can debug this and try to figure out what's going on?
Any tips/directions to get this up and running in the meantime would be super helpful.
Here are the particular details that have failed:
- System: Mac M1 Pro, 16 GB RAM, 10 cores
- Crate: `llm`, branch `main`, released: false, `features = ["llama"]`, `metal` acceleration