Closed JewishLewish closed 8 months ago
I've been playing around with the Python and Rust bindings of llama.cpp and noticed that Python was producing content much faster despite using the same model and input.
When I printed out the args/specs of the run, I noticed some settings were present in the Python binding but missing from the Rust binding:
```
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: n_yarn_orig_ctx  = 2048
```
I am not sure whether Python is using better settings or I have chosen poor ones myself, since I've been experimenting with threads, n_batch, batch, and n_gpu_layers. I tried to find comments in the Rust code but couldn't find anything.
Ex. (Python's binding): ![image](https://github.com/mdrokz/rust-llama.cpp/assets/65754609/79e9e6d3-8551-4a9e-be6b-97ed3419ace0)
Any recommendations?
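For comparison, here is roughly how those knobs are passed on the Python side. The keyword arguments (`n_ctx`, `n_threads`, `n_batch`, `n_gpu_layers`) are real llama-cpp-python constructor parameters, but the model path and values below are placeholders, so treat this as a sketch rather than my exact invocation:

```python
# Settings being compared against the Rust binding; values are examples only.
kwargs = dict(
    n_ctx=2048,       # context window (matches n_yarn_orig_ctx above)
    n_threads=8,      # CPU threads used for generation
    n_batch=512,      # prompt-processing batch size
    n_gpu_layers=32,  # layers offloaded to the GPU
)

try:
    from llama_cpp import Llama  # pip install llama-cpp-python

    # Placeholder model path; point this at your own GGUF file.
    llm = Llama(model_path="models/model.Q4_K_M.gguf", **kwargs)
    print(llm("Hello", max_tokens=16))
except ImportError:
    # Library not installed; the dict above still documents the knobs.
    print("llama-cpp-python not installed; settings:", kwargs)
```

The Rust binding exposes similarly named options, but as noted below, the names and docs don't line up one-to-one with these.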
Closing it.
Turns out that the documentation/variable names in the Rust binding are lacking / off.