I am benchmarking a workload that involves repeated language model calls, both with and without enforcement. Recent versions of llama-cpp-python seem to cause severe performance degradation when combined with lm-format-enforcer. Here are the running times for two distinct workloads (with and without enforcement) on two versions of llama-cpp-python:
| Workload | 0.2.37 | 0.2.53 |
|---|---|---|
| A (without lmfe) | 8.99 sec | 9.35 sec |
| B (with lmfe) | 26.82 sec | 122.53 sec |
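For context, the timings above come from a plain wall-clock harness along these lines (a sketch; the workload callable is a stand-in for the actual sequence of llama-cpp-python calls, with or without the lmfe logits processor):

```python
import time

def time_workload(fn, repeats=3):
    """Run fn() several times and return the best wall-clock time.

    Taking the best of several runs reduces noise from caching
    and background load.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # placeholder: the real benchmark calls the llm here
        best = min(best, time.perf_counter() - start)
    return best
```

Both workloads were timed the same way, so the relative slowdown between versions should be robust even if absolute numbers vary across machines.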
Intermediate versions of llama-cpp-python (higher than 0.2.37) are either not usable in conjunction with lmfe, or yield runtimes similar to 0.2.53. I'm not entirely sure whether this is an lmfe issue or a llama-cpp-python one. Moreover (possibly an unrelated issue), these recent versions of llama-cpp-python yield empty outputs when used in conjunction with lmfe for certain language models, if the generation parameter `stream` is set to `True`.
Currently, I am still pinning my own version of llama-cpp-python to 0.2.37 due to the issues above.
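For anyone hitting the same slowdown, the workaround is a simple requirements pin (version taken from the measurements above):

```
llama-cpp-python==0.2.37
```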