I am benchmarking a workload that involves repeated language model calls, both with and without enforcement. Recent versions of llama-cpp-python seem to cause severe performance degradation when combined with lm-format-enforcer. Here are the running times for two distinct workloads (with and without enforcement) on two versions of llama-cpp-python:
| Workload | 0.2.37 | 0.2.53 |
|---|---|---|
| A (without lmfe) | 8.99 sec | 9.35 sec |
| B (with lmfe) | 26.82 sec | 122.53 sec |
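For context, the timings above come from a plain wall-clock harness along these lines (a sketch; the workload callable is a stand-in for the actual sequence of llama-cpp-python calls, with or without the lmfe logits processor):

```python
import time

def time_workload(fn, repeats=3):
    """Run fn() several times and return the best wall-clock time.

    Taking the best of several runs reduces noise from caching
    and background load.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # placeholder: the real benchmark calls the llm here
        best = min(best, time.perf_counter() - start)
    return best
```

Both workloads were timed the same way, so the relative slowdown between versions should be robust even if absolute numbers vary across machines.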
Intermediate versions of llama-cpp-python (higher than 0.2.37) are either not usable in conjunction with lmfe, or yield runtimes similar to 0.2.53. I'm not entirely sure whether this is an lmfe issue or a llama-cpp-python one. Moreover (possibly an unrelated issue), these recent versions of llama-cpp-python yield empty outputs when used in conjunction with lmfe for certain language models, if the generation parameter `stream` is set to `True`.
Currently, I am still pinning my own version of llama-cpp-python to 0.2.37 due to the issues above.
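For anyone hitting the same slowdown, the workaround is a simple requirements pin (version taken from the measurements above):

```
llama-cpp-python==0.2.37
```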