noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Performance degradation with recent versions of llama-cpp-python #82

Closed pinoloricato closed 4 months ago

pinoloricato commented 7 months ago

I am benchmarking workloads involving repeated language model calls, both with and without enforcement. It seems that recent versions of llama-cpp-python lead to severe performance degradation when combined with lm-format-enforcer. Here are running times for two workloads (one without enforcement, one with) across two versions of llama-cpp-python:

| | 0.2.37 | 0.2.53 |
|---|---|---|
| Workload A (without lmfe) | 8.99 sec | 9.35 sec |
| Workload B (with lmfe) | 26.82 sec | 122.53 sec |

Intermediate versions of llama-cpp-python (above 0.2.37) are either unusable in conjunction with lmfe or yield runtimes similar to 0.2.53. I'm not entirely sure whether this is an lmfe issue or a llama-cpp-python one. Moreover (possibly an unrelated issue), these recent versions of llama-cpp-python yield empty outputs for certain language models when used together with lmfe and the generation parameter `stream` set to True.
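
For reference, here is a minimal sketch of the kind of benchmark I'm timing (not my exact code). The model path, prompt, and schema are placeholders; the integration calls follow lm-format-enforcer's documented llama.cpp usage, though exact signatures may vary between versions:

```python
import time

from llama_cpp import Llama, LogitsProcessorList
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.llamacpp import (
    build_llamacpp_logits_processor,
    build_token_enforcer_tokenizer_data,
)

MODEL_PATH = "model.gguf"  # placeholder: path to any GGUF model
PROMPT = "Please describe a person in JSON format:"
SCHEMA = {  # placeholder schema, for illustration only
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = Llama(model_path=MODEL_PATH, n_ctx=2048)

# Baseline: unconstrained generation ("without lmfe")
start = time.perf_counter()
llm(PROMPT, max_tokens=256)
print(f"without lmfe: {time.perf_counter() - start:.2f} sec")

# Enforced generation: attach lm-format-enforcer's logits processor ("with lmfe")
tokenizer_data = build_token_enforcer_tokenizer_data(llm)
logits_processor = build_llamacpp_logits_processor(tokenizer_data, JsonSchemaParser(SCHEMA))
start = time.perf_counter()
llm(PROMPT, max_tokens=256, logits_processor=LogitsProcessorList([logits_processor]))
print(f"with lmfe: {time.perf_counter() - start:.2f} sec")
```

Only the llama-cpp-python version changes between runs; the benchmark code itself stays the same.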

For now, I am still pinning my own version of llama-cpp-python to 0.2.37 because of the issues above.