To improve latency and reduce cost, LLM completions, OCR results, and other predictions should be cached on disk. The cache key must take into account every parameter that should invalidate an entry (e.g. a different model or a changed prompt), and caching should be disabled entirely for non-deterministic requests (e.g. temperature > 0). A minimal sketch follows.
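The sketch below uses only the standard library; `cached_completion`, `CACHE_DIR`, and the `compute` callback are illustrative assumptions, not an existing API. It shows the two behaviours described above: hashing all invalidating parameters into the key, and bypassing the cache when sampling is non-deterministic.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical on-disk cache location.
CACHE_DIR = Path(".prediction_cache")


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Derive a stable key from everything that should invalidate the cache."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(model: str, prompt: str, params: dict, compute) -> str:
    """Return a cached prediction if available, otherwise compute and store it.

    `compute(model, prompt, params)` stands in for the actual model call.
    Caching is skipped entirely when temperature > 0, since replaying a
    non-deterministic request would change its semantics.
    """
    if params.get("temperature", 0) > 0:
        return compute(model, prompt, params)  # non-deterministic: bypass cache

    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt, params)}.json"
    if path.exists():
        return json.loads(path.read_text())["completion"]

    completion = compute(model, prompt, params)
    path.write_text(json.dumps({"completion": completion}))
    return completion
```

Changing the model name, the prompt, or any other entry in `params` produces a different hash, so stale results are never returned; the same mechanism applies unchanged to OCR results or other predictions by serialising their inputs into the key.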