This update delivers major latency improvements through several changes.
Improves overall throughput by switching the model to batch mode on a per-topic basis: instead of making a separate call for each topic, all categories are now checked in a single pass (see the sketch below).
The LLM call now defaults to gpt-4o.
The LLM prompt is now shorter and asks only for a JSON list, which significantly improves latency.
Note: we tried function calling to improve the reliability of the LLM output, but it is much slower. JSON mode, combined with explicitly telling the model to return JSON, is faster than relying on function calling.
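To make the batched, JSON-mode approach concrete, here is a minimal sketch of what a single-pass gpt-4o call can look like. The function name, prompt wording, and response shape are illustrative assumptions, not the actual code in this PR.

```python
# Hypothetical sketch of the batched JSON-mode call (illustrative only).
import json
from openai import OpenAI

client = OpenAI()

def check_categories(text: str, categories: list[str]) -> dict[str, bool]:
    """Check all categories against the input in a single gpt-4o call."""
    prompt = (
        "For the text below, return a JSON object with a key 'results' whose "
        "value is a list of booleans, one per category, in the given order.\n"
        f"Categories: {categories}\n"
        f"Text: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        # JSON mode was faster than function calling in our tests, but the
        # prompt must still explicitly ask the model to return JSON.
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    results = json.loads(response.choices[0].message.content)["results"]
    return dict(zip(categories, results))
```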
Overall code cleanup for readability and maintainability.
The most recent tests show ~700ms for the LLM call and around 1.5s total for inference on my M2-based Mac on CPU; with a GPU it should be faster.
With the LLM disabled and a GPU available, latency can be as low as ~300ms for a single validation call.