irthomasthomas opened 2 months ago
I made some improvements. I've been using this version for a few days and it seems to work well. It automatically caches continued conversations if they were flagged for caching.
Here is a long conversation with prompt caching enabled. It compares the cached price to what it would have cost without caching.
Now that I have played with it for a week, I think I want a menu to manage the cache options:
- `llm cache on/off` - toggle always-on caching
- `llm cache TTL [minutes]` - default time to expire caches
- `llm cache list` - list active keep-alive caches
This will also need a prompt option to specify a keep-alive time, e.g. `llm -o cache_prompt 1 -o keep_alive 60` (cache the prompt and keep it active for 60 minutes).
Also, I would prefer to use flags like `--cache`, rather than `-o cache_prompt 1`, if possible.
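Roughly what I have in mind, as a sketch only (none of these commands or flags exist yet, and `--cache-ttl` is just a placeholder name):

```bash
# Hypothetical sketch of the proposed interface; nothing here is implemented.
llm cache on                 # enable always-on caching
llm cache TTL 60             # default time (minutes) before caches expire
llm cache list               # list active keep-alive caches

# Flag-style equivalent of -o cache_prompt 1 -o keep_alive 60
# (--cache-ttl is a placeholder flag name)
llm --cache --cache-ttl 60 "summarise this repo"
```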
Currently we can independently choose to cache the system prompt or the user prompt. Is this useful? Are there cases where we would want to cache one but not the other?
I'm not sure I like the interface, tbh, but see what you think. It's hard to trade off features against complexity. I didn't think caching should be turned on by default, due to the 25% premium on cache writes. This first implementation is less flexible than the API allows: the API lets you include both cached and non-cached system and user prompts in a single request. I wasn't sure how to do that without editing the main project's cli.py.
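For reference, mixing cached and non-cached blocks at the raw Messages API level looks roughly like this: the `cache_control` markers go on individual system and user content blocks. The beta header and model name below are my assumptions from Anthropic's prompt-caching docs, not something the plugin sends verbatim.

```bash
# Sketch only: one request containing both cached and non-cached
# system and user blocks. Header and model are assumptions.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: prompt-caching-2024-07-31" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 500,
    "system": [
      {"type": "text", "text": "Short instructions, not cached."},
      {"type": "text", "text": "Large reference document to cache...",
       "cache_control": {"type": "ephemeral"}}
    ],
    "messages": [
      {"role": "user", "content": [
        {"type": "text", "text": "Long user context to cache...",
         "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": "And a normal, uncached question."}
      ]}
    ]
  }'
```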
I think ideally the interface would be something like `llm --system "not-cached system prompt" --cached-system "this system prompt will be cached" --cached-user "a user prompt to be cached" "non-cached prompt"`. But how do we handle the case of requesting ONLY a cached user prompt? I think the current CLI demands a prompt, so that would need updating.
Anyway, here is the current implementation. To use it, you pass the prompt and `--system` as normal, then choose whether to cache either one by adding `-o cache_prompt 1` and/or `-o cache_system 1`. Cache metadata is then returned in the JSON.
Added options `cache_prompt` and `cache_system`. To use:
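For example, a first request that caches a large system prompt might look like this (the model alias and file name are just illustrative, not the exact command I ran), producing the usage block below:

```bash
# Step 1: cache a ~10,000-token system prompt (model/file are illustrative).
llm -m claude-3.5-sonnet \
  --system "$(cat 10k_token_reference.txt)" \
  -o cache_system 1 \
  "First question about the reference"
```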
"usage": {"input_tokens": 10000, "output_tokens": 500, "cache_creation_input_tokens": 10000, "cache_read_input_tokens": 0}}
This first prompt asks to cache the system prompt. I used a file of 10,000 tokens, and as this is the first request we see "cache_creation_input_tokens": 10000 returned in the JSON. This means we paid 25% more for those tokens, but future requests using the same system prompt will be discounted 90%, as long as we re-prompt within the cache TTL, currently 5 minutes (refreshed each time).
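Re-running with the same system prompt but a new question (again, an illustrative command rather than the exact one) reads back from that cache, giving the usage block below:

```bash
# Step 2: same cached system prompt, new user prompt (illustrative).
llm -m claude-3.5-sonnet \
  --system "$(cat 10k_token_reference.txt)" \
  -o cache_system 1 \
  "A different question about the same reference"
```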
"usage": {"input_tokens": 10001, "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens":10000 }}
This sends a new user prompt but the same system prompt, along with the cache_system option, which means we reuse the cached system prompt from the previous command. We can see we hit the cache in the usage response: "cache_read_input_tokens": 10000.
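Next, adding `-o cache_prompt 1` with a large (roughly 5,000-token) user prompt, while keeping the same cached system prompt (illustrative command), gives the usage block below:

```bash
# Step 3: also cache a ~5,000-token user prompt (illustrative).
llm -m claude-3.5-sonnet \
  --system "$(cat 10k_token_reference.txt)" \
  -o cache_system 1 \
  -o cache_prompt 1 \
  "$(cat 5k_token_question.txt)"
```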
"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000 }}
This time we CREATE a USER prompt cache, and also READ a SYSTEM cache. Hence: "cache_creation_input_tokens": 5000, "cache_read_input_tokens":10000
"usage": {"input_tokens": 15000 "output_tokens": 500, "cache_creation_input_tokens": 0, "cache_read_input_tokens" :15000
Finally, running the same command again causes both system and user prompt cache reads.