Closed: leigao97 closed this issue 1 day ago.
Hmm, indeed, the latest changes that verify the cache implementation in the config broke the offloaded cache. Until it is fixed, you can pass the cache object directly to generate():
from transformers import OffloadedCache
# Pass the cache object explicitly instead of cache_implementation="offloaded"
model.generate(**inputs, past_key_values=OffloadedCache())
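For reference, a self-contained sketch of that workaround; the checkpoint and prompt below are placeholders, not taken from this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, OffloadedCache

ckpt = "microsoft/Phi-3-mini-4k-instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Fun fact:", return_tensors="pt").to(model.device)

# Build the cache object yourself; a fresh OffloadedCache is needed for each generate() call
out = model.generate(**inputs, max_new_tokens=32, past_key_values=OffloadedCache())
print(tokenizer.decode(out[0], skip_special_tokens=True))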
cc @gante in case you already have this on your radar; otherwise I can fix it next week, since you'll be off until December
cc @ArthurZucker as well
Since @gante is not here, @zucchini-nlp can you have a look?
System Info
transformers version: 4.46.2
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am following the official example to enable KV cache offloading: https://huggingface.co/docs/transformers/en/kv_cache#offloaded-cache
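Roughly the call from that page (the exact checkpoint and prompt may differ from what I used):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "microsoft/Phi-3-mini-4k-instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)

# Select the offloaded KV cache via the string flag, as shown in the docs
out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")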
And I got the error message:
Expected behavior
I expected that cache_implementation="offloaded" is a valid option accepted by model.generate(). After enabling KV cache offloading, peak memory usage should go down and inference time should go up.
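As a rough check of that expectation, one could compare peak GPU memory with and without offloading. This is only a sketch that reuses model, inputs, and OffloadedCache from the snippets above; the helper name is made up:

import torch

def peak_mem_gib(model, inputs, **gen_kwargs):
    # Reset the allocator's peak counter, run generation, and report the new peak in GiB
    torch.cuda.reset_peak_memory_stats(model.device)
    model.generate(**inputs, **gen_kwargs)
    return torch.cuda.max_memory_allocated(model.device) / 2**30

default_peak = peak_mem_gib(model, inputs, max_new_tokens=256)
offloaded_peak = peak_mem_gib(model, inputs, max_new_tokens=256, past_key_values=OffloadedCache())
print(f"default cache: {default_peak:.2f} GiB, offloaded cache: {offloaded_peak:.2f} GiB")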