Hi Mihai, thanks for the contribution!
I actually had a similar optimisation in the first version, but got rid of it because of a warning that passing `past_key_values` that way would soon be deprecated. Were you getting the same warning message with this code?
I'm about 90% finished with some fairly major changes & fixes, one of which is to generate more than one token per iteration and use stopping criteria to end generation. This is much faster and uses the cache without us needing to manage it ourselves.
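Roughly, the new loop just leans on the standard Hugging Face `stopping_criteria` hook. Here's an untested sketch of the idea (not the actual code in the repo; `StopOnTokens` is an illustrative name, and gpt2 is just a stand-in model):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is in `stop_token_ids`."""
    def __init__(self, stop_token_ids):
        self.stop_token_ids = set(stop_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return input_ids[0, -1].item() in self.stop_token_ids

input_ids = tokenizer("Hello", return_tensors="pt").input_ids

# Generate many tokens per call; generate() manages the KV cache internally,
# so there's no need to thread past_key_values through ourselves.
output_ids = model.generate(
    input_ids,
    max_new_tokens=300,
    use_cache=True,
    stopping_criteria=StoppingCriteriaList([StopOnTokens([tokenizer.eos_token_id])]),
)
```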
It may still be worth storing a cache like this since we are still doing the generation piecemeal. Let's check in with this again once I've uploaded the latest version.
Hi!
Yup, makes sense. I noticed that the class that inherits `StoppingCriteria` is never actually passed to the Hugging Face generation code, so it currently has no effect on stopping sampling; I'm looking forward to the refactoring that will use it.
To answer your question, I don't remember seeing that warning.
Ok, I've uploaded the refactor.
That is quite some refactoring 😃
I can't try it now, but I looked at the code and it seems that the initial issues should be resolved.
Thanks! Closing! :)
This change significantly improves generation time, since we generate one token at a time in a loop, but it also increases memory usage. We are not affecting/deleting the existing `logit_cache` variable because it plays a different role and, in any case, it should always be small enough not to matter much.

Please be aware that, by default, this PR sets `use_cache` to `True`.

A manual test (see also the ipynb in this PR) shows a >6x speedup for a 300-token max length. This gap is expected to grow as more tokens are generated: without `use_cache=True`, each step re-processes the entire sequence so far, so generation gets slower and slower as it proceeds.
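For anyone curious what the cached loop looks like, here is a minimal sketch of the kind of `past_key_values` reuse this PR relies on (an illustration under standard `transformers` assumptions, not the code in this PR; `generate_with_cache` is a hypothetical name, and greedy decoding is used for brevity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def generate_with_cache(prompt: str, max_new_tokens: int = 300) -> str:
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None
    for _ in range(max_new_tokens):
        if past_key_values is None:
            # First step: run the full prompt once and build the cache.
            out = model(input_ids=generated, use_cache=True)
        else:
            # Later steps: feed only the newest token; the cache covers the rest,
            # so the model never re-processes the whole sequence each iteration.
            out = model(
                input_ids=generated[:, -1:],
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy for brevity
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate_with_cache("The cache makes generation"))
```

The memory trade-off mentioned above comes from `past_key_values` holding the keys and values for every layer and every position generated so far.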