Hi Mihai, thanks for the contribution!
I actually had a similar optimisation in the first version, but got rid of it because of a warning that passing `past_key_values` that way would soon be deprecated. Were you getting the same warning message with this code?
I'm about 90% finished with some fairly major changes & fixes, one of which is to generate more than one token per iteration and use stopping criteria to end generation. This is much faster and uses the cache without us needing to manage it ourselves.
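Roughly, the new loop just leans on the standard Hugging Face `stopping_criteria` hook. Here's an untested sketch of the idea (not the actual code in the repo; `StopOnTokens` is an illustrative name, and gpt2 is just a stand-in model):

```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is in `stop_token_ids`."""
    def __init__(self, stop_token_ids):
        self.stop_token_ids = set(stop_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return input_ids[0, -1].item() in self.stop_token_ids

input_ids = tokenizer("Hello", return_tensors="pt").input_ids

# Generate many tokens per call; generate() manages the KV cache internally,
# so there's no need to thread past_key_values through ourselves.
output_ids = model.generate(
    input_ids,
    max_new_tokens=300,
    use_cache=True,
    stopping_criteria=StoppingCriteriaList([StopOnTokens([tokenizer.eos_token_id])]),
)
```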
It may still be worth storing a cache like this since we are still doing the generation piecemeal. Let's check in with this again once I've uploaded the latest version.
Hi!
Yup, makes sense. I noticed that the class that inherits `StoppingCriteria` is never actually passed to the Hugging Face generation code, so it currently has no effect on stopping sampling; I'm looking forward to the refactoring that will use it.
To answer your question, I don't remember seeing that warning.
Ok, I've uploaded the refactor.
That is quite some refactoring 😃
I can't try it now, but I looked at the code and it seems that the initial issues should be resolved.
Thanks! Closing! :)
This change significantly improves generation time, since we generate one token at a time in a loop, but it also increases memory usage. We are not affecting/deleting the existing `logit_cache` variable because it plays a different role and, in any case, it should always be small enough not to matter much.

Please be aware that, by default, this PR sets `use_cache` to `True`.

A manual test (see also the ipynb in this PR) shows a >6x speedup for a 300-token max length. This gap is expected to grow as more tokens are generated: without `use_cache=True`, each step re-processes the entire sequence so far, so generation gets slower and slower as it proceeds.
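For anyone curious what the cached loop looks like, here is a minimal sketch of the kind of `past_key_values` reuse this PR relies on (an illustration under standard `transformers` assumptions, not the code in this PR; `generate_with_cache` is a hypothetical name, and greedy decoding is used for brevity):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def generate_with_cache(prompt: str, max_new_tokens: int = 300) -> str:
    generated = tokenizer(prompt, return_tensors="pt").input_ids
    past_key_values = None
    for _ in range(max_new_tokens):
        if past_key_values is None:
            # First step: run the full prompt once and build the cache.
            out = model(input_ids=generated, use_cache=True)
        else:
            # Later steps: feed only the newest token; the cache covers the rest,
            # so the model never re-processes the whole sequence each iteration.
            out = model(
                input_ids=generated[:, -1:],
                past_key_values=past_key_values,
                use_cache=True,
            )
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy for brevity
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(generate_with_cache("The cache makes generation"))
```

The memory trade-off mentioned above comes from `past_key_values` holding the keys and values for every layer and every position generated so far.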