olegchomp opened this issue 7 months ago
I have no immediate plans to do this myself, but if someone wants to submit a pull request I'd be happy to work together to support faster inference.
I have been experimenting with transformer KV caching on this branch, which does speed things up for longer generations. I haven't merged it into main yet because I need to find time to do more thorough testing (warning: there could be bugs).
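For reference, here is a minimal sketch of what KV caching looks like during autoregressive decoding, assuming a Hugging Face GPT-2 style model that supports the `past_key_values` interface (the `"gpt2"` checkpoint below is just a placeholder, not this repo's actual model):

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # placeholder checkpoint

input_ids = torch.tensor([[10, 20, 30]])  # some already-generated tokens
past_key_values = None

with torch.no_grad():
    for _ in range(50):
        # With a cache, only the newest token is fed on each step;
        # attention keys/values for earlier tokens are reused instead of recomputed.
        step_input = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(step_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy step
        input_ids = torch.cat([input_ids, next_token], dim=-1)
```

The speedup grows with sequence length, since each step becomes roughly O(n) attention against the cache instead of O(n²) over the full prefix.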
@jthickstun have you considered quantisation of the model for a performance improvement? I suppose the main downside is that the output quality may get a little dicey.
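As a cheap experiment, post-training dynamic quantization in PyTorch is one option; this is just a sketch under the assumption that `nn.Linear` layers dominate the runtime, and it is untested against this repo's model:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # placeholder checkpoint

# Dynamic quantization converts Linear weights to int8 at load time and
# quantizes activations on the fly; note it targets CPU inference only.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # which layers to quantize (an assumption)
    dtype=torch.qint8,
)
```

Whether the accuracy hit is acceptable for generation quality would need to be measured.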
Thank you for the great repo! It would be great to have some kind of acceleration, e.g. TensorRT.
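If someone wants to explore the TensorRT route, a plausible first step is exporting the model to ONNX and then building a TensorRT engine from that file with `trtexec`. The sketch below is only illustrative; the checkpoint, input shapes, and names are assumptions, not this repo's actual export path:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # placeholder checkpoint
model.config.use_cache = False  # simpler export: single logits output

dummy_input = torch.zeros(1, 128, dtype=torch.long)  # batch of token ids
torch.onnx.export(
    model,
    (dummy_input,),
    "model.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
    opset_version=17,
)
```

Getting cached, incremental decoding to work inside a TensorRT engine is the harder part and would need real design work.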