Ph0rk0z closed this 3 months ago
In all my measurements the dynamic generator has been slightly faster than the old streaming generator at bsz 1 (and of course much faster with bsz 2+). There might be extra overhead because of the new control flow in Tabby, though.
Also it's hard to say if TGW and Tabby are measuring tokens/s exactly the same. A better comparison would be to the previous version of Tabby.
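To illustrate why the two numbers may not be directly comparable, here is a minimal sketch. It is not how either TGW or Tabby actually computes its figure; the function name and the timings are made up for the example. It only shows that the same run can report a different tokens/s depending on whether prompt processing time is counted in the denominator.

```python
def tokens_per_second(new_tokens: int,
                      prompt_seconds: float,
                      generation_seconds: float,
                      include_prompt: bool = False) -> float:
    """Hypothetical helper: throughput for one request.

    include_prompt=False divides by generation time only;
    include_prompt=True also counts prompt processing time.
    """
    elapsed = generation_seconds + (prompt_seconds if include_prompt else 0.0)
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return new_tokens / elapsed


# Made-up timings for a single request, not measurements from this thread.
new_tokens = 512
prompt_seconds = 1.5
generation_seconds = 45.0

gen_only = tokens_per_second(new_tokens, prompt_seconds, generation_seconds)
with_prompt = tokens_per_second(new_tokens, prompt_seconds, generation_seconds,
                                include_prompt=True)

print(f"generation only: {gen_only:.2f} t/s")     # 11.38 t/s
print(f"including prompt: {with_prompt:.2f} t/s")  # 11.01 t/s
```

With these example numbers the two conventions differ by about 0.4 t/s, which is larger than the gap being discussed here, so a comparison is only meaningful if both frontends time the same thing.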
I think if I tried the previous version of Tabby, it would use the standard generator. In that case it was slightly faster than textgen, in the same way as exllama vs exllama_HF. Currently that's 11.30 vs 11.45; it used to be more. I can also try the old version from before dynamic gen, because I backed it up when updating xformers support. I can also check CUDA 12 vs 11.8.
Are you using Q4 cache for this?
Yes, Q4 cache always. It's closer now with the latest commits. I'm seeing at least 11 tokens/s.
Trying out dynamic gen in TabbyAPI vs textgen on command-r+
3x3090 with FA.
vs
I'm only setting the batch size to 1. Is this normal?