Closed: eigen2017 closed this issue 10 months ago.
I think the latency you measured includes the LLM encoding, not just tokenization. SentencePiece tokenization on its own should be fast enough.
Encoding 50,738 English words takes only about 40 ms (and that figure includes model initialization, so steady-state encoding is usually even faster).
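If you want to check whether tokenization itself is the bottleneck, a minimal timing sketch like the one below can isolate it from the model forward pass. It assumes the `sentencepiece` Python package and a `tokenizer.model` file at a placeholder path; it is not code from this repository.

```python
import time

import sentencepiece as spm

# Load a SentencePiece model; the path is a placeholder for whatever
# tokenizer.model file ships with the LLM in question.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# A roughly 300-token prompt, mirroring the prompt length mentioned above.
prompt = "the quick brown fox jumps over the lazy dog " * 40

# Warm-up call so the measurement excludes one-time initialization cost.
sp.encode(prompt)

start = time.perf_counter()
ids = sp.encode(prompt)  # tokenization only, no model forward pass
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{len(ids)} tokens in {elapsed_ms:.3f} ms")
```

If this prints well under a millisecond while the end-to-end call still takes ~100 ms, the time is being spent in the model encoding step rather than in SentencePiece.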
With a prompt length of 300, encode latency comes to about 100 ms. Is any improvement planned? A CUDA version might be one way. Thanks.