chat=False, streaming=True:
225 tokens in 1:52 -> ~2 tokens/s. So the slowdown seems to be entirely streaming's fault, not chat's.
chat=True, streaming=False:
243 tokens in 1:06 -> ~3.7 tokens/s. So again it all seems to be streaming's fault, as I worried.
However, the odd thing is that generation gets progressively slower, even with processing disabled. But processing is still occurring, since I can see the code path being hit. So the slow generation may still be due to style checks; I'll try disabling all style checking.
No sanitization of the response, back to chat=True, streaming=True:
243 tokens in 1:09 -> ~3.5 tokens/s
So most of the issue is sanitization, i.e. the profanity filter.
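To illustrate why that would also explain the progressive slowdown: a minimal sketch (the `sanitize()` below is a hypothetical stand-in for the profanity filter, not the actual h2oGPT code) comparing per-chunk sanitization of the growing response during streaming against sanitizing once at the end.

```python
import time

def sanitize(text: str) -> str:
    # Hypothetical stand-in for the profanity filter; assume its cost grows with len(text).
    time.sleep(1e-5 * len(text))
    return text

def stream_tokens(n_tokens: int = 250):
    # Fake generator standing in for the model's token stream.
    for _ in range(n_tokens):
        yield "tok "

def per_chunk_sanitize() -> float:
    # Re-sanitize the full accumulated text on every streamed chunk: O(n^2) total work,
    # so each chunk gets slower as the response grows.
    t0 = time.time()
    out = ""
    for tok in stream_tokens():
        out += tok
        out = sanitize(out)
    return time.time() - t0

def final_sanitize() -> float:
    # Sanitize once after generation: O(n) total work.
    t0 = time.time()
    out = "".join(stream_tokens())
    sanitize(out)
    return time.time() - t0

if __name__ == "__main__":
    print(f"per-chunk sanitize: {per_chunk_sanitize():.2f}s")
    print(f"final sanitize:     {final_sanitize():.2f}s")
```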
Original issue, supposedly fixed: https://github.com/gradio-app/gradio/issues/4092
Streaming may still be a blocker; it seems to get slower near the end of generation, though that may just be the sanitization issue above.
Maybe we can also batch in gradio: https://gradio.app/setting-up-a-demo-for-maximum-performance/#the-max_batch_size-parameter (sketch below), but that is probably still not continuous batching.
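For reference, a minimal sketch of gradio request batching per the max_batch_size docs linked above; the batched function here is a placeholder, not h2oGPT's actual generate path:

```python
import time
import gradio as gr

def generate_batch(prompts: list[str]) -> list[list[str]]:
    # With batch=True, gradio hands the function a list of queued inputs
    # (up to max_batch_size) and expects one list of outputs per output component.
    time.sleep(1)  # placeholder for a single batched model forward pass
    return [[f"response to: {p}" for p in prompts]]

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="prompt")
    output = gr.Textbox(label="output")
    btn = gr.Button("Generate")
    btn.click(generate_batch, prompt, output, batch=True, max_batch_size=16)

demo.queue()  # batching requires the queue to be enabled
demo.launch()
```

This only batches requests that happen to be queued together, which as noted is not the same as the continuous batching TGI does.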
Inference API GM: https://gpt-gm.h2o.ai/conversation/648815a35842dbd369f8407e
Seems to scale very well with concurrent requests, using continuous batching. For just one user, it also gets ~4 tokens/second on an A100 40GB.
chat=False/streaming=False on an A6000 gets ~3.5 tokens/second, but streaming=chat=True gets ~2.2 tokens/second.
Maybe gradio should just call the HF text-generation-inference server via its Python client: https://github.com/huggingface/text-generation-inference/tree/main/clients/python#hugging-face-inference-endpoint-usage
or via a langchain LLM: https://python.langchain.com/en/latest/modules/models/llms/integrations/huggingface_pipelines.html (rough sketches of both below).
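A rough sketch of the first option, using the `text-generation` Python client against a TGI server (the local address and generation parameters are assumptions):

```python
from text_generation import Client

# Assumes a text-generation-inference server is already running at this address.
client = Client("http://127.0.0.1:8080")

# Non-streaming generation.
print(client.generate("Why is the sky blue?", max_new_tokens=64).generated_text)

# Streaming generation; gradio could yield these partial texts back to the UI.
text = ""
for response in client.generate_stream("Why is the sky blue?", max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
print(text)
```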
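And a rough sketch of the langchain route, following the linked HuggingFacePipeline example (the model id and kwargs are just the ones from that doc; note this loads the model locally rather than calling the inference server):

```python
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

# Model id taken from the linked langchain example; h2oGPT would substitute its own model.
llm = HuggingFacePipeline.from_model_id(
    model_id="bigscience/bloom-1b7",
    task="text-generation",
    model_kwargs={"temperature": 0, "max_length": 64},
)

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("What is electroencephalography?"))
```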