chat=False, streaming=True:
225 tokens in 1:52 -> ~2 tokens/s. So the slowdown seems to be entirely streaming's fault, not chat's.
chat=True, streaming=False:
243 tokens in 1:06 -> ~3.7 tokens/s. So again it all seems to be streaming's fault, as I worried.
However, the odd thing is that generation gets progressively slower, even with processing disabled. But processing is still occurring, since I can see the code path being hit. So the slow generation may still be due to style checks; I'll try disabling all style checking.
No sanitization of the response, back to chat=True, streaming=True:
243 tokens in 1:09 -> ~3.5 tokens/s
So most of the issue is sanitization, i.e. the profanity filter.
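To illustrate why that would also explain the progressive slowdown: a minimal sketch (the `sanitize()` below is a hypothetical stand-in for the profanity filter, not the actual h2oGPT code) comparing per-chunk sanitization of the growing response during streaming against sanitizing once at the end.

```python
import time

def sanitize(text: str) -> str:
    # Hypothetical stand-in for the profanity filter; assume its cost grows with len(text).
    time.sleep(1e-5 * len(text))
    return text

def stream_tokens(n_tokens: int = 250):
    # Fake generator standing in for the model's token stream.
    for _ in range(n_tokens):
        yield "tok "

def per_chunk_sanitize() -> float:
    # Re-sanitize the full accumulated text on every streamed chunk: O(n^2) total work,
    # so each chunk gets slower as the response grows.
    t0 = time.time()
    out = ""
    for tok in stream_tokens():
        out += tok
        out = sanitize(out)
    return time.time() - t0

def final_sanitize() -> float:
    # Sanitize once after generation: O(n) total work.
    t0 = time.time()
    out = "".join(stream_tokens())
    sanitize(out)
    return time.time() - t0

if __name__ == "__main__":
    print(f"per-chunk sanitize: {per_chunk_sanitize():.2f}s")
    print(f"final sanitize:     {final_sanitize():.2f}s")
```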
Original issue, supposedly fixed: https://github.com/gradio-app/gradio/issues/4092
Streaming may still be a blocker; it seems to get slower near the end of generation, though that may just be the sanitization issue above.
Maybe we can also batch in gradio: https://gradio.app/setting-up-a-demo-for-maximum-performance/#the-max_batch_size-parameter (sketch below), but that is probably still not continuous batching.
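For reference, a minimal sketch of gradio request batching per the max_batch_size docs linked above; the batched function here is a placeholder, not h2oGPT's actual generate path:

```python
import time
import gradio as gr

def generate_batch(prompts: list[str]) -> list[list[str]]:
    # With batch=True, gradio hands the function a list of queued inputs
    # (up to max_batch_size) and expects one list of outputs per output component.
    time.sleep(1)  # placeholder for a single batched model forward pass
    return [[f"response to: {p}" for p in prompts]]

with gr.Blocks() as demo:
    prompt = gr.Textbox(label="prompt")
    output = gr.Textbox(label="output")
    btn = gr.Button("Generate")
    btn.click(generate_batch, prompt, output, batch=True, max_batch_size=16)

demo.queue()  # batching requires the queue to be enabled
demo.launch()
```

This only batches requests that happen to be queued together, which as noted is not the same as the continuous batching TGI does.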
Inference API GM: https://gpt-gm.h2o.ai/conversation/648815a35842dbd369f8407e
Seems to scale very well with concurrent requests, using continuous batching. For just one user, it also gets ~4 tokens/second on an A100 40GB.
chat=False/streaming=False on an A6000 gets ~3.5 tokens/second, but streaming=chat=True gets ~2.2 tokens/second.
Maybe gradio should just call the HF text-generation-inference server via its Python client: https://github.com/huggingface/text-generation-inference/tree/main/clients/python#hugging-face-inference-endpoint-usage
or via a langchain LLM: https://python.langchain.com/en/latest/modules/models/llms/integrations/huggingface_pipelines.html (rough sketches of both below).
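A rough sketch of the first option, using the `text-generation` Python client against a TGI server (the local address and generation parameters are assumptions):

```python
from text_generation import Client

# Assumes a text-generation-inference server is already running at this address.
client = Client("http://127.0.0.1:8080")

# Non-streaming generation.
print(client.generate("Why is the sky blue?", max_new_tokens=64).generated_text)

# Streaming generation; gradio could yield these partial texts back to the UI.
text = ""
for response in client.generate_stream("Why is the sky blue?", max_new_tokens=64):
    if not response.token.special:
        text += response.token.text
print(text)
```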
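And a rough sketch of the langchain route, following the linked HuggingFacePipeline example (the model id and kwargs are just the ones from that doc; note this loads the model locally rather than calling the inference server):

```python
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

# Model id taken from the linked langchain example; h2oGPT would substitute its own model.
llm = HuggingFacePipeline.from_model_id(
    model_id="bigscience/bloom-1b7",
    task="text-generation",
    model_kwargs={"temperature": 0, "max_length": 64},
)

template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(prompt=prompt, llm=llm)
print(llm_chain.run("What is electroencephalography?"))
```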