mbellmbell opened this issue 4 years ago

I am running the 124M model on a V100 GPU, and it takes about 6 seconds to execute gpt2.generate(..., length=50, ...) to return a single prediction. If I set nsamples=100, batch_size=100, it returns a list of 100 predictions in about the same 6 seconds.

Is there a more or less fixed setup cost for each generate() request? How could I use this for something like a chatbot, where I need a single sub-second response?
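For context, this is roughly the call pattern in question (a minimal sketch, assuming the 124M base model has already been fetched with gpt2.download_gpt2()):

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name="124M")  # base model; use run_name=... for a fine-tuned checkpoint

# Single sample: ~6 s end to end in the setup described above
single = gpt2.generate(sess, model_name="124M", length=50,
                       nsamples=1, return_as_list=True)

# 100 samples in one call take roughly the same wall-clock time,
# which points to a large fixed per-call cost rather than a per-sample cost
batch = gpt2.generate(sess, model_name="124M", length=50,
                      nsamples=100, batch_size=100, return_as_list=True)
```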
I'm doing something similar. Chatbot channels usually give you about 5 seconds before they time out. In any case, the solution I'm using is to reply "wait a sec" first and then follow up with the real reply.

Just a workaround.

George
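Roughly, that workaround looks like this (a sketch only; send_reply and generate_reply are placeholders for whatever the chatbot framework and the GPT-2 wrapper actually provide):

```python
import threading

def handle_message(channel, user_text, send_reply, generate_reply):
    # send_reply(channel, text) and generate_reply(prompt) are hypothetical
    # placeholders for the chatbot framework and the model wrapper.

    # 1) Acknowledge immediately so the channel's ~5 s timeout isn't hit
    send_reply(channel, "wait a sec...")

    # 2) Produce the real answer in the background and post it as a follow-up
    def worker():
        send_reply(channel, generate_reply(user_text))

    threading.Thread(target=worker, daemon=True).start()
```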
Thanks. Sounds like a challenging UX. I've seen sub-second response times with the Hugging Face implementation (https://convai.huggingface.co/), but for our current POC we'd like to stick with gpt-2-simple.
Hugging Face Transformers generates slightly differently, but better; I'm using it for my next-gen text-generation package. Unfortunately, those tricks won't work here.
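For comparison, generation on the Transformers side looks roughly like this (a minimal sketch; argument names may vary slightly between releases, and "gpt2" here is the 124M base model rather than a fine-tuned checkpoint):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # 124M base model
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

input_ids = tokenizer.encode("Hello, how are you?", return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=50,
                            do_sample=True, top_k=40, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```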
@minimaxir I'm playing with GPT-2 in Transformers, and performance seems pretty linear with length: length=10 takes about 1 second, length=100 about 10 seconds, that sort of thing. With gpt-2-simple there seems to be a fixed overhead, and then length has a more modest impact. At around length=100 the two seem to be about the same for a single sample.

Does that make sense?
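A quick way to check that scaling (sketch only; gen_fn stands in for either library's generate call):

```python
import time

def time_generation(gen_fn, lengths=(10, 50, 100)):
    # gen_fn(length=n) is a placeholder for gpt2.generate(...) or model.generate(...)
    for n in lengths:
        start = time.time()
        gen_fn(length=n)
        print(f"length={n}: {time.time() - start:.2f} s")
```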
@GeoMmm @minimaxir FWIW, final generation is ~300 ms on a single V100 with a fine-tuned small GPT-2 model using transformers/PyTorch.
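A sketch of the kind of setup that gets single-sample latency into that range (assumptions: fine-tuned weights exported to a local Transformers-compatible directory, whose path below is only a placeholder, and optional fp16):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda"  # e.g. a single V100
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# "finetuned-gpt2/" is a hypothetical path to the fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("finetuned-gpt2/").to(device).eval()
model = model.half()  # optional fp16; reduces memory traffic on the GPU

def reply(prompt, max_length=50):
    ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(ids, max_length=max_length, do_sample=True, top_k=40)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```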