minimaxir / gpt-2-simple

Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

How to maximize performance for single generation? #188

Open mbellmbell opened 4 years ago

mbellmbell commented 4 years ago

I am running the 124M model on a V100 GPU, and it takes about 6 seconds to execute gpt2.generate(..., length=50, ...) to return a single prediction. If I set nsamples=100, batch_size=100, it returns a list of 100 predictions in just about the same 6 seconds.
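For reference, the two calls I'm comparing look roughly like this (a sketch assuming a model already fine-tuned into the default checkpoint/run1):

```python
import gpt_2_simple as gpt2

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name='run1')  # fine-tuned 124M checkpoint

# Single prediction: ~6 seconds on a V100.
single = gpt2.generate(sess, run_name='run1', length=50,
                       return_as_list=True)[0]

# 100 predictions in one batch: roughly the same ~6 s of wall time.
batch = gpt2.generate(sess, run_name='run1', length=50,
                      nsamples=100, batch_size=100,
                      return_as_list=True)
```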

Is there a more or less fixed setup cost for each generate() request? How could I use this for something like a chatbot, where I need a single response in under a second?

GeoMmm commented 4 years ago

I'm doing something similar. Chatbot channels usually give you 5 seconds before they time out. In any case, the solution I'm using is to reply "wait a sec" immediately and then follow up with the real reply.
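A minimal sketch of that pattern, where send_message() is a hypothetical stand-in for your chat channel's send API and sess is an already-loaded gpt-2-simple session:

```python
import threading

import gpt_2_simple as gpt2

def handle_incoming(user_id, text):
    # Acknowledge immediately so the channel's ~5-second timeout isn't hit.
    send_message(user_id, "wait a sec...")  # hypothetical channel API

    def generate_and_reply():
        reply = gpt2.generate(sess, length=50, prefix=text,
                              return_as_list=True)[0]
        send_message(user_id, reply)

    # Run the slow generation off the request thread.
    threading.Thread(target=generate_and_reply).start()
```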

Just a workaround.

George


mbellmbell commented 4 years ago

> I'm doing something similar. Chatbot channels usually give you 5 seconds before they time out. In any case, the solution I'm using is to reply "wait a sec" immediately and then follow up with the real reply.

Thanks. Sounds like a challenging UX. I've seen sub-second response times from the Hugging Face implementation (https://convai.huggingface.co/), but for our current POC we'd like to stick with gpt-2-simple.

minimaxir commented 4 years ago

Hugging Face Transformers generates text slightly differently, but better; I'm using it for my next-gen text-generation package. Unfortunately, those tricks won't work here.

mbellmbell commented 4 years ago

@minimaxir I'm playing with GPT-2 via Transformers, and performance seems pretty linear with length: length=10 takes about 1 second, length=100 about 10 seconds. With gpt-2-simple there seems to be a fixed overhead, after which length has a more modest impact. At around length=100, the two have about the same performance for a single sample.
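For reference, I'm timing with a rough harness along these lines (assuming sess is a loaded gpt-2-simple session and model/tokenizer are a transformers GPT2LMHeadModel and GPT2Tokenizer already on the GPU):

```python
import time

def time_gpt2_simple(length):
    start = time.time()
    gpt2.generate(sess, length=length, return_as_list=True)
    return time.time() - start

def time_transformers(length):
    input_ids = tokenizer.encode("Hello", return_tensors="pt").to("cuda")
    start = time.time()
    model.generate(input_ids, max_length=length, do_sample=True)
    return time.time() - start

# Compare wall-clock time as generation length grows.
for n in (10, 50, 100):
    print(n, round(time_gpt2_simple(n), 2), round(time_transformers(n), 2))
```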

Does that make sense?

mbellmbell commented 4 years ago

@GeoMmm @minimaxir FWIW, final generation is ~300 ms on a single V100 with a fine-tuned small (124M) GPT-2 model using transformers/PyTorch.
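In case it's useful to anyone else, the transformers/PyTorch path looks roughly like this (the checkpoint path and sampling parameters here are illustrative, not our exact settings):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# "./finetuned-gpt2" is an illustrative path to a fine-tuned 124M checkpoint.
model = GPT2LMHeadModel.from_pretrained("./finetuned-gpt2").to("cuda")
model.eval()

input_ids = tokenizer.encode("Hello, how are you?",
                             return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(input_ids, max_length=50, do_sample=True,
                            top_k=40, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```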