astariul opened this issue 4 years ago
Thanks for the question! We have not run a detailed analysis of the inference speed, but it is slower than normal inference because of the gradient-based updates to the activations. We are working on an extension that alleviates some of this, but it does get slower as the number of gradient updates increases.
(not an issue or resolution, just a note)
I'm also super grateful you've open-sourced this! It's a very creative approach to perturb the past and rerun iteratively.
I've productionized this, figured I'd share some learnings:
In short, running this setup in production is tough; you can get decent speeds (5+ words per second with smaller GPT-2 models on a GPU), but concurrent calls will queue since the Flask server only has one worker.
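For what it's worth, the usual workaround for the single-worker queueing is to serve the app under a WSGI server with multiple workers. This is a sketch, not the repo's actual setup: the module and app names (`app:app`) are hypothetical, and each worker loads its own copy of the model, so GPU memory limits how many you can run.

```shell
# Hypothetical invocation: assumes a Flask app object named `app` in app.py.
# --timeout is raised because PPLM-style generation is slow per request.
# Each worker duplicates the model in memory; start with 2 and watch GPU usage.
gunicorn --workers 2 --timeout 120 --bind 0.0.0.0:8000 app:app
```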
To directly answer the question: if I understand this code correctly, the inference cost is roughly (1 + num_iterations) times that of simply calling the model as-is. That's under the simplifying assumption that the model's forward pass accounts for 100% of the total inference time.
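As a back-of-the-envelope check, that estimate can be written out directly. This is illustrative only (the function name and the millisecond figures are made up, not from the repo): one ordinary forward pass plus `num_iterations` gradient-update passes over the perturbed activations.

```python
def pplm_time_estimate(base_forward_ms: float, num_iterations: int) -> float:
    """Estimated per-token cost of PPLM-style decoding, in ms.

    Simplifying assumption (as above): the model forward pass is 100% of
    inference time, so total cost is one normal pass plus one pass per
    gradient-update iteration.
    """
    return (1 + num_iterations) * base_forward_ms

# Illustrative numbers: 3 perturbation iterations quadruple the per-token cost.
assert pplm_time_estimate(50, 3) == 200   # 50 ms/token -> 200 ms/token
assert pplm_time_estimate(50, 0) == 50    # no iterations: same as plain decoding
```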
Thanks for open-sourcing the code!
This approach is very interesting, but I'm curious about the impact on performance (inference speed).
Is there any benchmark showing the impact on performance with different parameters?