astariul opened this issue 4 years ago
Thanks for the question! We have not run a detailed analysis of the inference speed, but it is slower than normal inference because of the gradient-based updates to the activations. We are working on an extension that alleviates some of this, but it does get slower as the number of gradient updates increases.
(not an issue or resolution, just a note)
I'm also super grateful you've open-sourced this! It's a very creative approach to perturb the past and rerun iteratively.
I've productionized this, figured I'd share some learnings:
In short, running this setup in production is tough; you can get decent speeds (5+ words per second with smaller GPT-2 models on a GPU), but concurrent calls will queue since the Flask server only has one worker.
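For what it's worth, the usual workaround for the single-worker queueing is to serve the app under a WSGI server with multiple workers. This is a sketch, not the repo's actual setup: the module and app names (`app:app`) are hypothetical, and each worker loads its own copy of the model, so GPU memory limits how many you can run.

```shell
# Hypothetical invocation: assumes a Flask app object named `app` in app.py.
# --timeout is raised because PPLM-style generation is slow per request.
# Each worker duplicates the model in memory; start with 2 and watch GPU usage.
gunicorn --workers 2 --timeout 120 --bind 0.0.0.0:8000 app:app
```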
To directly answer the question: if I understand this code correctly, the inference cost is roughly (1 + num_iterations) times that of simply calling the model as-is. That's under the simplifying assumption that the model's forward pass accounts for 100% of the total inference time.
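As a back-of-the-envelope check, that estimate can be written out directly. This is illustrative only (the function name and the millisecond figures are made up, not from the repo): one ordinary forward pass plus `num_iterations` gradient-update passes over the perturbed activations.

```python
def pplm_time_estimate(base_forward_ms: float, num_iterations: int) -> float:
    """Estimated per-token cost of PPLM-style decoding, in ms.

    Simplifying assumption (as above): the model forward pass is 100% of
    inference time, so total cost is one normal pass plus one pass per
    gradient-update iteration.
    """
    return (1 + num_iterations) * base_forward_ms

# Illustrative numbers: 3 perturbation iterations quadruple the per-token cost.
assert pplm_time_estimate(50, 3) == 200   # 50 ms/token -> 200 ms/token
assert pplm_time_estimate(50, 0) == 50    # no iterations: same as plain decoding
```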
Thanks for open-sourcing the code!
This approach is very interesting, but I'm curious about the impact on performance (inference speed).
Is there any benchmark showing the impact on performance with different parameters?