Closed: xuyxu closed this issue 1 year ago
Hey @xuyxu, it's cool to hear that you're integrating functorch with Ensemble-PyTorch.
The CPU results are expected. PyTorch's convolution kernels are not well optimized for CPU, and changing the kernel can lead to very different performance characteristics. vmap ends up rewriting the convolution call into a different convolution call that, unfortunately, appears to be slower.
For CUDA: what gpu are you benchmarking on? Is there an easy way for us to repro your results? Using your input sizes, and without the Ensemble-PyTorch library, I compared the vmap ensembling approach to a for-loop ensembling approach on an A100 GPU, and it looks to be significantly faster (https://gist.github.com/zou3519/98e69289ba28f80247039723d073ef07). Though I'm not completely sure this is what your code is doing under the hood.
(pt1.13) [0] rzou@a100-st-p4d24xlarge-55:~ $ python foo.py
<torch.utils.benchmark.utils.common.Measurement object at 0x7f386b623f70>
vmap_inference()
setup: from __main__ import vmap_inference
860.66 us
1 measurement, 1000 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7f386b6237c0>
forloop_inference()
setup: from __main__ import forloop_inference
2.11 ms
1 measurement, 1000 runs , 1 thread
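To make the comparison concrete, here is a minimal sketch of vmap ensembling versus for-loop ensembling, assuming the functorch API from the PyTorch 1.13 era (`combine_state_for_ensemble` + `vmap`). The `SmallNet` model and sizes are illustrative placeholders, not the exact code from the gist:

```python
import torch
import torch.nn as nn
from functorch import combine_state_for_ensemble, vmap

# Hypothetical small CNN standing in for LeNet-5 on CIFAR-10 (3x32x32 inputs).
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, 5)
        self.fc = nn.Linear(6 * 28 * 28, 10)

    def forward(self, x):
        return self.fc(torch.relu(self.conv(x)).flatten(1))

models = [SmallNet().eval() for _ in range(5)]
x = torch.randn(64, 3, 32, 32)

# For-loop ensembling: run each model separately and average the outputs.
loop_out = torch.stack([m(x) for m in models]).mean(0)

# vmap ensembling: stack the 5 models' parameters and do one batched call.
# in_dims=(0, 0, None) batches over params/buffers but shares the input x.
fmodel, params, buffers = combine_state_for_ensemble(models)
vmap_out = vmap(fmodel, in_dims=(0, 0, None))(params, buffers, x).mean(0)

print(torch.allclose(loop_out, vmap_out, atol=1e-4))
```

The two approaches compute the same ensemble prediction; the difference is that vmap dispatches a single batched kernel instead of 5 sequential ones, which is where the GPU speedup comes from.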
Thanks @zou3519. We will first add the vectorize API; stay tuned.
Hi,
After noticing this nice package in the PyTorch release notes, we are working to integrate it into our repo Ensemble-Pytorch, a member of the PyTorch ecosystem focused on state-of-the-art ensemble methods.
Following the introduction on model ensembling, here is our code snippet for runtime benchmarking. The snippet trains 5 simple LeNet-5 models on CIFAR-10, then measures the runtime on `test_loader` using functorch and the original `forward` method. The result is kind of strange: the performance gain is marginal compared to the official documentation. I would appreciate it very much if anyone could tell me where this goes wrong. Thanks!
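The kind of measurement described above can be sketched as follows, assuming functorch's `combine_state_for_ensemble`/`vmap` and `torch.utils.benchmark`. The `make_net` model, batch size, and run count are placeholders, not the actual snippet:

```python
import torch
import torch.nn as nn
from torch.utils import benchmark
from functorch import combine_state_for_ensemble, vmap

# Placeholder network standing in for LeNet-5; input shaped like CIFAR-10.
def make_net():
    return nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(),
                         nn.Linear(8 * 30 * 30, 10))

models = [make_net().eval() for _ in range(5)]
x = torch.randn(128, 3, 32, 32)
fmodel, params, buffers = combine_state_for_ensemble(models)

def forloop_inference():
    # Original forward: call each base estimator in turn and average.
    return torch.stack([m(x) for m in models]).mean(0)

def vmap_inference():
    # functorch: one batched call over the stacked parameters.
    return vmap(fmodel, in_dims=(0, 0, None))(params, buffers, x).mean(0)

for fn in (forloop_inference, vmap_inference):
    timer = benchmark.Timer(stmt="fn()", globals={"fn": fn})
    print(fn.__name__, timer.timeit(50))
```

Note that on CPU the vmap path can come out slower (as discussed above), so a speedup here is mainly expected on GPU.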