pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

TorchScript Performance: 150x gap between TorchScript and Native Python #30365

Open divyekapoor opened 4 years ago

divyekapoor commented 4 years ago

🐛 Bug

There's a 150x gap in performance for TorchScript ops versus straight Python / C++. Looping over 100K numbers takes 2+ seconds instead of 18ms or better. Please see the benchmarks here: https://github.com/divyekapoor/ml-op-benchmarks

To Reproduce

https://github.com/divyekapoor/ml-op-benchmarks

Steps to reproduce the behavior:

  1. Clone the repo
  2. make torchbench

See related TensorFlow issue for context: https://github.com/tensorflow/tensorflow/issues/34500

Expected behavior

FizzBuzz iteration count: 100,000

| Method | Raw Latency (ms) | Per-Run Latency (usec) | Python Multiplier | C++ Multiplier |
| --- | --- | --- | --- | --- |
| PyTorch Python | 4007 | 40.07 | 222.61 | 23851 |
| PyTorch TorchScript Python (from Loaded TorchScript) | 2830 | 28.3 | 157.22 | 16845 |
| PyTorch TorchScript C++ (Native) | 255 | 2.55 | 14.17 | 1518 |
| PyTorch TorchScript C++ (Native + ATen Tensors) | 252 | 2.52 | 14.00 | 1500 |
| Raw Python | 18 | 0.18 | 1.00 | 107 |
| Raw C++ | 0.168 | 0.00168 | 0.01 | 1 |

Performance similar to raw Python is the expected behavior.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
$ python3 /tmp/collect_env.py
Collecting environment information...
PyTorch version: 1.3.0.post2
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.15.5

Python version: 3.7
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.17.4
[pip3] torch==1.3.0.post2
[conda] Could not collect

Additional context

Code:

import torch

class TorchFizzBuzz(torch.nn.Module):
    def __init__(self):
        super(TorchFizzBuzz, self).__init__()
        self.fizz = torch.tensor(0, requires_grad=False)
        self.buzz = torch.tensor(0, requires_grad=False)
        self.fizzbuzz = torch.tensor(0, requires_grad=False)

    def forward(self, n: torch.Tensor):
        i = torch.tensor(0, dtype=torch.int32, requires_grad=False)
        self.fizz = torch.zeros(1)
        self.buzz = torch.zeros(1)
        self.fizzbuzz = torch.zeros(1)
        while i < n:
            if i % 6 == 0:
                self.fizzbuzz += 1
            elif i % 3 == 0:
                self.buzz += 1
            elif i % 2 == 0:
                self.fizz += 1
            i += 1
        return torch.stack([self.fizz, self.buzz, self.fizzbuzz])
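
For reference, roughly how this module is exercised in the benchmark (a minimal sketch; the exact harness is in the linked repo, and the file name below is just illustrative):

import time
import torch

scripted = torch.jit.script(TorchFizzBuzz())
scripted.save("fizzbuzz.pt")            # illustrative file name
loaded = torch.jit.load("fizzbuzz.pt")  # the "from Loaded TorchScript" rows presumably time the reloaded module

n = torch.tensor(100000)
start = time.time()
loaded(n)
print("Raw latency (ms):", (time.time() - start) * 1000)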

cc @suo

Evpok commented 4 years ago

To be fair, while it can obviously be done, having forward produce side effects (here, setting attributes) is not the most common use case.

divyekapoor commented 4 years ago

@Evpok Even without the side effects, the performance gap is consistent. Check out https://github.com/divyekapoor/ml-op-benchmarks and change the code if you'd prefer:

Outcomes:

Time (PyTorch) (ms):  4097.000675
Time (PyTorch optimized=True) (ms):  3982.672392
Time (PyTorch optimized=False) (ms):  4017.969171
Time (PyTorch from Loaded) (ms):  2879.079591
Time taken (Python3) (ms):  18.797112

Code:

class TorchFizzBuzz(torch.nn.Module):
    def __init__(self):
        super(TorchFizzBuzz, self).__init__()

    def forward(self, n: torch.Tensor):
        i = torch.tensor(0, dtype=torch.int32, requires_grad=False)
        fizz = torch.zeros(1)
        buzz = torch.zeros(1)
        fizzbuzz = torch.zeros(1)
        while i < n:
            if i % 6 == 0:
                fizzbuzz += 1
            elif i % 3 == 0:
                buzz += 1
            elif i % 2 == 0:
                fizz += 1
            i += 1
        return torch.stack([fizz, buzz, fizzbuzz])
divyekapoor commented 4 years ago

@Evpok From the discussion with the TensorFlow folks (tensorflow/tensorflow#34500), we generated a NumPy baseline, in case that's preferable.

FizzBuzz iteration count: 100,000

| Method | Latency (ms) | Iteration Latency (usec) | Python Multiplier | C++ Multiplier |
| --- | --- | --- | --- | --- |
| Tensorflow Python | 4087 | 40.87 | 227.06 | 24327 |
| Tensorflow Saved Model Python | 4046 | 40.46 | 224.78 | 24083 |
| Tensorflow Python no Autograph | 3981 | 39.81 | 221.16 | 23696 |
| PyTorch Python | 4007 | 40.07 | 222.61 | 23851 |
| PyTorch TorchScript Python (from Loaded TorchScript) | 2830 | 28.3 | 157.22 | 16845 |
| NumPy Python | 420 | 4.2 | 23.3 | 2500 |
| PyTorch TorchScript C++ (Native) | 255 | 2.55 | 14.17 | 1518 |
| PyTorch TorchScript C++ (Native + ATen Tensors) | 252 | 2.52 | 14.00 | 1500 |
| Raw Python | 18 | 0.18 | 1.00 | 107 |
| Raw C++ | 0.168 | 0.00168 | 0.01 | 1 |

xsacha commented 4 years ago

Why is the NumPy version faster than the TorchScript C++ one per iteration, but slower over 100,000 iterations? It seems like one of those numbers is off by a factor of 10. Is the per-iteration latency meant to be 4.2 µs?

divyekapoor commented 4 years ago

Yes. Fixed. Thanks for pointing it out.

divyekapoor commented 4 years ago

A colleague also got a vectorized TorchScript implementation set up with a ~8 ms baseline (beating the 18 ms from raw Python), so that would be something to consider as a reference implementation. The discussion on the equivalent TF bug is also quite useful - they have some experimental workarounds.
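
For illustration, a minimal sketch of what such a vectorized TorchScript version could look like (hypothetical, not the colleague's actual code; it produces the same counts with a handful of tensor ops instead of ~100,000 scalar tensor ops):

import torch

@torch.jit.script
def fizzbuzz_counts(n: int) -> torch.Tensor:
    i = torch.arange(n)
    is_fizzbuzz = i % 6 == 0
    is_buzz = (i % 3 == 0) & ~is_fizzbuzz
    is_fizz = (i % 2 == 0) & ~is_fizzbuzz & ~is_buzz
    # Same elif-chain semantics as the loop version, expressed as boolean masks.
    return torch.stack([is_fizz.sum(), is_buzz.sum(), is_fizzbuzz.sum()])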

@xsacha @suo Could you indicate the next steps?

suo commented 4 years ago

Hey, @divyekapoor I'd be interested to know the ultimate use case you're benchmarking for.

The reason I ask is that PyTorch is poorly optimized for doing lots of computations on scalar values—as mentioned on the TF issue, these libraries are typically targeted toward doing operations on large tensors, where the per-op overhead is dwarfed by the operator computation itself.

As such, if something fizzbuzz-like is similar to your use case, you're unlikely to get performance comparable to just writing it in C++. In other words, you are paying the cost of using PyTorch (overhead) without benefiting from its core features (autograd, rich tensor library, etc.).
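
To make the per-op-overhead point concrete, a small illustrative timing sketch (not from this issue; absolute numbers vary by machine):

import time
import torch

# 100,000 tiny ops: dispatch and bookkeeping overhead is paid on every iteration.
start = time.time()
acc = torch.zeros(1)
for _ in range(100000):
    acc += 1
print("100k adds on a 1-element tensor (ms):", (time.time() - start) * 1000)

# One op over 100,000 elements: the same overhead is paid exactly once.
start = time.time()
out = torch.zeros(100000) + 1
print("one add over a 100k-element tensor (ms):", (time.time() - start) * 1000)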

That said, a few thoughts on this particular case:

divyekapoor commented 4 years ago

Thanks for the detailed reply @suo! Our use case is cross features for some of our online serving models, where the features cannot be prematerialized (think UserContext x ImageFeatures, where both users and images are large sets, O(millions/billions)). Think of these powering something like the Instagram or Pinterest feed.

For the purposes of our discussion, assume an LR model where users have some topic affinities and images have some topic affinities. Both are sparse vectors of the form { topic: weight }. The cross feature is the dot product after some sanitization, thresholding, normalization, and boosts. The dot product itself is easy to vectorize, but everything around it is regular control flow (e.g., boost the feature value by 2x if both sides have more than 3 matches from a given list [a, b, c]; reduce it by 0.5 if there are more than 7 matches from some other list; another feature might count matches against a hardcoded feature subset; and so on).

To be clear, this is a hypothetical illustrative example. The actual cross features are currently written as custom C++ in our serving binary by model engineers, and they are quite varied. However, given the user and image inputs, we'd like the serving binary to never know how to generate these crosses (it should all be part of the model code). Model engineers can then be more productive (no C++), and the infra simplifies (everything downstream deals only with non-cross features as inputs, even though those inputs may be somewhat complex, e.g. maps). Similarly, on the training side, the cross features can be backfilled or tuned (again, no materialized cross features). The end goal is non-trivial cross features produced in-model during execution, using some light control-flow ops, without lots of overhead.
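
As a purely illustrative sketch (all names, types, and thresholds below are hypothetical), the kind of in-model cross feature I mean might look like this in TorchScript:

import torch
from typing import Dict, List

@torch.jit.script
def topic_cross(user: Dict[str, float], image: Dict[str, float],
                boost_topics: List[str]) -> torch.Tensor:
    # Dot product over the intersection of the sparse { topic: weight } maps.
    score = 0.0
    matches = 0
    for topic in user.keys():
        if topic in image:
            score += user[topic] * image[topic]
            matches += 1
    # "Light control flow" around the dot product: hypothetical boosts and thresholds.
    boost_matches = 0
    for topic in boost_topics:
        if topic in user and topic in image:
            boost_matches += 1
    if boost_matches > 3:
        score *= 2.0
    if matches > 7:
        score *= 0.5
    return torch.tensor([score])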

(Written on mobile, happy to add more context in a bit)

ngimel commented 4 years ago

Is it necessary to use a tensor as the loop counter and for the control-flow conditions, or does that make the benchmark more representative? TorchScript is fine with Python numbers, and rewriting the benchmarked function as

class TorchFizzBuzz(torch.nn.Module):
    def __init__(self):
        super(TorchFizzBuzz, self).__init__()

    def forward(self, n: int):
        i = 0
        fizz = torch.zeros(1)
        buzz = torch.zeros(1)
        fizzbuzz = torch.zeros(1)
        one = torch.ones(1)
        while i < n:
            if i % 6 == 0:
                fizzbuzz += one
            elif i % 3 == 0:
                buzz += one
            elif i % 2 == 0:
                fizz += one
            i += 1
        return torch.stack([fizz, buzz, fizzbuzz])

gives

Time (PyTorch from Loaded) (ms):  135.470804
xsacha commented 4 years ago

I think the point of this issue is to illustrate how much slower a tensor is so that such bottlenecks can be avoided.

Yes you could replace it with an int, but the point is using a tensor to count should be similar in speed.

No one here is actually running FizzBuzz; we use much more complicated models that exhibit the same issues, but they are too complex to post here, and the other work in those models (such as matrix multiplication) makes it hard to identify control flow as the slowdown.

ngimel commented 4 years ago

I don't think that using a tensor to count should be similar in speed - it would be a nice bonus if it were, but no one is making you use tensors everywhere. Tensors should be used where they make sense, and not used where they don't. Also, the native C++ torch benchmarks listed here don't use tensors for loops and control flow (https://github.com/divyekapoor/ml-op-benchmarks/blob/master/torch_fizz.cc#L17-L37), so in that sense it's not an apples-to-apples comparison.

divyekapoor commented 4 years ago

@xsacha @ngimel - I've updated the benchmarks to address @ngimel's comments about apples-to-apples comparisons.

The C++ API now has one setup with a Tensor based loop and the other one with a native loop counter.

Point to note: the benchmark is meant to illustrate that Tensor-based ops are hundreds of times slower than basic Python code. Even at 135 ms, the PyTorch version of the program is still ~7x slower than raw Python (135 ms vs 18 ms).

The root cause is the slow Tensor class: the torch::Tensor-based loop takes ~2700 ms to complete, while a loop with a native counter takes just ~200 ms (and raw C++ takes just microseconds).

I'm not sure what's driving the slowness with the torch::Tensor class - what would be the best way to investigate?

Apples to Apples links: https://github.com/divyekapoor/ml-op-benchmarks/blob/master/torch_fizz.cc#L17-L37
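
One way to start digging into where that time goes (a sketch using the built-in autograd profiler; the per-op rows should show how much of the total is framework overhead on many tiny ops):

import torch

scripted = torch.jit.script(TorchFizzBuzz())
n = torch.tensor(100000)

with torch.autograd.profiler.profile() as prof:
    scripted(n)

# Aggregate statistics per op; the many small adds/comparisons dominate the table.
print(prof.key_averages().table(sort_by="cpu_time_total"))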

eliasffyksen commented 2 years ago

EDIT: After rereading the issue I'm not sure how related this is, but maybe someone here can point me in the right direction

I'm also interested in this issue. I'm relatively new to PyTorch, so take everything I'm saying with a grain of salt, but I thought I'd post my findings here in case someone knows what's going on, because I don't.

I'm currently looking at this because I'm doing RL and writing simple RL environments in TorchScript, since that is easier for rapid development/prototyping than writing my environments in C++.

I've written this little benchmark for looking at different approaches: https://github.com/eliasffyksen/RL-env-bench/blob/main/main.py

So far, this is what I've found: [performance graph]

Short explanation of the code

The code generates a grid-world environment, and the model just picks randomly among the different actions (still, up, down, left, right). All the code is in main.py in the repo. Batch size is the number of time steps to simulate in the environment.
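
(For context, a minimal hypothetical sketch of the kind of batched grid-world step I mean - not the actual code from the repo:)

import torch

@torch.jit.script
def grid_step(pos: torch.Tensor, actions: torch.Tensor, size: int) -> torch.Tensor:
    # actions: 0 = still, 1 = up, 2 = down, 3 = left, 4 = right
    moves = torch.tensor([[0, 0], [0, -1], [0, 1], [-1, 0], [1, 0]])
    new_pos = pos + moves[actions]
    # Keep every agent inside the grid.
    return new_pos.clamp(0, size - 1)

pos = torch.zeros((8, 2), dtype=torch.long)
actions = torch.randint(0, 5, (8,))
pos = grid_step(pos, actions, 10)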

Different graphs: [plots omitted]

A few notes:

What surprises me is that there is such a big difference, which seems to scale with batch size, between running it as a C++ extension and as a standalone C++ executable, even though both use torch::jit::load to load the same JIT-scripted module from a file.

Are they actually using different torch::jit::load functionality, or could it be that Python is interrupting the process at some interval? If the latter, would this be the case for C++ extensions in general?

I will try rewriting the environment in C++ and executing it both in a C++ extension and in a standalone C++ executable, calling the same model, to see whether there is any discrepancy between those. I'll post here if you guys are interested in the results.