pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

Memory Planning Issue/Optimization (for "concat" style operations) #3210

Open rvijayc opened 6 months ago

rvijayc commented 6 months ago

Would it be possible to enhance the existing memory planner to cover situations where intelligent use of memory offsets can eliminate unnecessary copies? I am posting a minimal example below:

import torch
import torch.nn as nn
import torch.export

# author a model.
class Example(nn.Module):
    def __init__(self):
        super(Example, self).__init__()

    def forward(self, x):
        x_sin = torch.sin(x)
        x_cos = torch.cos(x)
        x_combined = torch.cat([x_cos, x_sin])
        out = torch.sigmoid(x_combined)
        return out

model = Example().eval()

# generate a uniform distribution of data.
n_batches = 100
# generate some example input.
x_in = torch.distributions.uniform.Uniform(-1, 1).sample([1, 64])

# export the model.
m_export = torch.export.export(model, (x_in[0,:],))

# convert to edge.
from executorch.exir import EdgeProgramManager, to_edge
m_edge: EdgeProgramManager = to_edge(m_export)

from executorch.exir import ExecutorchBackendConfig, ExecutorchProgramManager
from executorch.exir.passes import MemoryPlanningPass
executorch_program: ExecutorchProgramManager = m_edge.to_executorch(
    ExecutorchBackendConfig(
        passes=[],  # User-defined passes
        memory_planning_pass=MemoryPlanningPass(
            "greedy"
        ),  # Default memory planning pass
    )
)

This produces a graph like below:

[graph image: etorch_module (x → aten.sin / aten.cos → aten.cat → aten.sigmoid)]

And a corresponding memory allocation like below:

+-------+------------+----------------------+----------+---------+---------+-----------+--------+
|   idx | lifetime   | name                 |   mem_id | shape   | dtype   |   mem_ofs |   size |
|-------+------------+----------------------+----------+---------+---------+-----------+--------|
|     0 | [0, 4]     | x                    |        1 | [64]    | float32 |         0 |    256 |
|     1 | [1, 6]     | alloc                |        1 | [64]    | float32 |       512 |    256 |
|     2 | [1, 6]     | aten_sin_default     |        1 | [64]    | float32 |       512 |    256 |
|     3 | [3, 6]     | alloc_1              |        1 | [64]    | float32 |      1024 |    256 |
|     4 | [3, 6]     | aten_cos_default     |        1 | [64]    | float32 |      1024 |    256 |
|     5 | [5, 8]     | alloc_2              |        1 | [128]   | float32 |         0 |    512 |
|     6 | [5, 8]     | aten_cat_default     |        1 | [128]   | float32 |         0 |    512 |
|     7 | [7, 9]     | alloc_3              |        1 | [128]   | float32 |       512 |    512 |
|     8 | [7, 9]     | aten_sigmoid_default |        1 | [128]   | float32 |       512 |    512 |
+-------+------------+----------------------+----------+---------+---------+-----------+--------+

There are two issues I see with this example:

  1. The memory offset of aten.sin should have been 256 instead of 512. I am not sure why we have a hole of 256 bytes between x and aten.sin. The alignment (defaults to 16) is already met even if we use 256.
    • A useful datapoint is that the gap seems to be proportional to the number of users of x.
  2. The aten.cat operation can be completely eliminated if the output of aten.cos were directed to offset 256 and the output of aten.sin to offset 512, which puts them back to back in memory without requiring the explicit copy done by aten.cat (see the illustration below).

In the list above, (1) seems like a bug, while (2) is an enhancement request. Please let me know if I can provide more information.
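
To make (2) concrete, here is a minimal sketch in plain PyTorch (not ExecuTorch API) of the layout I have in mind, using the sizes and offsets from the table above: with aten.cos written at offset 256 and aten.sin at offset 512 of the same arena, the concatenated result already exists contiguously and the cat reduces to a view.

import torch

# A flat float32 "arena" standing in for the planned memory pool.
arena = torch.empty(192, dtype=torch.float32)
x     = arena[0:64]       # byte offset 0,   256 bytes
cos_o = arena[64:128]     # byte offset 256, 256 bytes
sin_o = arena[128:192]    # byte offset 512, 256 bytes

x.uniform_(-1, 1)
torch.cos(x, out=cos_o)
torch.sin(x, out=sin_o)

# With cos and sin placed back to back, the cat output is just a view over
# the two buffers; no copy is needed.
cat_view = arena[64:192]
assert torch.equal(cat_view, torch.cat([torch.cos(x), torch.sin(x)]))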

Version

Collecting environment information...
PyTorch version: 2.4.0.dev20240415+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.29.2
Libc version: glibc-2.31

Python version: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: GRID A100D-10C
Nvidia driver version: 525.60.13
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] executorch==0.2.1.dev0+unknown
[pip3] numpy==1.26.4
[pip3] onnx==1.13.1
[pip3] torch==2.4.0.dev20240415+cpu
[pip3] torchaudio==2.2.0.dev20240415+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.19.0.dev20240415+cpu
[conda] blas 1.0 mkl
[conda] cpuonly 2.0 0 pytorch-nightly
[conda] executorch 0.2.1.dev0+unknown pypi_0 pypi
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch-nightly
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 py310h5f9d8c6_0
[conda] numpy-base 1.26.4 py310hb5e798b_0
[conda] pytorch 2.4.0.dev20240417 py3.10_cpu_0 pytorch-nightly
[conda] pytorch-mutex 1.0 cpu pytorch-nightly
[conda] torch 2.4.0.dev20240415+cpu pypi_0 pypi
[conda] torchaudio 2.2.0.dev20240415+cpu pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.19.0.dev20240415+cpu pypi_0 pypi

JacobSzwejbka commented 6 months ago

@metascroy recently did something similar with view_copy elimination. I think this can definitely work, though we are being pretty careful about introducing views and aliasing right now by default.

For this scenario there's some risk that introducing the aliasing is not sound. It's almost certainly fine in practice, since we have no mutation in the graph today outside of a niche copy_ scenario and the out args of out-variant ops. The reason this risk exists is that torch.cat in PyTorch isn't a view; afaik it copies the memory under the hood.

You could probably write two passes here to handle this:

  1. Replace the cat ops with a custom cat op that does nothing.
  2. Add a custom memory planning pass that looks for your special cat op and ensures its args are memory planned accordingly (a rough sketch of the first pass follows below).

I'd have to think more about whether there's a better way we could express special nodes that we don't actually want in the runtime. Maybe the changes @metascroy is making for no-op views could be extended to allow users to hook in. CC @larryliu0820 @angelayi for thoughts.
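
As a rough sketch of pass (1), assuming a hypothetical custom op my_ns::nop_cat (a made-up name, registered here with plain cat semantics so numerics are unchanged until the planner elides it):

import torch
from torch.library import Library

# Hypothetical op with the same semantics as aten.cat; "my_ns::nop_cat" is a
# made-up name for illustration, not an existing ExecuTorch op.
_lib = Library("my_ns", "DEF")
_lib.define("nop_cat(Tensor[] tensors, int dim=0) -> Tensor")
_lib.impl("nop_cat", lambda tensors, dim=0: torch.cat(tensors, dim),
          "CompositeExplicitAutograd")

def replace_cat_with_nop(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Swap every aten.cat for the no-op variant. A real pass would also check
    # that the cat inputs have no users after the cat and that the layout
    # permits back-to-back placement. Note: after to_edge the target is the
    # edge-dialect variant of aten.cat, so the match below would need updating.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target == torch.ops.aten.cat.default:
            node.target = torch.ops.my_ns.nop_cat.default
    gm.recompile()
    return gm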

JacobSzwejbka commented 6 months ago

The memory offset of aten.sin should have been 256 instead of 512. I am not sure why we have a hole of 256 bytes between x and aten.sin. The alignment (defaults to 16) is already met even if we use 256. A useful datapoint is that the gap seems to be proportional to the number of users of x.

@ydwu4, can you take a look at this?

larryliu0820 commented 6 months ago

The memory offset of aten.sin should have been 256 instead of 512. I am not sure why we have a hole of 256 bytes between x and aten.sin. The alignment (defaults to 16) is already met even if we use 256.

Honestly, this seems like a bug. I suspect that here: https://github.com/pytorch/executorch/blob/main/exir/memory_planning.py#L593-L598 we are annotating the size incorrectly. Need to debug further, though.

ydwu4 commented 6 months ago

I did some checking. The input shapes are correct; the gap is due to the sub-optimality of the greedy algorithm. We need to memory-plan 5 tensors: input, sin, cos, cat, sigmoid. As shown below, each pair is (tensor_allocated_memory, shared object info).

(256, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(256, SharedObject(idx=1, offset=512, size=512, last_used_index=9))
(256, SharedObject(idx=2, offset=1024, size=256, last_used_index=6))
(512, SharedObject(idx=0, offset=0, size=512, last_used_index=8))
(512, SharedObject(idx=1, offset=512, size=512, last_used_index=9))

The algorithm first allocates three 256-byte buffers, then expands the first two of them to 512 bytes (due to the greediness of selecting the best fit so far with only local knowledge), so we end up with 512 + 512 + 256 = 1280 bytes in total. The optimal memory size would be 512 (for storing the output of cat) + 512 (for storing the output of sigmoid) = 1024 bytes. Note that we're not actually very far from the optimum, given that we don't leverage the information that the cat inputs could be folded into a single allocation.
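
As a sanity check on those numbers, a quick liveness sweep over the lifetimes from the table in the issue (just a lower bound on the arena size, not the planner's algorithm) reproduces the 1024-byte optimum:

# (lifetime_start, lifetime_end, size_in_bytes) copied from the table above
specs = {
    "x":       (0, 4, 256),
    "sin":     (1, 6, 256),
    "cos":     (3, 6, 256),
    "cat":     (5, 8, 512),
    "sigmoid": (7, 9, 512),
}
peak = max(
    sum(size for start, end, size in specs.values() if start <= t <= end)
    for t in range(10)
)
print(peak)  # 1024 bytes, vs. 512 + 512 + 256 = 1280 from the greedy plan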

rvijayc commented 5 months ago

@metascroy recently did something similar with view_copy elimination. I think this can definitely work, though we are being pretty careful about introducing views and aliasing right now by default.

For this scenario there's some risk that introducing the aliasing is not sound. It's almost certainly fine in practice, since we have no mutation in the graph today outside of a niche copy_ scenario and the out args of out-variant ops. The reason this risk exists is that torch.cat in PyTorch isn't a view; afaik it copies the memory under the hood.

You could probably write two passes here to handle this:

  1. Replace the cat ops with a custom cat op that does nothing.
  2. Add a custom memory planning pass that looks for your special cat op and ensures its args are memory planned accordingly.

I'd have to think more about whether there's a better way we could express special nodes that we don't actually want in the runtime. Maybe the changes @metascroy is making for no-op views could be extended to allow users to hook in. CC @larryliu0820 @angelayi for thoughts.

Thanks for the suggestions. I am not yet sufficiently familiar with the EXIR memory planner, but I was planning to get familiar with it and do exactly what you are suggesting (i.e., subclass the base memory planner and see if I can add hooks for users to handle special situations; a rough sketch of the idea is below). I will try this when I get some time, but it would be nice if this were a built-in feature (provided all the right conditions are met for such an optimization).
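
For reference, a very rough sketch of what I have in mind. Treat it as pseudocode: the run() signature, the node.meta["spec"] fields (mem_id, mem_offset, allocated_memory) and the nop_cat target are assumptions based on the table above and the earlier suggestion, not verified against the current ExecuTorch API.

from executorch.exir.passes import MemoryPlanningPass

class CatAwareMemoryPlanningPass(MemoryPlanningPass):
    """Run the stock planner, then pin the inputs of each no-op cat so that
    they sit back to back inside the cat output's allocation."""

    def run(self, graph_module, graph_signature=None):
        result = super().run(graph_module, graph_signature)
        for node in graph_module.graph.nodes:
            if node.op == "call_function" and "nop_cat" in str(node.target):
                out_spec = node.meta["spec"]        # planned TensorSpec (assumed)
                offset = out_spec.mem_offset
                for inp in node.args[0]:
                    in_spec = inp.meta["spec"]
                    in_spec.mem_id = out_spec.mem_id
                    in_spec.mem_offset = offset     # place inside the cat buffer
                    offset += in_spec.allocated_memory
        return result

This would then be plugged in via memory_planning_pass=CatAwareMemoryPlanningPass("greedy") in the ExecutorchBackendConfig from the snippet at the top of the issue, in place of the default pass.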