Optimized topological sorting for the graph

kparichay commented 3 years ago

Given a graph with multiple independent paths between two nodes, topological sort can return multiple solutions. All these solutions are valid by themselves. However, they can result in different peak and average memory consumption.

Consider the example graph below (each character represents a node in the graph).

[Figure: resnet_bottleneck_block] *The image has been borrowed from https://arxiv.org/pdf/1812.01187v2.pdf.

Assume the input requires 100 units of memory (T1 = 100).

Node L applies a 1x1 convolution, keeping its output memory requirement the same as its input (T2 = 100).
Node M applies a convolution with stride 2, reducing the feature size by 4x while keeping the number of channels the same. The corresponding output memory size reduces by 4x (T3 = 25).
Node N applies a convolution with 4x output channels, bringing the output memory requirement back to 100 (T4 = 100).
Node P applies a convolution with stride 2 and 4x output channels to the input, so its output memory requirement equals that of the input (T5 = 100).
Finally, the element-wise add operates in-place and reuses the memory, making its output memory requirement 0 (T6 = 0).

Now, consider two topological sorts:

Sort 1: L -> M -> N -> P
Sort 2: L -> M -> P -> N

Both of the above sorts are valid. However, their peak memory requirements differ: the peak memory requirement for Sort 1 is 300, while it is only 225 for Sort 2. Note: this is only for inference. Training can have very different memory requirements.
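The comparison above can be checked with a small liveness simulation. This is a minimal sketch, not nntrainer code: the names `SIZES`, `NODES`, and `memory_per_step` are hypothetical, and it assumes a tensor is held from the step where it is produced through the step of its last consumer, with the final Add appended to each sort.

```python
# Tensor sizes from the example above (arbitrary memory units).
SIZES = {"T1": 100, "T2": 100, "T3": 25, "T4": 100, "T5": 100, "T6": 0}

# node -> (input tensors, output tensor) for the bottleneck example.
NODES = {
    "L": (["T1"], "T2"),
    "M": (["T2"], "T3"),
    "N": (["T3"], "T4"),
    "P": (["T1"], "T5"),
    "Add": (["T4", "T5"], "T6"),
}


def memory_per_step(order):
    """Return the memory held while each node in `order` runs."""
    step = {name: i for i, name in enumerate(order)}
    # A tensor is born when its producer runs; graph inputs exist from step -1.
    born = {out: step[name] for name, (_, out) in NODES.items()}
    # A tensor stays live through its last consumer's step; a tensor with no
    # consumer is only counted at the step that produces it.
    dies = dict(born)
    for name, (inputs, _) in NODES.items():
        for t in inputs:
            born.setdefault(t, -1)
            dies[t] = max(dies.get(t, -1), step[name])
    return [
        sum(SIZES[t] for t in SIZES if born[t] <= i <= dies[t])
        for i in range(len(order))
    ]


sort1 = memory_per_step(["L", "M", "N", "P", "Add"])
sort2 = memory_per_step(["L", "M", "P", "N", "Add"])
print(sort1, max(sort1))  # [200, 225, 225, 300, 200] 300
print(sort2, max(sort2))  # [200, 225, 225, 225, 200] 225
```

Under these assumptions, the only difference between the two sorts is whether T1 is still live when T4 has already been produced.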

Note that this explains only one case in the ResNet architecture, and there are more cases in ResNet itself. There are many more scenarios:

With other architectures
While training

This leaves a lot of room for memory optimization to reduce the peak memory consumption of model execution and training.

Solution

We need to find the topological sort that minimizes peak memory consumption, given a model graph (we can start by optimizing for specific models and later generalize to arbitrary graphs) and a mode of execution (inference or training).
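For a graph as small as the example, one way to find such a sort is exhaustive search. The sketch below is only an illustration under stated assumptions, not a proposed implementation: `is_valid`, `peak_memory`, and `best_order` are hypothetical names, the liveness model (a tensor is held from its producer's step through its last consumer's step) is assumed, and the search is factorial in the number of nodes, so it does not scale to a full model.

```python
from itertools import permutations

# Example graph from this issue: node -> (input tensors, output tensor).
NODES = {
    "L": (["T1"], "T2"),
    "M": (["T2"], "T3"),
    "N": (["T3"], "T4"),
    "P": (["T1"], "T5"),
    "Add": (["T4", "T5"], "T6"),
}
SIZES = {"T1": 100, "T2": 100, "T3": 25, "T4": 100, "T5": 100, "T6": 0}


def is_valid(order, nodes):
    """True if every tensor is produced before it is consumed."""
    produced_by_nodes = {out for _, out in nodes.values()}
    # Tensors that no node produces are graph inputs, available from the start.
    available = {t for ins, _ in nodes.values() for t in ins} - produced_by_nodes
    for name in order:
        inputs, output = nodes[name]
        if any(t not in available for t in inputs):
            return False
        available.add(output)
    return True


def peak_memory(order, nodes, sizes):
    """Peak of the per-step sum of live tensor sizes for one execution order."""
    step = {name: i for i, name in enumerate(order)}
    born = {out: step[name] for name, (_, out) in nodes.items()}
    dies = dict(born)  # unconsumed outputs are counted only when produced
    for name, (inputs, _) in nodes.items():
        for t in inputs:
            born.setdefault(t, -1)  # graph inputs exist before step 0
            dies[t] = max(dies.get(t, -1), step[name])
    return max(
        sum(sizes[t] for t in born if born[t] <= i <= dies[t])
        for i in range(len(order))
    )


def best_order(nodes, sizes):
    """Brute force: try every valid topological order, keep the lowest peak."""
    candidates = (o for o in permutations(nodes) if is_valid(o, nodes))
    return min(candidates, key=lambda o: peak_memory(o, nodes, sizes))


order = best_order(NODES, SIZES)
print(order, peak_memory(order, NODES, SIZES))
```

For a generic graph, a heuristic or a search with pruning would be needed in place of the enumeration; which one to use is left open here.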

Calculation Notes:

Calculating peak memory requirements for Sort 1:
Node operating	Tensors to store	Memory requirement	Peak so far
L	T1, T2	100 + 100 = 200	200
M	T1, T2, T3	100 + 100 + 25 = 225	225
N	T1, T3, T4	100 + 25 + 100 = 225	225
P	T1, T5, T4	100 + 100 + 100 = 300	300
Add	T4, T5, T6	100 + 100 + 0 = 200	300

Calculating peak memory requirements for Sort 2:
Node operating	Tensors to store	Memory requirement	Peak so far
L	T1, T2	100 + 100 = 200	200
M	T1, T2, T3	100 + 100 + 25 = 225	225
P	T1, T5, T3	100 + 100 + 25 = 225	225
N	T5, T3, T4	100 + 25 + 100 = 225	225
Add	T4, T5, T6	100 + 100 + 0 = 200	225
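The row-by-row calculation in both tables can be stated compactly. The notation here is an assumption introduced for illustration: $\sigma$ is an execution order, $s(t)$ is the size of tensor $t$, and $\mathrm{live}_\sigma(i)$ is the set of tensors that must be stored while the $i$-th node of $\sigma$ operates.

```latex
M_\sigma(i) = \sum_{t \in \mathrm{live}_\sigma(i)} s(t),
\qquad
\mathrm{peak}(\sigma) = \max_{i} M_\sigma(i)

\mathrm{peak}(\text{Sort 1}) = \max(200,\ 225,\ 225,\ 300,\ 200) = 300,
\qquad
\mathrm{peak}(\text{Sort 2}) = \max(200,\ 225,\ 225,\ 225,\ 200) = 225
```

The optimization target is then the valid topological order $\sigma$ with the smallest $\mathrm{peak}(\sigma)$.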

taos-ci commented 3 years ago

:octocat: cibot: Thank you for posting issue #1126. The person in charge will reply soon.

lhs8928 commented 3 years ago

[Report] Peak memory of resnet50

Analysis of the peak memory consumption of the torchvision resnet50 model (refer: https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py)

In resnet there are two types of bottleneck: those that contain a downsample layer and those that do not. Reordering the layers in the forward pass can only reduce peak memory in a bottleneck that contains a downsample layer, so only those are considered here. In resnet50, there are 4 bottlenecks that contain a downsample layer.

By reordering, we can reduce memory consumption during the forward pass in all of them except the first bottleneck. However, the memory consumption of the first bottleneck is larger than that of the remaining bottlenecks, and the model's peak is the maximum over all of them, so it seems that peak memory will not decrease even if the layers are reordered in the forward pass.

jijoongmoon commented 3 years ago

I think that even if we cannot reduce the peak memory for resnet50, this kind of optimization is still needed. It is definitely meaningful: we can reduce the memory consumption at certain points during inference. We also have to calculate this for training.

kparichay commented 3 years ago

@lhs8928 Thanks for your help and insights. @lhs8928 @jijoongmoon Let's think about whether training optimizations can be done in a similar fashion.

nnstreamer / nntrainer

Optimized topological sorting for the graph #1126
