jiazhihao / TASO

The Tensor Algebra SuperOptimizer for Deep Learning

The optimized graph has no BN layers and differs from the source ONNX graph #31

Open XiaotaoChen opened 4 years ago

XiaotaoChen commented 4 years ago

Script:

python examples/test_onnx.py -f convert_mx_onnx/mx_resnet18.onnx

I ran the above script with the provided Docker image. The ONNX file convert_mx_onnx/mx_resnet18.onnx is my ResNet-18 converted from MXNet. The source graph is shown below: source graph

But the optimized graph is obviously different from the source graph: it has no BN layers and has multiple outputs.

optimized graph

The graph created by examples/resnext50.py is normal; it has no BN layers to begin with. Has BN been merged into the Conv layers? Can anyone explain this? Thanks.

Another problem in examples/test_onnx.py

The error info is as below:

Traceback (most recent call last):
  File "examples/test_onnx.py", line 16, in <module>
    print(" original_cost = {}".format(graph.cost()))
AttributeError: 'taso.core.PyGraph' object has no attribute 'cost'
jiazhihao commented 4 years ago

@XiaotaoChen Thanks for your interest in TASO. TASO has an optimization that merges a conv and a following BN into a single conv (by changing the conv's weights), so it is expected that you don't see BN layers in the output graph. But having multiple outputs seems to be a potential bug; can you provide a reference to your ONNX file for us to debug?
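
For reference, here is a minimal NumPy sketch of the general conv+BN folding technique (assumed weight layout; this is an illustration, not TASO's actual code):

    import numpy as np

    def fold_bn_into_conv(W, b, gamma, beta, mean, var, eps=1e-5):
        # W: (out_channels, in_channels, kh, kw), b: (out_channels,)
        # BN(conv(x, W) + b) = conv(x, W * scale) + (b - mean) * scale + beta,
        # where scale is computed per output channel.
        scale = gamma / np.sqrt(var + eps)
        W_folded = W * scale[:, None, None, None]   # rescale each filter
        b_folded = (b - mean) * scale + beta        # absorb the BN shift into the bias
        return W_folded, b_folded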

For the error in examples/test_onnx.py, have you installed the most up-to-date version of TASO? cost is a member function of taso.core.PyGraph: https://github.com/jiazhihao/TASO/blob/master/python/taso/_cython/core.pyx#L170-L171. So it is a bit surprising that you encountered that error.
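
The expected usage (following examples/test_onnx.py; the exact API may differ across versions) is roughly:

    import taso as ts

    graph = ts.load_onnx("convert_mx_onnx/mx_resnet18.onnx")
    print("original_cost = {}".format(graph.cost()))
    new_graph = ts.optimize(graph)
    print("optimized_cost = {}".format(new_graph.cost()))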

XiaotaoChen commented 4 years ago

@jiazhihao Thanks. The graph.cost error should be a problem with my environment; it works fine with your Docker image. The ONNX file is resnet18.onnx. In addition, I compared the outputs of the source ONNX file and the optimized ONNX file: when group==1 the two outputs are the same, but when group>1 the optimized result is obviously wrong. The result for the first block of ResNeXt-50 is below. It shows the conv whose kernel is [4, 2, 3, 3] being converted to a conv whose kernel is [4, 4, 3, 3]. The visualized graphs are here: source and optimized

        cost[Conv2D]: i(1 3 28 28) w(64 3 7 7) s(2 2) p(0) cost(0.0269) total_cost(0.0269)
        cost[Pool2D]: i(1 64 14 14) k(3 3) s(2 2) cost(0.0043) total_cost(0.0312)
        cost[Conv2D]: i(1 64 7 7) w(4 64 1 1) s(1 1) p(0) cost(0.0165) total_cost(0.0477)
        cost[Conv2D]: i(1 4 7 7) w(4 2 3 3) s(1 1) p(0) cost(0.0232) total_cost(0.0709)
        cost[Conv2D]: i(1 4 7 7) w(8 4 1 1) s(1 1) p(0) cost(0.0131) total_cost(0.0840)
        cost[Conv2D]: i(1 64 7 7) w(8 64 1 1) s(1 1) p(0) cost(0.0165) total_cost(0.1005)
        cost[Element]: cost(0.0049) total_cost(0.1054)
        cost[Activation]: mode(8) cost(0.0059) total_cost(0.1113)
        Cost metrics: exe_time(0.1113) flops(0.0072) memory_access(0.5561) kernel_launches(8)

        ===== Start Cost-Based Backtracking Search =====
        [0] cost = 0.1113 bestCost = 0.1113 candidates.size() = 0
        [1] cost = 0.0944 bestCost = 0.0944 candidates.size() = 2
        [2] cost = 0.0829 bestCost = 0.0829 candidates.size() = 2
        [3] cost = 0.0948 bestCost = 0.0829 candidates.size() = 1
        [4] cost = 0.0999 bestCost = 0.0829 candidates.size() = 0
        ===== Finish Cost-Based Backtracking Search =====

        cost[Conv2D]: i(1 3 28 28) w(64 3 7 7) s(2 2) p(0) cost(0.0269) total_cost(0.0269)
        cost[Pool2D]: i(1 64 14 14) k(3 3) s(2 2) cost(0.0043) total_cost(0.0312)
        cost[Conv2D]: i(1 4 7 7) w(8 4 1 1) s(1 1) p(0) cost(0.0131) total_cost(0.0443)
        cost[Element]: cost(0.0049) total_cost(0.0492)
        cost[Activation]: mode(8) cost(0.0059) total_cost(0.0551)
        cost[Conv2D]: i(1 64 7 7) w(12 64 1 1) s(1 1) p(0) cost(0.0160) total_cost(0.0712)
        cost[Split]: numOutputs(2) cost(0.0000) total_cost(0.0712)
        cost[Conv2D]: i(1 4 7 7) w(4 4 3 3) s(1 1) p(0) cost(0.0117) total_cost(0.0828)
        Cost metrics: exe_time(0.0828) flops(0.0072) memory_access(0.5094) kernel_launches(7)
jiazhihao commented 4 years ago

@XiaotaoChen I have pushed a fix for an incorrect substitution in TASO. The bug should be fixed as of commit b1d88347. Please reinstall the TASO runtime by running sudo make install under the build folder.

Converting the convolution kernel from [4, 2, 3, 3] to [4, 4, 3, 3] is a substitution in TASO that changes the number of groups in a convolution. (This is a valid substitution because a convolution with fewer groups can always mimic one with more groups.) TASO discovered that reducing the group number achieves better performance on that GPU device.

XiaotaoChen commented 4 years ago

@jiazhihao Thanks, I understand what you mean about reducing the group number; it is the same as Figure 9 (a, b, c) in your paper. For example:

source conv:
input_data: (1, 64, 128, 128)
group = 32
weight: (64, 2, 3, 3)
output = Convolution(input_data, weight, group)
---------------------------------------------
the optimized conv:
split the input and weight into 4 partitions; each partition is as below:
sub_input_data: (1, 16, 128, 128)
sub_group = 8
sub_weight: (16, 2, 3, 3)
sub_output_i = Convolution(sub_input_data, sub_weight, sub_group)
output = concat([sub_output_0, ..., sub_output_3])

So the optimized conv is equivalent to the source conv. In my issue, the conv has group=2 and kernel=(4, 2, 3, 3); reducing the group number to 1, the optimized conv should be as below (assuming the input data is (1, 4, 14, 14)):

split the input data into 2 partitions; each partition is as below:
sub_input_data: (1, 2, 14, 14)
sub_group = 1
sub_weight: (2, 2, 3, 3)
sub_output_i = Convolution(sub_input_data, sub_weight, sub_group=1)
output = concat([sub_output_0, sub_output_1])

But the optimized conv with group=2, kernel=(4, 2, 3, 3) becomes a conv with group=1, kernel=(4, 4, 3, 3). Neither the parameter count nor the result of the two matches. Did I misunderstand your meaning? I still don't see why the two are equivalent.

jiazhihao commented 4 years ago

The optimization for grouped convolution is slightly different from your understanding. For a convolution with group=2 and kernel=(4, 2, 3, 3), TASO may transform it to a convolution with group=1 and kernel=(4, 4, 3, 3). Note that this introduces additional parameters (from 4×2×3×3 = 72 to 4×4×3×3 = 144), but TASO can preserve mathematical equivalence by setting the extra weights to zero. This is a potential optimization for some grouped convolutions because cuDNN has a much better implementation for specific group sizes.
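
To illustrate the equivalence, here is a small sketch (using PyTorch's F.conv2d for convenience, with the shapes from this issue; this is not TASO's actual implementation) that builds the group=1 kernel from the group=2 kernel by zero-filling the off-diagonal blocks:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 4, 14, 14)          # input from the example above
    w_grouped = torch.randn(4, 2, 3, 3)    # group=2 kernel

    y_grouped = F.conv2d(x, w_grouped, padding=1, groups=2)

    # Dense (group=1) kernel: each group's filters only read their own
    # input channels; all other weights stay zero.
    w_dense = torch.zeros(4, 4, 3, 3)
    w_dense[:2, :2] = w_grouped[:2]        # group 0: outputs 0-1 read inputs 0-1
    w_dense[2:, 2:] = w_grouped[2:]        # group 1: outputs 2-3 read inputs 2-3
    y_dense = F.conv2d(x, w_dense, padding=1, groups=1)

    print(torch.allclose(y_grouped, y_dense, atol=1e-6))   # True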

XiaotaoChen commented 4 years ago

Thanks for your explanation, @jiazhihao. I compared the parameters of the source grouped conv (4, 2, 3, 3) and the optimized normal conv (4, 4, 3, 3) by converting the ONNX model to an MXNet model. I found that all parameters of the normal conv (4, 4, 3, 3) are 0, while the parameters of the source grouped conv are normal. The other parameters are consistent between the source and optimized models. There may be some unknown bug in converting ONNX to an MXNet model.
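
For what it's worth, the optimized weights can also be inspected directly in ONNX, without converting to an MXNet model (the output file name below is hypothetical):

    import onnx
    from onnx import numpy_helper

    model = onnx.load("resnet18_optimized.onnx")   # hypothetical optimized output
    for init in model.graph.initializer:
        w = numpy_helper.to_array(init)
        if w.ndim == 4 and w.shape[2:] == (3, 3):
            print(init.name, w.shape, "all zeros?", bool((w == 0).all()))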

jiazhihao commented 4 years ago

@XiaotaoChen The issue should be fixed as of commit bb1219898. It would be great if you could rerun your model and verify that the bug is fixed.