merrymercy / tvm-mali

Optimizing Mobile Deep Learning on ARM GPU with TVM
http://tvmlang.org/2018/01/16/opt-mali-gpu.html
MIT License

about winograd transform matrix #6

Closed: janboeye closed this issue 6 years ago

janboeye commented 6 years ago

hi, @merrymercy

In conv2d.py, the Winograd algorithm applies the G and B transforms as ordinary matrix multiplications, so it does not replace multiplications with additions/subtractions. Is my understanding correct?

Thanks

merrymercy commented 6 years ago

No. I build this constant matrix with a utility function const_array and unroll all the transform matrix multiplications. Then TVM's Simplify pass removes the zero elements. You can print the lowered IR to confirm. I posted the performance comparison in https://github.com/dmlc/tvm/pull/898
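
Roughly, const_array builds the constant matrix as a compute op whose body is a chain of selects over the loop indices. A minimal sketch against the TVM 0.x API of the time (illustrative, the real code lives in topi/python/topi/mali/conv2d.py):

    import numpy as np
    import tvm

    def const_array(data, name):
        """Build a compile-time constant 2-D array as a TVM compute op.

        Every element is a chain of selects over the indices, so once the
        surrounding loops are unrolled each select folds to a literal and
        multiplications by 0 / +-1 can be simplified away.
        """
        row, col = data.shape

        def select_array(i, j):
            now = tvm.const(0.0, "float32")
            for ii in range(row):
                for jj in range(col):
                    now = tvm.select(tvm.all(i == ii, j == jj),
                                     tvm.const(float(data[ii][jj]), "float32"),
                                     now)
            return now

        return tvm.compute(data.shape, select_array, name=name)

    # The 4x3 kernel-transform matrix G of Winograd F(2x2, 3x3)
    G = const_array(np.array([[1.0,  0.0, 0.0],
                              [0.5,  0.5, 0.5],
                              [0.5, -0.5, 0.5],
                              [0.0,  0.0, 1.0]], dtype=np.float32), "G")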

janboeye commented 6 years ago

@merrymercy

Thanks for the explanation. Could you explain how const_array helps to remove the zeros?

produce G {
  for (i, 0, 4) {
    for (j, 0, 3) {
      G[((i*3) + j)] = select(((i == 3) && (j == 2)), 1.000000f, select((((i == 3) && (j == 1)) || ((i == 3) && (j == 0))), 0.000000f, select(((i == 2) && (j == 2)), 0.500000f, select(((i == 2) && (j == 1)), -0.500000f, select((((i == 2) && (j == 0)) || (((i == 1) && (j == 2)) || (((i == 1) && (j == 1)) || ((i == 1) && (j == 0))))), 0.500000f, select((((i == 0) && (j == 2)) || (((i == 0) && (j == 1)) || !((i == 0) && (j == 0)))), 0.000000f, 1.000000f))))))
    }
  }
}

I got the IR, but I do not understand how TVM can remove all the zeros in this select statement.

Thanks

merrymercy commented 6 years ago

You should lower the whole kernel transform https://github.com/dmlc/tvm/blob/5d53f0f9ecb490245f8dba542437b5b70b7ba87d/topi/python/topi/mali/conv2d.py#L566-L571 and unroll the axes eps, nu, r_kh, r_kw. Then these select expressions will be simplified.
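
As a self-contained toy example of the same mechanism (illustrative names, TVM 0.x API, not the actual topi code): multiply a vector by a constant identity matrix written as selects, inline the matrix stage, and unroll both loops; the multiply-by-zero terms then disappear from the lowered IR.

    import tvm

    # A constant 2x2 identity matrix written as nested selects, the same
    # shape of expression that const_array generates
    A = tvm.compute((2, 2), lambda i, j:
            tvm.select(tvm.all(i == 0, j == 0), tvm.const(1.0, "float32"),
            tvm.select(tvm.all(i == 1, j == 1), tvm.const(1.0, "float32"),
                       tvm.const(0.0, "float32"))), name="A")

    x = tvm.placeholder((2,), name="x")
    k = tvm.reduce_axis((0, 2), name="k")
    y = tvm.compute((2,), lambda i: tvm.sum(A[i, k] * x[k], axis=k), name="y")

    s = tvm.create_schedule(y.op)
    s[A].compute_inline()                # fold the selects into y's body
    s[y].unroll(s[y].op.axis[0])         # i becomes a literal in each copy
    s[y].unroll(s[y].op.reduce_axis[0])  # k becomes a literal in each copy

    # With i and k constant, every select folds to a constant, and the
    # Simplify pass drops the terms multiplied by zero
    print(tvm.lower(s, [x, y], simple_mode=True))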

janboeye commented 6 years ago

@merrymercy

I wrote the lowering code as follows:

    # transform kernel
    s[G].compute_inline()
    eps, nu, k, c, kk = s[U].op.axis
    r_kh, r_kw = s[U].op.reduce_axis
    s[U].reorder(k, c, kk, eps, nu, r_kh, r_kw)
    _ = [s[U].unroll(x) for x in [eps, nu, r_kh, r_kw]]
    print("transform kernel lower")
    su = tvm.create_schedule(s[U].op)
    print(tvm.lower(su, [kernel, G], simple_mode=True))

but I got the following IR:

produce G {
  for (i, 0, 4) {
    for (j, 0, 3) {
      G[((i*3) + j)] = select(((i == 3) && (j == 2)), 1.000000f, select((((i == 3) && (j == 1)) || ((i == 3) && (j == 0))), 0.000000f, select(((i == 2) && (j == 2)), 0.500000f, select(((i == 2) && (j == 1)), -0.500000f, select((((i == 2) && (j == 0)) || (((i == 1) && (j == 2)) || (((i == 1) && (j == 1)) || ((i == 1) && (j == 0))))), 0.500000f, select((((i == 0) && (j == 2)) || (((i == 0) && (j == 1)) || !((i == 0) && (j == 0)))), 0.000000f, 1.000000f))))))
    }
  }
}
produce U {
  for (eps, 0, 4) {
    for (nu, 0, 4) {
      for (k, 0, 256) {
        for (c, 0, 1280) {
          for (kk, 0, 4) {
            U[((((((((eps*4) + nu)*256) + k)*1280) + c)*4) + kk)] = 0.000000f
            for (r_kh, 0, 3) {
              for (r_kw, 0, 3) {
                U[((((((((eps*4) + nu)*256) + k)*1280) + c)*4) + kk)] = (U[((((((((eps*4) + nu)*256) + k)*1280) + c)*4) + kk)] + ((weight[(((((((k*5120) + c) + (kk*1280))*3) + r_kh)*3) + r_kw)]*G[((eps*3) + r_kh)])*G[((nu*3) + r_kw)]))
              }
            }
          }
        }
      }
    }
  }
}

Could you help me check why my lowering code is not right?

That said, the generated CUDA code does already have the zeros in G removed, and s[G].compute_inline() is necessary for that.

Thanks

janboeye commented 6 years ago

I modified my lowering code as follows:

    print "transform kernel lower"
    su = tvm.create_schedule(s[U].op)
    su[G].compute_inline()
    eps1, nu1, k1, c1, kk1 = su[U].op.axis
    r_kh1, r_kw1 = su[U].op.reduce_axis
    su[U].reorder(k1,c1, kk1, eps1, nu1, r_kh1, r_kw1)
    _ = [su[U].unroll(x) for x in [eps1, nu1, r_kh1, r_kw1]]
    print(tvm.lower(su, [kernel, G], simple_mode=True))
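
For anyone hitting the same problem: the change that matters is that compute_inline and unroll are now applied to su, the same Schedule object that is passed to tvm.lower. In the earlier attempt they were recorded on s, so the freshly created su carried none of those scheduling decisions.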

With this I get the right IR.

Thanks