torch / cutorch

A CUDA backend for Torch7

transpose causes Apply1 to be 20 times slower #323

Open hughperkins opened 8 years ago

hughperkins commented 8 years ago

Transpose causes Apply1 to be 20 times slower, even though almost all Apply1 ops are invariant to transposition (exceptions such as tril exist, but they are rare). With the following code:

require 'cutorch'
require 'sys'

local side = 8192  -- 8192 x 8192 floats = 256 MB, matching the timings below
local time = nil
for it=1,3 do
  print('==== it', it, '==============')

  a = torch.CudaTensor(side, side)
  collectgarbage()
  a = a:t()
  cutorch.synchronize()
  sys.tic()
  -- start, ie assume the tensor arrives transposed
  a:tanh()
  cutorch.synchronize()
  time = sys.toc()
  print('tanh without re-transpose, size', side * side * 4 / 1024 /1024, 'MB =', time, 'seconds')

  a = torch.CudaTensor(side, side)
  collectgarbage()
  a = a:t()
  cutorch.synchronize()
  sys.tic()
  -- start, ie assume the tensor arrives transposed
  a = a:t()
  a:tanh()
  a = a:t()
  cutorch.synchronize()
  time = sys.toc()
  print('tanh with re-transpose, size', side * side * 4 / 1024 /1024, 'MB =', time, 'seconds')

  a = torch.CudaTensor(side, side)
  collectgarbage()
  a = a:t()
  cutorch.synchronize()
  sys.tic()
  -- start, ie assume the tensor arrives transposed
  a:neg()
  cutorch.synchronize()
  time = sys.toc()
  print('neg without re-transpose, size', side * side * 4 / 1024 /1024, 'MB =', time, 'seconds')

  a = torch.CudaTensor(side, side)
  collectgarbage()
  a = a:t()
  cutorch.synchronize()
  sys.tic()
  -- start, ie assume the tensor arrives transposed
  a = a:t()
  a:neg()
  a = a:t()
  cutorch.synchronize()
  time = sys.toc()
  print('neg with re-transpose, size', side * side * 4 / 1024 /1024, 'MB =', time, 'seconds')
end

Output:

==== it 1   ==============
tanh without re-transpose, size 256 MB =    0.67421507835388    seconds
tanh with re-transpose, size    256 MB =    0.036809921264648   seconds
neg without re-transpose, size  256 MB =    0.68109512329102    seconds
neg with re-transpose, size 256 MB =    0.036545038223267   seconds
==== it 2   ==============
tanh without re-transpose, size 256 MB =    0.67690086364746    seconds
tanh with re-transpose, size    256 MB =    0.036709070205688   seconds
neg without re-transpose, size  256 MB =    0.68494701385498    seconds
neg with re-transpose, size 256 MB =    0.036523103713989   seconds
==== it 3   ==============
tanh without re-transpose, size 256 MB =    0.66083979606628    seconds
tanh with re-transpose, size    256 MB =    0.036779880523682   seconds
neg without re-transpose, size  256 MB =    0.68568801879883    seconds
neg with re-transpose, size 256 MB =    0.036549091339111   seconds

Edit: on a K520, the discrepancy is even more pronounced:

tanh without re-transpose, size 256     MB =    0.17799401283264        seconds
tanh with re-transpose, size    256     MB =    0.0059909820556641      seconds
neg without re-transpose, size  256     MB =    0.17719101905823        seconds
neg with re-transpose, size     256     MB =    0.0050539970397949      seconds

dominikgrewe commented 8 years ago

I guess it's all about the memory access pattern. When the tensor is transposed, threads in a block effectively access elements from the same column, which results in a separate memory fetch for each element, whereas on a non-transposed tensor the threads read consecutive elements, which can be fetched in a single memory transaction.
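
To make that concrete, here is a minimal sketch (not cutorch's actual Apply1 kernel) of the two access patterns for a square side x side float tensor; the only difference is the index arithmetic:

// Contiguous case: consecutive threads touch consecutive addresses, so a warp's
// 32 loads coalesce into a few memory transactions.
__global__ void tanhContig(float* data, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    data[idx] = tanhf(data[idx]);
  }
}

// Transposed view of a side x side tensor: the linear index walks the view in
// row-major order, so neighbouring threads end up `side` floats apart in memory
// and each load turns into its own transaction.
__global__ void tanhTransposedView(float* data, int side) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < side * side) {
    int offset = (idx % side) * side + (idx / side);
    data[offset] = tanhf(data[offset]);
  }
}

In the transposed case each warp generates many more (and mostly wasted) memory transactions, which is where the ~20x gap above comes from.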

hughperkins commented 8 years ago

I guess it's all about the memory access pattern.

Yes, agreed. It seems like it should be possible to fix. I'm vaguely dabbling with it at https://github.com/torch/cutorch/compare/master...ASAPPinc:apply1-auto-transpose?expand=1

wickedfoo commented 8 years ago

For apply1 this makes sense. I'd even change it to find the smallest stride dimension and put that first (since even if there are a small number of holes, there might be fewer transactions per warp to load the memory if the stride is small enough, like 2 or so).
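
A host-side sketch of that reordering, using a hypothetical simplified descriptor rather than the actual THC TensorInfo (which slot counts as "first" depends on how the apply kernel decomposes its linear index; here it is assumed to be the slot the index varies fastest over):

// Hypothetical, simplified dimension descriptor; not the actual THC TensorInfo.
struct DimInfo {
  long sizes[8];
  long strides[8];
  int dims;
};

// Move the smallest-stride dimension into the fastest-varying slot (assumed here
// to be the last one). Element-wise ops are invariant to dimension order, so this
// is safe for Apply1-style kernels.
void putSmallestStrideFastest(DimInfo& t) {
  int smallest = 0;
  for (int d = 1; d < t.dims; ++d) {
    if (t.strides[d] < t.strides[smallest]) smallest = d;
  }
  int last = t.dims - 1;
  long s = t.sizes[smallest];   t.sizes[smallest]   = t.sizes[last];   t.sizes[last]   = s;
  long k = t.strides[smallest]; t.strides[smallest] = t.strides[last]; t.strides[last] = k;
}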

I've also had the thought, which I never acted on, of performing all writes (for apply2, 3, etc.) along the most contiguous dimension first, but that requires tons of template instantiations (since we convert the linear index to an index in the write space first, and then figure out how that maps to the read space). It would also require knowing, for apply2 etc., which parameters are reads and which are writes.

The apply kernels are also not great at the most common case, where everything is completely contiguous. When and if I have time to work on cutorch again, I'll add optimizations for this case (float4, increased ILP, etc.).
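
For reference, a float4 path for the fully contiguous case might look roughly like this (a sketch only, hard-coded to tanh; it assumes the element count is a multiple of 4 and the data is 16-byte aligned, and a real kernel would need a scalar tail loop for the remainder):

// Sketch of a float4-vectorized contiguous apply1.
__global__ void tanhContigFloat4(float4* data, int n4) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n4) {
    float4 v = data[idx];                    // one 16-byte load per thread
    v.x = tanhf(v.x); v.y = tanhf(v.y);
    v.z = tanhf(v.z); v.w = tanhf(v.w);
    data[idx] = v;                           // one 16-byte store per thread
  }
}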

hughperkins commented 8 years ago

float4

I remember reading that for NVIDIA, float4 no longer gives any speed benefit. A quick Google didn't throw up anything, but I'm fairly sure it has come from multiple NVIDIA sources; it's not just some forum message I saw randomly somewhere. I think...

I did have the thought that I never acted on to perform all writes (for apply2, 3, etc.) on the most contiguous dimension first, but that requires tons of template instantiations

Well, a simple transpose would fix it for apply2 and apply3 too, I think, as long as all tensors have the same sizes and strides, which is very often the case.

We're sort of in 'no free lunch' territory here, which is why I opened https://github.com/torch/cutorch/issues/328 to define a reasonably well-defined set of use-cases we want to target. I suspect some use-cases are still missing there (e.g. something that uses LookupTable and sort, I think?). But, for example, I did some analysis at https://groups.google.com/forum/#!topic/torch7/QCeyF0S0owI and we have a huge number of apply3 template expansions, even though we almost never use them...

wickedfoo commented 8 years ago

float4 no longer gives any speed benefits

Not in my tests, at least (on a Tesla K40 and a Maxwell Titan X); it's about 10% faster for many codes, especially if you add increased ILP into the mix, e.g.:

// Fragment from inside a contiguous apply kernel: loadFrom and stores are float4
// pointers into global memory, loadOffset and storeOffset are per-thread offsets,
// and f applies the per-element op to each component of a float4.
constexpr int Unroll = 4;
float4 loads[Unroll];

// Issue all loads up front so they can be in flight simultaneously (ILP).
#pragma unroll
for (int i = 0; i < Unroll; ++i) {
  loads[i] = loadFrom[loadOffset + blockDim.x * i];
}

// Then apply the op and write back.
#pragma unroll
for (int i = 0; i < Unroll; ++i) {
  stores[storeOffset + blockDim.x * i] = f(loads[i]);
}

I think it's a tradeoff between issuing larger memory transactions per warp versus the warp scheduler finding another warp or CTA to execute in its stead.

In my tests, the increased ILP to hide memory latency is definitely a much larger benefit than float4; that gives you 20-50%.

re: removing excessive apply template specializations, I think removing all dims == 3 specialization cases makes sense.

I think the 3 case was there to handle transpositions between non-neighboring dimensions in a 4d array:

1000 100 10 1 -> 1000 1 10 100

which could collapse to 3 runs of contiguous dimensions (1000 1) (10) (100)

but I've never seen such a transposition in practice; usually it's neighboring dimensions, something like:

10000 1000 100 10 -> 10000 1000 10 100 => (10000 1000 10) (100)
10000 1000 100 10 -> 10000 100 1000 10 => (10000 100) (1000 10)
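
For context, a simplified sketch of the usual rule for collapsing neighbouring dimensions (not necessarily the exact logic cutorch uses): dims d and d+1 merge when the outer stride equals the inner size times the inner stride.

// Simplified dimension-collapsing sketch; not the actual THC code.
// Dims d and d+1 merge when strides[d] == sizes[d+1] * strides[d+1], i.e. stepping
// once in dim d is the same as walking all the way through dim d+1.
int collapseDims(int dims, long* sizes, long* strides) {
  int out = 0;  // index of the current collapsed run
  for (int d = 1; d < dims; ++d) {
    if (strides[out] == sizes[d] * strides[d]) {
      sizes[out] *= sizes[d];     // fold dim d into the current run
      strides[out] = strides[d];
    } else {
      ++out;                      // start a new run
      sizes[out] = sizes[d];
      strides[out] = strides[d];
    }
  }
  return out + 1;  // number of dimensions after collapsing
}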

Also, regarding non-contiguous operations, I've thought about writing an in-place transposition kernel using shared memory, which could convert any non-contiguous op with at least one sufficiently large contiguous dimension into something that benefits from fewer memory transactions. However, cutorch doesn't provide any guarantees on the mutability of read-only operands; some other kernel (in a different stream, or perhaps outside of Torch entirely) could be operating on the data. Non-contiguous writes are also a lot slower than non-contiguous reads, so maybe the in-place transposition could be performed only for the output.
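
For what it's worth, the standard tiled shared-memory transpose looks roughly like this out-of-place sketch (the in-place variant is considerably harder); the tile is what turns the strided half of the copy into coalesced accesses:

constexpr int TILE = 32;

// Launch with dim3 block(TILE, TILE), dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE).
__global__ void transposeTiled(const float* in, float* out, int rows, int cols) {
  __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

  int x = blockIdx.x * TILE + threadIdx.x; // column in `in`
  int y = blockIdx.y * TILE + threadIdx.y; // row in `in`
  if (x < cols && y < rows) {
    tile[threadIdx.y][threadIdx.x] = in[y * cols + x];       // coalesced read
  }
  __syncthreads();

  x = blockIdx.y * TILE + threadIdx.x;     // column in `out` (out is cols x rows)
  y = blockIdx.x * TILE + threadIdx.y;     // row in `out`
  if (x < rows && y < cols) {
    out[y * rows + x] = tile[threadIdx.x][threadIdx.y];      // coalesced write
  }
}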

re: sort, we use that a lot internally for NLP-ish tasks. The non-contiguous cases are used a bit more there, but I think the 3 case can be removed too.