hughperkins / cltorch

An OpenCL backend for torch.

Workaround for torch.sort()? #62

Closed · coodoo closed this issue 8 years ago

coodoo commented 8 years ago

I tried to run the following test script to make predictions with an OpenCL-trained model:

mlp:cl()  -- move the model to the GPU (OpenCL)
local outputs = mlp:forward(inputs[1]):exp()
local probs, idx = torch.sort( outputs, true )  -- descending sort

but got this error: torch.ClTensor does not implement torch.sort(). The current workaround is to move the model back to the CPU using mlp:float(), but that quickly becomes rather tedious. Just wondering if there's a better way to do it? Maybe a pure OpenCL way of sorting?

hughperkins commented 8 years ago

Hey! So, looking at the CUDA version, there are two implementations: one using Thrust, one without:

  if (sliceSize <= 2048) {
    // Fill `indices` (the values) with the
    // slice-relative index.
    THCudaTensor_fillSliceWithIndex(state, indices, dim);

    // We sort k/v pairs in-place; copy unsorted input to output
    THCudaTensor_copy(state, sorted, input);

    // Sort using our in-place k/v kernel that supports arbitrary
    // layout
    THCudaTensor_sortKeyValueInplace(state, sorted, indices, dim, order);
  } else {
    // Otherwise, fall back upon Thrust, which handles all other cases
    // (potentially slowly, with extra copies/memory allocations)
    sortViaThrust(state, sorted, indices, input, dim, (bool) order);
  }

https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorSort.cu#L447-L462

Thrust isn't available on OpenCL, so we'd need to implement the sort ourselves somehow. But it looks like there is already a CUDA kernel available for slices up to size 2048, which would plausibly be straightforward to port to OpenCL. To what extent would slice sizes up to 2048 meet your requirements?

(Edit: I think slice size means how many elements you want to sort:

THC_API void THCudaTensor_sort(THCState* state,
                               THCudaTensor *sorted,
                               THCudaTensor *indices,
                               THCudaTensor *input,
                               int dim, int order) {
  THAssert(THCudaTensor_checkGPU(state, 3, sorted, indices, input));
  THCCheckTensorDims(state, sorted, 2);
  THCCheckTensorDims(state, indices, 3);
  THCCheckTensorDims(state, input, 4);

  // Make sure sufficient output space is allocated
  THCudaTensor_resizeAs(state, sorted, input);
  THCudaTensor_resizeAs(state, indices, input);

  // How large are the slices that we are sorting?
  long totalElements = THCudaTensor_nElement(state, input);
  long sliceSize = THCudaTensor_size(state, input, dim);

  // We're using THCudaTensor to write out indices, so if the slice
  // size that we're sorting has more elements than can be
  // represented in fp32, warn the user
  // FIXME: this isn't a real restriction of either our code or of
  // Thrust, but we have to switch to a CUDA long tensor to support
  // larger slice sizes. Otherwise the indices will contain garbage.
  THArgCheck(sliceSize <= (long) FLOAT32_MAX_CONSECUTIVE_INT, 5,
             "The sort dimension exceeds single-precision float "
             "consecutive integer precision size (2^24), since float "
             "is used for indices");

)
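
To make that concrete, here's my reading of "slice size" in Lua terms (an illustration only, not code from cutorch):

local t = torch.FloatTensor(10, 2)
-- sorting along dim 2: 10 slices (one per row), each of size t:size(2) == 2
-- sorting along dim 1: 2 slices (one per column), each of size t:size(1) == 10
print(t:size(2))  -- 2, comfortably under the 2048 per-slice kernel limit

So the 2048 limit applies per slice, along the dimension being sorted, not to the tensor's total element count.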

coodoo commented 8 years ago

Haha, fortunately this is a sentiment analysis task which only has 2 final classes (happy or sad, 1 or 2), and each batch has 10 sentences, resulting in a [torch.Tensor of size 10x2] (does this mean sorting 20 elements?). If so, 2048 should be more than enough 😆

hughperkins commented 8 years ago

I implemented sort, but for now it only works if you sort along the first dimension :-P I was hoping that would be enough, but re-reading your comment, it seems you probably need to sort along the second dimension?

hughperkins commented 8 years ago

(Actually, I'm not even sure it works for the outermost dim. I think I need to do a bit more testing/debugging.)

coodoo commented 8 years ago

Here's what the output looks like; I then sort it to get the final classification results by reading idx[i][1].

local outputs = mlp:forward(inputs[1]):exp()
 0.9999  0.0001
 0.0137  0.9862
 0.0002  0.9998
 0.9999  0.0001
 0.0000  1.0000
 0.0059  0.9941
 1.0000  0.0000
 0.9984  0.0015
 0.9949  0.0051
 0.0019  0.9980
[torch.FloatTensor of size 10x2]

-- then sort the outputs to get final result by reading idx
local p, idx = torch.sort( outputs, true )

Two questions:

  1. Is there another way to get the final classification results, other than exp() then sort()?
  2. Instead of moving the whole model back to the CPU, could I just do outputs:float() to move the outputs (which are a lot smaller) to the CPU and then sort there?

hughperkins commented 8 years ago

Instead of moving the whole model back to the CPU, could I just do outputs:float() to move the outputs (which are a lot smaller) to the CPU and then sort there?

Yes :-)
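
Something like this, reusing the names from your snippet (a minimal sketch):

-- keep the model and the forward pass on the GPU...
local outputs = mlp:forward(inputs[1]):exp()
-- ...and copy only the small 10x2 outputs tensor back for sorting
local outputsCpu = outputs:float()  -- ClTensor -> FloatTensor
local probs, idx = torch.sort(outputsCpu, true)  -- descending sort on CPU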

hughperkins commented 8 years ago

Is there another way to get the final classification results, other than exp() then sort()?

If you want top-5, probably not. If you want top-1, then you can use max
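
For example, something like this (a sketch, assuming cltorch supports the dimension form of max, as the standard torch API does):

-- top-1 class per row, staying on the GPU; `outputs` is the 10x2
-- tensor from earlier in the thread
local maxProbs, maxIdx = torch.max(outputs, 2)  -- max along dim 2
-- maxIdx[i][1] is the predicted class (1 or 2) for sentence i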

hughperkins commented 8 years ago

(By the way, I'm not sure that you need to do :exp(), since exp is monotonic: exp(a) > exp(b) if and only if a > b, so the ordering, and hence the sort result, is unchanged.)
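
A quick way to convince yourself of that (illustrative numbers only):

local scores = torch.FloatTensor{-1.2, 0.3, -0.5}
local _, idxRaw = torch.sort(scores, true)
local _, idxExp = torch.sort(torch.exp(scores), true)
print(idxRaw)  -- 2, 3, 1
print(idxExp)  -- also 2, 3, 1: exp preserves the ordering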

coodoo commented 8 years ago

Haha, actually I'm curious about exp too; I was just following Coursera's code sample here.

PS: I thought the purpose of exp was to make negative values positive?

hughperkins commented 8 years ago

PS: I thought the purpose of exp was to make negative values positive?

It has that effect, though I'm not sure that is the purpose. max works on negative numbers too: it returns the most positive term. So the max of -2 and -1 is -1; the max of -2, 1, and -1 is 1.

coodoo commented 8 years ago

Yep, that's a reasonable doubt. It's just that I saw a couple of similar examples doing it that way (a combination of exp with sort), so I thought that was the standard way of doing it. Would really love to know if there's any other solution 😁

hughperkins commented 8 years ago

I dunno. I always use max :-D E.g., see https://github.com/hughperkins/pytorch/blob/master/examples/luamodel/torch_model.lua#L135

coodoo commented 8 years ago

Hey, that's nice! Just verified that max is indeed faster and easier than exp and sort, not to mention everything can be done on the GPU. Thanks for pointing me in the right direction!

Btw, max also renders this issue irrelevant, so please feel free to close it if you don't have the bandwidth to fix sort in OpenCL.

hughperkins commented 8 years ago

Ok great!

hughperkins commented 8 years ago

Note: sort is working now :-)
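
For reference, assuming the new sort follows the standard torch.sort API, the original snippet should now run entirely on the GPU:

mlp:cl()
local outputs = mlp:forward(inputs[1]):exp()
local probs, idx = torch.sort(outputs, true)  -- no :float() copy needed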