coodoo closed this issue 8 years ago
Hey! So, looking at the CUDA version, there are two implementations: one using Thrust, one without:

```cpp
if (sliceSize <= 2048) {
  // Fill `indices` (the values) with the
  // slice-relative index.
  THCudaTensor_fillSliceWithIndex(state, indices, dim);

  // We sort k/v pairs in-place; copy unsorted input to output
  THCudaTensor_copy(state, sorted, input);

  // Sort using our in-place k/v kernel that supports arbitrary
  // layout
  THCudaTensor_sortKeyValueInplace(state, sorted, indices, dim, order);
} else {
  // Otherwise, fall back upon Thrust, which handles all other cases
  // (potentially slowly, with extra copies/memory allocations)
  sortViaThrust(state, sorted, indices, input, dim, (bool) order);
}
```
https://github.com/torch/cutorch/blob/master/lib/THC/THCTensorSort.cu#L447-L462
Thrust isn't available on OpenCL, so we'd need to implement the sort ourselves somehow. But it looks like there is already a CUDA kernel available for slices up to size 2048, which would plausibly be straightforward to port to OpenCL. To what extent would slice sizes up to 2048 meet your requirements?
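As I understand the code below, the "slice size" is just the tensor's extent along the sort dimension, and every other dimension contributes independent slices. A rough sketch of that bookkeeping (my reading of the cutorch logic, not the actual code):

```python
# Sketch: for a tensor of the given shape sorted along `dim`,
# the slice size is the extent along `dim`, and the number of
# independent slices is the product of the remaining extents.

def slice_info(shape, dim):
    """Return (slice_size, num_slices) for sorting `shape` along `dim`."""
    slice_size = shape[dim]
    num_slices = 1
    for d, extent in enumerate(shape):
        if d != dim:
            num_slices *= extent
    return slice_size, num_slices

# A 10x2 batch of class scores, sorted along dim 1 (the class dim):
print(slice_info((10, 2), 1))  # -> (2, 10): ten slices, each of size 2
```

So the 2048 cutoff applies per slice, not to the total element count.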
(Edit: I think slice size means how many elements you want to sort:

```cpp
THC_API void THCudaTensor_sort(THCState* state,
                               THCudaTensor *sorted,
                               THCudaTensor *indices,
                               THCudaTensor *input,
                               int dim, int order) {
  THAssert(THCudaTensor_checkGPU(state, 3, sorted, indices, input));
  THCCheckTensorDims(state, sorted, 2);
  THCCheckTensorDims(state, indices, 3);
  THCCheckTensorDims(state, input, 4);

  // Make sure sufficient output space is allocated
  THCudaTensor_resizeAs(state, sorted, input);
  THCudaTensor_resizeAs(state, indices, input);

  // How large are the slices that we are sorting?
  long totalElements = THCudaTensor_nElement(state, input);
  long sliceSize = THCudaTensor_size(state, input, dim);

  // We're using THCudaTensor to write out indices, so if the slice
  // size that we're sorting has more elements than can be
  // represented in fp32, warn the user
  // FIXME: this isn't a real restriction of either our code or of
  // Thrust, but we have to switch to a CUDA long tensor to support
  // larger slice sizes. Otherwise the indices will contain garbage.
  THArgCheck(sliceSize <= (long) FLOAT32_MAX_CONSECUTIVE_INT, 5,
             "The sort dimension exceeds single-precision float "
             "consecutive integer precision size (2^24), since float "
             "is used for indices");
```

)
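The 2^24 limit in that assertion comes from float32 having a 24-bit significand: integers above 2^24 are no longer all exactly representable, so float-encoded indices would collide. A quick sanity check, round-tripping through a C float with Python's `struct`:

```python
import struct

def to_float32(x):
    """Round-trip a number through a 32-bit IEEE float."""
    return struct.unpack('f', struct.pack('f', float(x)))[0]

print(to_float32(2**24))      # 16777216.0 -- exactly representable
print(to_float32(2**24 + 1))  # 16777216.0 -- rounds back down, so two
                              # distinct indices become indistinguishable
```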
Haha, fortunately this is a sentiment analysis task which only has two final classes (happy or sad, 1 or 2), and each batch has 10 sentences, resulting in [torch.Tensor 10x2] (does this mean sorting 20 elements?). If so, 2048 should be more than enough 😆
I implemented sort, but for now it only works if you sort along the first dimension :-P I was hoping that would be enough, but re-reading your comment, it seems you probably need it to sort along the second dimension?

(Actually, I'm not even sure it works for the outermost dim. I think I need to do a bit more testing/debugging.)
Here's what the output looks like; I then sort it to get the final classification results by reading idx[i][1]:
```lua
local outputs = mlp:forward(inputs[1]):exp()
```

```
 0.9999  0.0001
 0.0137  0.9862
 0.0002  0.9998
 0.9999  0.0001
 0.0000  1.0000
 0.0059  0.9941
 1.0000  0.0000
 0.9984  0.0015
 0.9949  0.0051
 0.0019  0.9980
[torch.FloatTensor of size 10x2]
```

```lua
-- then sort the outputs to get final result by reading idx
local p, idx = torch.sort(outputs, true)
```
Two questions:

1. Is there any other way to get the final classification results, other than `exp()` then `sort()`?
2. Instead of moving the whole model back to CPU, could I just do `outputs:float()` to move it (which is a lot smaller) to CPU and then sort it?
Yes :-)
> Is there any other way to get the final classification results, other than `exp()` then `sort()`?
If you want top-5, probably not. If you want top-1, then you can use max
(By the way, I'm not sure that you need to do `:exp()`? Since exp is monotonic, if `exp(a) > exp(b)`, then `a > b`.)
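The monotonicity point is easy to check numerically: sorting by the raw scores and sorting by their exponentials produce the same ranking. A tiny self-contained demonstration:

```python
import math

scores = [-2.3, 0.1, -0.7, 1.5]

# Rank the indices by raw score and by exp(score), descending
by_raw = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
by_exp = sorted(range(len(scores)), key=lambda i: math.exp(scores[i]), reverse=True)

print(by_raw == by_exp)  # True: exp is monotonic, so the ranking is identical
```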
Haha, actually I'm curious about `exp` too; I was just following Coursera's code sample here.

PS: I thought the purpose of `exp` is to make something negative become positive?
> PS: I thought the purpose of `exp` is to make something negative become positive?
It has that effect, though I'm not sure that is the purpose. `max` works on negative numbers too: it returns the most positive term. So the max of -2 and -1 is -1; the max of -2, 1, and -1 is 1.
Yep, that's a reasonable doubt. It's just that I saw a couple of similar examples doing it that way (a combination of `exp` with `sort`), so I thought that was the standard way of doing it. Would really love to know if there's any other solution? 😁
I dunno. I always use `max` :-D E.g., see https://github.com/hughperkins/pytorch/blob/master/examples/luamodel/torch_model.lua#L135
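To make the equivalence concrete: for top-1 classification, a plain per-row argmax gives the same class as exp-then-sort, with no exponentiation or sorting needed. An illustrative Python sketch (the data is made up):

```python
import math

outputs = [[0.9999, 0.0001],
           [0.0137, 0.9862],
           [1.0000, 0.0000]]

# Top-1 via "exp then sort" (the Coursera-style approach), 1-based classes
via_sort = [sorted(range(len(r)), key=lambda j: math.exp(r[j]),
                   reverse=True)[0] + 1
            for r in outputs]

# Top-1 via plain argmax -- no exp, no sort
via_max = [max(range(len(r)), key=lambda j: r[j]) + 1 for r in outputs]

print(via_sort == via_max)  # True: same classes either way
```

Argmax is O(n) per slice versus O(n log n) for a sort, though with only two classes the difference is of course negligible; the real win here is avoiding the unimplemented OpenCL sort entirely.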
Hey, that's nice! Just verified that `max` is indeed faster and easier than `exp` and `sort`, not to mention everything can be done on the GPU. Thanks for pointing me in the right direction!
Btw, `max` also renders this issue irrelevant, so please feel free to close it if you don't have the bandwidth to fix `sort` in OpenCL.
Ok great!
Note: `sort` is working now :-)
I tried to run the following test script to make predictions with an OpenCL-trained model, but got this error message:

```
torch.ClTensor does not implement the torch.sort()
```

The current workaround is to move the model back to CPU using `mlp:float()`, but it quickly becomes rather tedious. Just wondering if there's a better way to do it? Maybe a pure OpenCL way of sorting?