Open IceAge666 opened 7 years ago
I solved this by dividing the stage into two. The first stage is computed over r.x and the 30 numbers, so I can parallelize over both r.x and the 30 numbers. This way it does not take much time.
For the update part on CUDA, I seem to have found a bug:
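The two-stage split described above can be sketched in plain C++ (not Halide) to show why it exposes parallelism. This is only an illustration under assumptions: the names `Best`, `column_best` and `two_stage_argmax` are hypothetical, the 4-D function is stored as a flat array with x innermost, `scale` stands in for one of the 30 numbers, and W, H, C, D are all assumed to be at least 1.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: the best product and where it was found.
struct Best {
    float value;
    int x, y, z, w;
};

// Stage 1: for one fixed rx, reduce over the remaining three dimensions.
// Calls for different rx values are independent of each other, so this is
// the part that can be spread across threads or GPU blocks.
Best column_best(const std::vector<float>& f, int W, int H, int C, int D,
                 int rx, float scale) {
    Best best{f[rx] * scale, rx, 0, 0, 0};
    for (int w = 0; w < D; ++w)
        for (int z = 0; z < C; ++z)
            for (int y = 0; y < H; ++y) {
                std::size_t idx =
                    ((static_cast<std::size_t>(w) * C + z) * H + y) * W + rx;
                float v = f[idx] * scale;
                if (v > best.value) best = Best{v, rx, y, z, w};
            }
    return best;
}

// Stage 2: a short serial reduction over the W partial results.
Best two_stage_argmax(const std::vector<float>& f, int W, int H, int C, int D,
                      float scale) {
    std::vector<Best> partial(W);
    for (int rx = 0; rx < W; ++rx)  // each iteration is independent
        partial[rx] = column_best(f, W, H, C, D, rx, scale);

    Best best = partial[0];
    for (int rx = 1; rx < W; ++rx)
        if (partial[rx].value > best.value) best = partial[rx];
    return best;
}
```

In Halide terms, the first stage would presumably be an intermediate Func indexed by the former r.x and by x, with an RDom over the remaining three dimensions, and the second stage a small reduction over that intermediate.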
Func box(x, y, z);
box(x, y, z) = 0;
RDom r(0, 30);
box(x, y, z) = select(input(r)==1, 1, box(x, y, z));
The extents of x and y are 296 and 192.
When I schedule it as:
box.gpu_tile(x, y, xo, yo, xi, yi, 16, 16)
.update(0)
.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
I get an error when I run this generator:
Output buffer f8 is accessed at 303, which is beyond the max (295) in dimension 0
When I change the gpu_tile of update(0) to 8 x 8, the error disappears. Note that 16 x 19 = 304, so the last 16-wide tile runs past 295, while 8 divides 296 evenly. I also tried a plain split, and it shows the same error.
I guess there must be a bug in how the Generator handles the update part.
Try adding TailStrategy::GuardWithIf to your gpu_tile call.
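The arithmetic behind this suggestion can be checked with a small plain C++ sketch (not Halide). As far as I understand the Halide documentation, the default tail strategy for pure variables in an update definition rounds the loop up to a whole number of tiles, which would explain the access at 303. The function names below are hypothetical and only model the two behaviours.

```cpp
#include <cassert>

// Largest index touched when `extent` iterations are split by `factor`
// and the last tile is rounded up to full size (no bounds guard).
int max_index_round_up(int extent, int factor) {
    assert(extent > 0 && factor > 0);
    int tiles = (extent + factor - 1) / factor;  // ceil(extent / factor)
    return tiles * factor - 1;
}

// Largest index touched when every iteration is wrapped in
// `if (x < extent)`, which is the idea behind a guard-with-if tail.
int max_index_guarded(int extent, int factor) {
    assert(extent > 0 && factor > 0);
    int tiles = (extent + factor - 1) / factor;
    int max_seen = -1;
    for (int xo = 0; xo < tiles; ++xo)
        for (int xi = 0; xi < factor; ++xi) {
            int x = xo * factor + xi;
            if (x < extent && x > max_seen) max_seen = x;
        }
    return max_seen;
}
```

With an extent of 296, a factor of 16 rounds up to 19 tiles and reaches index 303, while a factor of 8 divides evenly and stops at 295. The guarded version stops at 295 for either factor. In the schedule itself, the tail strategy would be passed as an extra argument after the two tile sizes in the gpu_tile call on update(0).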
Say I have a buffer containing 30 numbers, and I need to traverse a 4-D function to get the maximum product (or some other operation) for each of these 30 numbers, so I use an RDom to traverse it. Part of my code is:
RDom r(0, width, 0, height, 0, channel, 0, depth);
results(x) = argmax(r, f(r.x, r.y, r.z, r.w) * orgin(x));
where orgin is the buffer containing the 30 numbers. The algorithm itself works, but it takes too much time when the width and height are large (400, 200), especially on CUDA. On the CPU it is not so bad, but on CUDA it takes almost 100 ms. So I wonder whether there is something in Halide I can use to parallelize this RDom.
I tried rfactor, but it doesn't seem to work because this is not that kind of associative reduction. Are there any tricks for the GPU?
Thanks to anyone who knows.