Open SuTanTank opened 5 years ago
To parallelize a scatter like that, you probably want to either rfactor it (see tutorial lesson 18), or use the atomic() scheduling directive, which lets you parallelize rvars for things like addition even if there's a race by using atomic ops.
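To make the two options concrete, here is a minimal plain C++ sketch (not Halide code) of the two underlying strategies for a parallel scatter-add, using a histogram as the example. The function names `histogram_atomic` and `histogram_partial` are made up for illustration. The first loosely corresponds to what atomic() permits (every thread adds into one shared buffer using atomic ops), and the second loosely corresponds to the idea behind rfactor (each thread reduces into its own partial result, and the partials are merged at the end). How Halide actually lowers either directive may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Strategy 1 (in the spirit of atomic()): all threads scatter into one shared
// histogram; atomic increments make colliding writes safe.
std::vector<int> histogram_atomic(const std::vector<std::uint8_t>& data, int num_threads) {
    assert(num_threads > 0);
    std::vector<std::atomic<int>> bins(256);
    for (auto& b : bins) b.store(0);

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread handles a strided subset of the input elements.
            for (std::size_t i = static_cast<std::size_t>(t); i < data.size();
                 i += static_cast<std::size_t>(num_threads)) {
                bins[data[i]].fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<int> out(256);
    for (int b = 0; b < 256; ++b) out[b] = bins[b].load();
    return out;
}

// Strategy 2 (in the spirit of rfactor): each thread reduces into a private
// partial histogram, so no atomics are needed; partials are summed afterwards.
std::vector<int> histogram_partial(const std::vector<std::uint8_t>& data, int num_threads) {
    assert(num_threads > 0);
    std::vector<std::vector<int>> partial(num_threads, std::vector<int>(256, 0));

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = static_cast<std::size_t>(t); i < data.size();
                 i += static_cast<std::size_t>(num_threads)) {
                partial[t][data[i]] += 1;
            }
        });
    }
    for (auto& w : workers) w.join();

    // Merge step: a serial reduction over the per-thread partial results.
    std::vector<int> out(256, 0);
    for (const auto& p : partial) {
        for (int b = 0; b < 256; ++b) out[b] += p[b];
    }
    return out;
}
```

The partial-result version trades extra memory (one buffer per thread) for contention-free writes, which tends to matter most when many threads hit the same few output locations.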
Thank you, I checked out lesson 18 and it's very helpful. However, I have another question:
My scatter modifies each output pixel according to an input pixel, so an in-place pipeline seems like a better fit. I tried using undef<T>() as the pure definition of the output function, followed by some update rules such as output(x, y) += 1. But it seems the output is always set to 0 by the pure definition, and the result is always 1 no matter what the original value was.
So my question is, is there a proper way to implement an in-place pipeline?
Update: This happens when the output is a tuple.
// example
Func tuple;
Func output_1, output_2;
Var x, y;
output_1(x, y) = undef<float>();
output_2(x, y) = undef<float>();
RDom r(0, 100, 0, 100);
output_1(r.x, r.y) += 0.1f;
output_2(r.x, r.y) += 0.1f;
tuple(x, y) = Tuple(output_1(x, y), output_2(x, y));
// result: output_1 and output_2 are both 0.1f everywhere
After some testing, here is an example that reproduces an unexpected result. It is caused not by the use of Tuple but by an extra output wrapper. So maybe Tuple can't be used with undef<T>()?
auto width = 10;
auto height = 10;
Halide::Var x, y;
Halide::Func foo("foo");
foo(x, y) = Halide::undef<float>();
Halide::RDom r(0, width, 0, height);
foo(r.x, r.y) += 0.1f;
Halide::Func output;
output(x, y) = foo(x, y);
Halide::Buffer<float> ones = Halide::lambda(x, y, 1.f).realize(width, height);
output.realize(ones);
And the result is like this:
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 1.6 ... 2.0 2.1 2.2 2.3 ... 3.0 ... 9.1 9.2 9.3 ... 10
rather than this:
1.1 1.1 1.1 ... 1.1 1.1 1.1 ... 1.1 ... ... 1.1 ... 1.1
Using a Tuple gives a similar result.
I'm trying to implement an algorithm that uses a "splatting" pattern: given an input image input(x, y), a few output pixels output(outx(x, y) + r.x, outy(x, y) + r.y) are modified accordingly, where r is an RDom representing a local window, let's say 5x5.
I managed to get it working with an update definition over a 4D RDom(0, width, 0, height, -2, 5, -2, 5), but it runs slowly and I have no idea how to schedule it properly. In C++, I could just process every input pixel with some obvious parallelism, which is intuitive. But in Halide, it seems I have to schedule over the output domain? Could you give some suggestions or point me to any code examples?
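For reference, the input-domain approach described above might look like the following plain C++ sketch (not Halide code). Several details are assumptions made only for illustration: the names `splat` and `atomic_add` are made up, the target location is simply the input pixel's own coordinates (a stand-in for outx(x, y) and outy(x, y)), the splatted value is the input pixel itself, and writes outside the image are skipped. Neighbouring input pixels write to overlapping 5x5 windows, so the accumulation uses an atomic add.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Atomic float addition via a compare-exchange loop (works before C++20,
// where std::atomic<float>::fetch_add is not available).
inline void atomic_add(std::atomic<float>& target, float value) {
    float old = target.load(std::memory_order_relaxed);
    // On failure, 'old' is refreshed with the current value and we retry.
    while (!target.compare_exchange_weak(old, old + value, std::memory_order_relaxed)) {
    }
}

// Splat every input pixel into a 5x5 window of the output, parallelized over
// input rows. The window offsets -2..2 match RDom(..., -2, 5, -2, 5).
std::vector<float> splat(const std::vector<float>& input, int width, int height,
                         int num_threads) {
    assert(width > 0 && height > 0 && num_threads > 0);
    assert(input.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    std::vector<std::atomic<float>> acc(input.size());
    for (auto& a : acc) a.store(0.0f);

    auto worker = [&](int t) {
        for (int y = t; y < height; y += num_threads) {
            for (int x = 0; x < width; ++x) {
                // Assumption: identity mapping stands in for outx(x, y), outy(x, y).
                const int cx = x;
                const int cy = y;
                for (int dy = -2; dy <= 2; ++dy) {
                    for (int dx = -2; dx <= 2; ++dx) {
                        const int ox = cx + dx;
                        const int oy = cy + dy;
                        if (ox < 0 || ox >= width || oy < 0 || oy >= height) continue;
                        atomic_add(acc[static_cast<std::size_t>(oy) * width + ox],
                                   input[static_cast<std::size_t>(y) * width + x]);
                    }
                }
            }
        }
    };

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) workers.emplace_back(worker, t);
    for (auto& w : workers) w.join();

    std::vector<float> out(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) out[i] = acc[i].load();
    return out;
}
```

With an all-ones 8x8 input, an interior output pixel receives 25 contributions, while a corner pixel receives only 9 because the rest of its window falls outside the image.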