Open shockjiang opened 6 years ago
Couldn't you at least count along rows in parallel and then do a serial reduction on the resulting 1D vector?
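To make the suggestion concrete, here is a minimal sketch in plain C++ (standard threads, not Halide) of the two stages: a parallel per-row count, then a serial reduction over the resulting 1D vector. The function name `countAboveThreshold`, its parameters, and the row-major `uint8_t` layout are assumptions for illustration only; none of them come from Halide or from the original post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper (not a Halide API): counts the pixels greater than
// `threshold` in a row-major image of size width x height.
long countAboveThreshold(const std::vector<std::uint8_t>& image,
                         std::size_t width, std::size_t height,
                         std::uint8_t threshold, unsigned numThreads) {
    assert(image.size() == width * height);
    assert(numThreads > 0);

    // Stage 1: one partial count per row. Each row writes only its own slot,
    // so the threads never update a shared counter.
    std::vector<long> rowCounts(height, 0);
    auto countRows = [&](std::size_t begin, std::size_t end) {
        for (std::size_t y = begin; y < end; ++y) {
            const std::uint8_t* row = image.data() + y * width;
            rowCounts[y] = std::count_if(
                row, row + width,
                [threshold](std::uint8_t v) { return v > threshold; });
        }
    };

    // Hand each worker a contiguous block of rows.
    std::vector<std::thread> workers;
    const std::size_t rowsPerThread = (height + numThreads - 1) / numThreads;
    for (std::size_t begin = 0; begin < height; begin += rowsPerThread) {
        workers.emplace_back(countRows, begin,
                             std::min(begin + rowsPerThread, height));
    }
    for (std::thread& t : workers) t.join();

    // Stage 2: serial reduction over the 1D vector of per-row counts.
    return std::accumulate(rowCounts.begin(), rowCounts.end(), 0L);
}
```

The serial stage touches only `height` values, so it is usually cheap next to the per-pixel work in stage 1. No global shared counter is needed because every partial result has its own storage.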
Halide::Var x("x"), y("y");
Halide::RDom strip(0, 16);
Halide::Func binarized("binarized");
binarized(x,y) = Halide::select( input(x, y) > 10, 1, 0);
Halide::Func binarizedWithBC = Halide::BoundaryConditions::constant_exterior(binarized, 0, 0, imageWidth, 0, imageHeight);
Halide::Func count1, count2, count3;
count1(x, y) = Halide::sum(binarizedWithBC(strip.x + 16*x, y));
count1.parallel(y);
count2(x, y) = Halide::sum(count1(strip.x + 16*x, y));
count2.parallel(y);
count3(x, y) = Halide::sum(count2(strip.x + 16*x, y));
count3.parallel(y);
Halide::RDom finalR(0, imageWidth / 16 / 16 / 16 + 1, 0, imageHeight);
Halide::Expr finalCount = Halide::sum(count3(finalR.x, finalR.y));
Probably crude, but fast enough.
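As a sanity check on the strip arithmetic above, the following plain C++ sketch (not Halide) mimics the three levels of 16-wide sums on a single row and confirms that at most `width / 16 / 16 / 16 + 1` entries are left for the final reduction. The helper names `sumStripsOf16` and `countRowByStrips` are hypothetical. Treating out-of-range values as zero stands in for the constant-exterior boundary condition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sums each strip of 16 neighbours. Positions past the
// end of `in` contribute nothing, which matches padding with zeros.
std::vector<long> sumStripsOf16(const std::vector<long>& in) {
    const std::size_t outSize = (in.size() + 15) / 16;  // ceil(size / 16)
    std::vector<long> out(outSize, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i / 16] += in[i];
    }
    return out;
}

// Counts the values above `threshold` in one row using three levels of
// strip sums followed by a short final sum.
long countRowByStrips(const std::vector<std::uint8_t>& row,
                      std::uint8_t threshold) {
    // Level 0: binarize, as in select(input > 10, 1, 0).
    std::vector<long> binarized(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) {
        binarized[i] = row[i] > threshold ? 1 : 0;
    }

    // Levels 1 to 3: each level shrinks the row by a factor of 16.
    const std::vector<long> count1 = sumStripsOf16(binarized);
    const std::vector<long> count2 = sumStripsOf16(count1);
    const std::vector<long> count3 = sumStripsOf16(count2);

    // ceil(width / 4096) never exceeds width / 4096 + 1 (integer division),
    // so a final reduction of that extent covers every remaining entry.
    assert(count3.size() <= row.size() / 4096 + 1);

    long total = 0;
    for (long c : count3) total += c;
    return total;
}
```

Each level only regroups partial sums, so the total equals a direct count of the row. The `+ 1` in the final extent covers widths that are not a multiple of 4096.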
The parallel histogram section in the Scheduling FAQ shows an approach similar to what @MDBrothers mentioned. You could adapt that too, since you essentially need 'one bucket'.
I agree with @ashishUthama. I think the Halide scheduling example and the associative reduction tutorial (http://halide-lang.org/tutorials/tutorial_lesson_18_parallel_associative_reductions.html) are both related to your problem. I'm always pleasantly surprised when almost anything reasonable works, scheduling-wise.
I'm trying to define a global shared variable that counts the non-zero elements in an input, like this:
counter = 0
counter += select(input[x,y] > 10, 1, 0)
However, this is very hard in Halide. Is there any global shared variable that serves this purpose?