[Open] IceAge666 opened this issue 6 years ago
One thing that would really help is if you could run your code with the `debug` target feature enabled and record the timing numbers reported. That will show where the time is going. (It is probably the CUDA load-module call or its equivalent, but confirming this helps.) Nvidia's stack does try to cache the compilations, so if the same kernel is used again, even in a different process, it should be faster. However, we recently had to turn that caching off on our Windows buildbots because it was causing spurious failures.
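Concretely, that means appending the `debug` feature to the target string you build the pipeline with, e.g. `target=host-cuda-debug` or `target=arm-32-android-opencl-debug`. The debug runtime prints what it is doing for the GPU runtime calls, including elapsed times, so the module compile/load step shows up directly.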
Adding one's own caching likely requires replacing part of the GPU runtime. Halide is designed to allow this, but it is a fairly sophisticated undertaking. We've talked about adding hooks to make caching easier but have yet to figure out the right design. (Partly because making it really work probably involves adding a concept of "plug-ins" at compile time. These would usually just be command lines that get executed, but it is still a bit of work.)
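To give a flavor of what "replacing part of the GPU runtime" looks like: the runtime's functions are weakly linked, so an application can supply its own definitions. Here's a minimal sketch that hands Halide an application-managed OpenCL context and queue; it doesn't cache compiled kernels, but it shows the override mechanism. The exact signatures have varied between Halide versions, so check src/runtime/opencl.cpp in your tree before relying on this:

```cpp
// Sketch only: override Halide's weakly-linked OpenCL context hooks so the
// runtime uses a context/queue the application creates once and owns.
// Verify the signatures against src/runtime/opencl.cpp in your Halide version.
#include <CL/cl.h>

static cl_context app_context = nullptr;      // created once at app startup
static cl_command_queue app_queue = nullptr;  // likewise

extern "C" int halide_acquire_cl_context(void *user_context,
                                         cl_context *ctx,
                                         cl_command_queue *q,
                                         bool create) {
    // Hand Halide our long-lived context and queue.
    *ctx = app_context;
    *q = app_queue;
    return 0;
}

extern "C" int halide_release_cl_context(void *user_context) {
    // The application retains ownership; nothing to release per call.
    return 0;
}
```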
Thank you for your reply. In fact, I want to apply it in a real-time process on mobile, so this compilation cost affects the overall performance greatly, considering that the algorithm itself is much faster than the kernel compilation. BTW, when I set the target to "arm-32-android-opencl" and run it on mobile, it's 20X slower than on the PC (almost 1 s on mobile, same schedule, both OpenCL), and the initialization takes 3.08 s. How could this be? The mobile GPU I use is a Qualcomm Adreno 512; the PC has an Nvidia Quadro K620.
Going by rough numbers pulled off the web in five minutes, the K620 has about 3-4X the peak ALU throughput of the Adreno 512. (There may be another factor of 2 or 4 in there if one is using vectorized ops; the Qualcomm number is "ALU ops" and the Nvidia one is "cores".) The Nvidia GPU consumes 45W, which is likely around 20X what the mobile GPU consumes. And Nvidia actually builds GPUs to do compute and has a software stack to back it up, whereas it's anybody's guess what the rest of the industry is up to in this regard. (Especially on mobile, though Nvidia doesn't really do mobile. It's a bit of an open question whether that's due to power constraints or lack of profitability in mobile devices :-))
With regard to the issue, we do need to figure out something about caching, but your first question should be: "Can the kernel run fast enough after compilation to be worthwhile?" There may well be issues with Halide's compilation, e.g. use of buffers vs. textures, etc. We're starting to see some real use of the GPU backends, and perhaps even some on mobile, so hopefully this will get better.
If you can share details of your code/kernel, that might be helpful.
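A rough way to answer that first question: call the AOT-compiled function a few times and time each call; the first one is dominated by the one-time GPU kernel compile/load. A sketch, where `my_pipeline` stands in for whatever your generator produced and the buffer shapes are placeholders:

```cpp
// Sketch: time first call (includes GPU kernel compile/load) vs. later calls.
// "my_pipeline" is a stand-in for your AOT-generated function; adjust the
// buffer element types and extents to match its actual signature.
#include <chrono>
#include <cstdio>
#include "HalideBuffer.h"
#include "my_pipeline.h"

int main() {
    Halide::Runtime::Buffer<float> in(1024, 1024);
    Halide::Runtime::Buffer<float> out(1024, 1024);

    for (int i = 0; i < 3; i++) {
        auto t0 = std::chrono::steady_clock::now();
        my_pipeline(in, out);
        out.device_sync();  // don't stop the clock before the GPU finishes
        auto t1 = std::chrono::steady_clock::now();
        printf("run %d: %.2f ms\n", i,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return 0;
}
```

If the later runs are fast enough, a pragmatic workaround for the startup cost is to run the pipeline once on a tiny input when the app launches, so the compile happens off the critical path.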
The basic purpose of the code is to interpolate in a 4D space and then multiply the result into another 3D space. The code is shown below:

```cpp
#include "Halide.h"

namespace {

using namespace Halide;
using Halide::ConciseCasts::f32;
using Halide::ConciseCasts::i32;

Var x("x"), y("y"), z("z"), c("c");

class Interpolation_4D : public Halide::Generator<Interpolation_4D> {
public:
    Input<Buffer<float>> tran_param{"grid", 4};
    Input<Buffer<uint8_t>> raw_data{"raw_data", 3};
    Output<Func> output{"output", UInt(8), 3};

    void generate();
    void schedule();
};

void Interpolation_4D::generate() {
    const int s_sigma = 4;
    const int upsample_factor = 8;

    // Normalize the 8-bit input to [0, 1].
    Func raw_data_float("raw_data_float");
    raw_data_float(x, y, c) = raw_data(x, y, c) / 255.0f;

    Func clamp_tran_param("clamp_tran_param");
    clamp_tran_param = BoundaryConditions::repeat_edge(tran_param);

    // Trilinear interpolation of the 4D grid: x and y are the downsampled
    // spatial axes, z is indexed by the input intensity.
    Func interpolated("interpolated");
    {
        Expr big_sigma = s_sigma * upsample_factor;
        Expr yf = cast<float>(y) / big_sigma;
        Expr yi = cast<int>(floor(yf));
        yf -= yi;
        Expr xf = cast<float>(x) / big_sigma;
        Expr xi = cast<int>(floor(xf));
        xf -= xi;
        Expr zf = cast<float>(raw_data(x, y, c)) / big_sigma;
        Expr zi = cast<int>(zf);
        zf -= zi;
        interpolated(x, y, c) =
            lerp(lerp(lerp(clamp_tran_param(xi, yi, zi, c), clamp_tran_param(xi + 1, yi, zi, c), xf),
                      lerp(clamp_tran_param(xi, yi + 1, zi, c), clamp_tran_param(xi + 1, yi + 1, zi, c), xf), yf),
                 lerp(lerp(clamp_tran_param(xi, yi, zi + 1, c), clamp_tran_param(xi + 1, yi, zi + 1, c), xf),
                      lerp(clamp_tran_param(xi, yi + 1, zi + 1, c), clamp_tran_param(xi + 1, yi + 1, zi + 1, c), xf), yf),
                 zf);
    }

    // Each output channel is an affine combination of the input channels,
    // with per-pixel coefficients taken from the interpolated grid.
    output(x, y, c) = cast(UInt(8),
        clamp(interpolated(x, y, 4 * c + 0) * raw_data_float(x, y, 0)
            + interpolated(x, y, 4 * c + 1) * raw_data_float(x, y, 1)
            + interpolated(x, y, 4 * c + 2) * raw_data_float(x, y, 2)
            + interpolated(x, y, 4 * c + 3), 0.0f, 1.0f) * 255.0f);

    Var xo, xi, yo, yi, zo, zi;
    //output_y.compute_root().gpu_tile(x, y, xi, yi, 8, 16);
    //interpolated.compute_at(output_y, xi);
    //output_uv.compute_root().reorder(c, x, y).bound(c, 0, 2).gpu_tile(x, y, xi, yi, 8, 16);
    //interpolated_uv.compute_at(output_uv, xi).unroll(c);
    output.compute_root().reorder(c, x, y).bound(c, 0, 3).unroll(c)
          .gpu_tile(x, y, xi, yi, 8, 16);
}

void Interpolation_4D::schedule() {
    // Scheduling is currently done at the end of generate().
}

}  // namespace

HALIDE_REGISTER_GENERATOR(Interpolation_4D, SuperInterpolation);

int main(int argc, char **argv) {
    return Halide::Internal::generate_filter_main(argc, argv, std::cerr);
}
```
To get a library, from the command line:

```
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/Halide/bin
export DYLD_LIBRARY_PATH=${DYLD_LIBRARY_PATH}:/opt/Halide/bin
g++ main_upload.cpp -g -std=c++11 -O2 -fno-rtti -I /opt/Halide/include -L /opt/Halide/bin -lHalide -lpthread -ldl -o SuperInterpolation
./SuperInterpolation -g SuperInterpolation -o . target=arm-32-android-opencl
```
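To run it on the phone, the generated SuperInterpolation.a then gets linked with an NDK toolchain, roughly like this (toolchain names and paths are illustrative and differ by NDK version; `run.cpp` is a hypothetical driver, and Halide's OpenCL runtime dlopens libOpenCL.so on the device, so no -lOpenCL is needed):

```
arm-linux-androideabi-clang++ -std=c++11 run.cpp SuperInterpolation.a -I /opt/Halide/include -llog -o run
```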
There it is. The data I send in is quite big: the raw data is a high-resolution photo, normally 3024*4032, so I intended to use the Halide OpenCL realization to make it real-time on mobile. So far, when I first input a small picture it takes about 70 ms (PC, both CUDA and OpenCL), and then letting it operate on a big image is fast (PC). On mobile, this costs too much. Thank you again for reading my code :-)
Hi, I have created myGenerator.a and myGenerator.h; the target I set was target=host-cuda-cuda_capability_50. However, when I use this generator, it always takes a long time to initialize/compile the kernel, and it is even slower when I set the target to OpenCL. So, I wonder if there is any way to pre-compile or pre-initialize this generator explicitly, or just to accelerate it. Thanks if anybody can reply.