halide/Halide: a language for fast, portable data-parallel computation
https://halide-lang.org

ParamMap causes performance drop of 50% #2731

Closed mcourteaux closed 1 year ago

mcourteaux commented 6 years ago

In my application, all scenarios running at 8 threads produce these timings:

[timings screenshot from the original issue not reproduced here]

The results are consistent, so it seems that using the ParamMap causes the slowdown, not the reuse of the compiled code itself. (Of course, I first compiled all of the code on another thread, before starting the multithreaded runs.)
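For context, the overall structure of the benchmark is roughly this (a simplified sketch with placeholder sizes and names, not my actual application code):

#include "Halide.h"
#include <thread>
#include <vector>

// Compile the pipeline once up front, then realize it concurrently from
// 8 threads, each thread binding its parameters through a local ParamMap.
void run_benchmark(Halide::Func &func_weights,
                   Halide::Param<float> &input_pi, int iters) {
    Halide::Target t = Halide::get_jit_target_from_environment();
    func_weights.compile_jit(t);  // JIT-compile before spawning threads

    std::vector<std::thread> workers;
    for (int w = 0; w < 8; w++) {
        workers.emplace_back([&]() {
            Halide::Buffer<float> out(4096);
            for (int i = 0; i < iters; i++) {
                Halide::ParamMap params;  // fresh map per call (the slow variant)
                params.set(input_pi, 3.14f);
                func_weights.realize(out, t, params);
            }
        });
    }
    for (auto &worker : workers) worker.join();
}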

For completeness, I transformed this global Params code:

internal->input_pi.set(pi);
internal->input_c.set(normalization);
internal->input_llt.set(Buffer<float>(Q_buffer, num, Dims));

Buffer<float> w_buffer(output_w_buffer, num);
internal->func_weights.realize(w_buffer);

into

Buffer<float> llt_buffer(Q_buffer, num, Dims);
ParamMap params;
params.set(internal->input_pi, pi);
params.set(internal->input_c, normalization);
params.set(internal->input_llt, llt_buffer);

Buffer<float> w_buffer(output_w_buffer, num);
Target t = get_jit_target_from_environment();
internal->func_weights.realize(w_buffer, t, params);

where the internal struct contains this:

Halide::Param<float> input_c{"input_c"};
Halide::Param<float> input_pi{"input_pi"};
Halide::ImageParam input_llt{Halide::type_of<float>(), 2, "input_llt"};
Halide::Func func_weights{"weights"};

zvookin commented 6 years ago

You'll need to provide a small, complete example that reproduces the issue. I expect something else is wrong. Here is an example that indicates param_map performance is as intended: https://gist.github.com/zvookin/d9b5fb11326f5f1075bfbe59015375a3 .

zvookin commented 6 years ago

Any further info? Would love to help, but don't have any thoughts on how these results could come about.

mcourteaux commented 6 years ago

I was working on something else for some time; I'll get back to this soon. Thank you very much for your concern! I've been digging through the code to find where ParamMap is used and what could potentially cause the issue, but I can't see anything.

Not sure if I will be able to construct a minimal example, but I'll try.

mcourteaux commented 6 years ago

In the meantime, here is a copy-paste of the pipeline; not sure if it will be of any use. It computes multivariate Gaussian probability densities (by exponentiating the log-density).

void HalideMultivariateGaussian__Internal::compile(size_t Dims) {
    /*
     * LLT Buffer: (sample, dimension)
     * dot(Q, Q, 2): (sample)
     *
     * LLT:
     *      samples--->
     * dim  s1_1 s2_1 s3_1 s4_1 s5_1 s6_1 s7_1 s8_1
     *   |  s1_2 s2_2 s3_2 s4_2 s5_2 s6_2 s7_2 s8_2
     *   |  s1_3 s2_3 s3_3 s4_3 s5_3 s6_3 s7_3 s8_3
     *   |  s1_4 s2_4 s3_4 s4_4 s5_4 s6_4 s7_4 s8_4
     *
     * Dot:
     *      s1   s2   s3   s4   s5   s6   s7   s8
     */
    using namespace Halide;

    vecsize =
        Halide::get_target_from_environment().natural_vector_size<float>();

    printf("Prepare Expectation...\n");
    Var x{"sample"};

    Func dot{"dot"};
    RDom dotDom{0, (int)Dims};
    // Sum of squares along the dimension axis. The select() folds the
    // initialization into the update: on the first reduction step
    // (dotDom.x == 0) the previous value is replaced by 0 before the first
    // product is added, so the undef() initial value is never read.
    dot(x) = undef<float>();
    dot(x) = select(dotDom.x == 0, Expr(float(0.0)), dot(x)) +
             input_llt(x, dotDom.x) * input_llt(x, dotDom.x);

    Expr e = (input_c + dot(x)) * Expr(float(-0.5));

    // fast_exp only works for float32.
    if (Halide::type_of<float>() == Float(32)) {
        func_weights(x) = fast_exp(e) * input_pi;
    } else {
        func_weights(x) = exp(e) * input_pi;
    }

    /* Schedule */
    input_llt.dim(0).set_min(0);
    input_llt.dim(0).set_stride(1);
    input_llt.dim(1).set_bounds(0, (int)Dims);

    func_weights.output_buffer().dim(0).set_min(0);

    dot.update().reorder(dotDom.x, x).unroll(dotDom.x).vectorize(x, vecsize);

    Var dot_xo{"dot_xo"}, dot_xi{"dot_xi"};
    func_weights.split(x, dot_xo, dot_xi, vecsize);
    dot.compute_at(func_weights, dot_xo);
    func_weights.vectorize(dot_xi);
    func_weights.store_root();

    printf("JIT Compile Expectation...");
    func_weights.compile_jit();
    printf("  done\n");

    // func_weights.print_loop_nest();
    func_weights.compile_to_lowered_stmt("expectation.html",
                                         {input_c, input_pi, input_llt}, HTML);
}

Dims equals 5 in my scenario. And here is the call to it:

void HalideMultivariateGaussian::pdf(float pi, float normalization,
                                     float *Q_buffer, float *output_w_buffer,
                                     size_t num) {
    if (num % internal->vecsize != 0) {
        throw std::runtime_error(
            "Use a minibatch_size which is a multiple of " +
            std::to_string(internal->vecsize));
    }

    using namespace Halide;

    Buffer<float> llt_buffer(Q_buffer, num, Dims);
    ParamMap params;
    params.set(internal->input_pi, pi);
    params.set(internal->input_c, normalization);
    params.set(internal->input_llt, llt_buffer);

    /*
    internal->input_pi.set(pi);
    internal->input_c.set(normalization);
    internal->input_llt.set(Buffer<float>(Q_buffer, num, Dims));
    */
    Buffer<float> w_buffer(output_w_buffer, num);
    Target t = get_jit_target_from_environment();
    internal->func_weights.realize(w_buffer, t, params);
}

If you magically have enough information with this, then awesome! If not, I'll try to provide a minimal working example. PS: I have 4 cores (2 hyperthreads each): an Intel Core i7 at 2.8 GHz.

zvookin commented 6 years ago

One thing to try is to set the HL_DEBUG_CODEGEN environment variable to e.g. 1 or 2 and see if you're getting output indicating a recompile on each call to realize with the param_map. If you are compiling your own Halide, putting a debug print in Pipeline::invalidate_cache() might be illustrative.
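For example, run the binary with HL_DEBUG_CODEGEN=2 set in the shell, or set it programmatically before anything is compiled (a sketch; the variable is read the first time Halide emits debug output, so it must be set before the first compilation):

#include "Halide.h"
#include <cstdlib>

int main() {
    // Equivalent to running with HL_DEBUG_CODEGEN=2 in the environment.
    // If the compilation cache is being invalidated, the lowering/codegen
    // messages will repeat on every realize() instead of appearing once.
    setenv("HL_DEBUG_CODEGEN", "2", /*overwrite=*/1);

    Halide::Var x("x");
    Halide::Param<float> p("p");
    Halide::Func f("f");
    f(x) = x * p;

    Halide::Target t = Halide::get_jit_target_from_environment();
    Halide::Buffer<float> out(16);
    Halide::ParamMap pm;
    for (int i = 0; i < 3; i++) {
        pm.set(p, float(i));
        // Only the first realize() should trigger compilation; later
        // iterations should produce no further codegen output.
        f.realize(out, t, pm);
    }
}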

zvookin commented 6 years ago

May be the same issue as https://github.com/halide/Halide/issues/2759 . The object instance registry machinery seems to start taking more and more time and then there's a crash, likely due to running out of memory. Though that would require a lot of ParamMap manipulation first. (Steven and I will fix it so ParamMap doesn't interact with the object registry in this way.)

eldakms commented 6 years ago

Maybe this info will be of interest. I am running 15 separately compiled pipelines (on a machine with 24 logical CPUs), sharing only read-only buffers. There is enough computation and not many stores in the function graph.

I also see a significant slowdown compared with the single-threaded scenario (more than 2x). The profiling picture shows a lot of synchronization between threads (the red parts below):

[profiler screenshot: locally created ParamMap]

In particular, ParamMap causes quite a bit of locking. I would need to recompile Halide myself to get debug symbols and pinpoint which Halide functions lead to these locks (probably, as already mentioned, the object registry, which will be fixed soon); unfortunately the official build does not ship PDBs.

One workaround that worked for me was to make the ParamMap/ImageParam a member variable instead of a local one. Here is the picture without the per-realize creation:

[profiler screenshot: ParamMap as a member]

That gave me a significant boost. As you can see, there is still some synchronization you would not expect for independent pipelines, though.

mcourteaux commented 6 years ago

Very nicely done! So the speedup you see comes from avoiding the locally created ParamMap. What exactly do you do? My guess, but please correct me if I'm wrong: create a ParamMap once per thread and store it somewhere, so that the same thread-unique ParamMap object keeps living across multiple invocations of the pipeline?

eldakms commented 6 years ago

Yep, your statement should be correct.

Though it is a bit more complicated for me: I cannot really reuse the pipeline between different threads due to a crash described here, so I keep a single ParamMap per pipeline.

I think all of this should be unnecessary once Zalman fixes the object registry access. It was meant more as info than as an actual solution :)
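In code, the member-ParamMap workaround looks roughly like this (a sketch with names modeled on the pipeline above, not the actual code):

// Keep one long-lived ParamMap per pipeline (or per thread) and only
// re-bind values on each call, instead of constructing a fresh ParamMap
// inside every realize.
struct PipelineState {
    Halide::Param<float> input_pi{"input_pi"};
    Halide::Param<float> input_c{"input_c"};
    Halide::ImageParam input_llt{Halide::type_of<float>(), 2, "input_llt"};
    Halide::Func func_weights{"weights"};
    Halide::ParamMap params;  // member: created once, reused across calls
    Halide::Target target = Halide::get_jit_target_from_environment();
};

void pdf(PipelineState &s, float pi, float normalization,
         Halide::Buffer<float> &llt, Halide::Buffer<float> &out) {
    s.params.set(s.input_pi, pi);  // re-bind values, don't re-create the map
    s.params.set(s.input_c, normalization);
    s.params.set(s.input_llt, llt);
    s.func_weights.realize(out, s.target, s.params);
}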

zvookin commented 6 years ago

The locking is almost certainly the object registry, which should be eliminated in master. (This is why thread-safe smart pointers without compiler knowledge are problematic.)

Generally a ParamMap should be created on the stack right before calling realize. There is, however, a small amount of overhead in building the map, so keeping the ParamMap alive per thread, or across a bunch of calls in a loop, might be noticeably faster for filters that don't take much processing time themselves.
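For the loop case, that amortization is just hoisting the map out of the hot loop (a sketch; func_weights, input_pi, and the buffers are placeholders):

// Construct the ParamMap once and re-bind per iteration, so the per-call
// cost is a set() rather than map construction and destruction.
Halide::ParamMap pm;
Halide::Target t = Halide::get_jit_target_from_environment();
for (int i = 0; i < num_frames; i++) {
    pm.set(input_pi, pi_values[i]);  // only the binding changes per call
    func_weights.realize(outputs[i], t, pm);
}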

I'll leave the issues open until we can confirm fixes, but there's a good chance everything is fixed in master now.

mcourteaux commented 6 years ago

I can't confirm the fix. I downloaded today's nightly (trunk-mac64) and compared the performance of the ParamMap implementation between the previous Halide version and today's nightly. The performance gain with the new version is only 3% to 6%, nowhere near recovering from the original 50% drop I experienced. I will try the suggestion of @eldakms.

mcourteaux commented 6 years ago

Just implemented the suggestion of @eldakms: create one ParamMap per thread, compile once, and reuse that thread-bound ParamMap across several executions. Result: a 15% performance increase. Still not back to where we started. 😕

mcourteaux commented 1 year ago

Going by your latest commit, I guess this can be closed? @steven-johnson

steven-johnson commented 1 year ago

Yes, thanks! (Fixed by https://github.com/halide/Halide/pull/7675)