halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Pipeline limit #2800

Open liwchang opened 6 years ago

liwchang commented 6 years ago

Hi Halide Experts,

I noticed there is some overhead in pipeline realization, so it might not be a good idea to generate many lightweight pipelines and manage their invocation myself. Using a coarse-grained pipeline might be better in terms of performance.

But I also noticed that a pipeline seems to have a limit on how large a Func can be. Exceeding that limit causes compilation to crash. The limit seems related to the number of compute_root calls inside.

So I did a small dummy test.

Func ADD(Func op1, Func op2) {
    Func output;
    Var x;
    output(x) = op1(x) + op2(x);
    return output;
}

Func temp = op1;
for (int i = 0; i < N; i++) {
    temp = ADD(temp, op2);
    temp.compute_root();
}
Pipeline pipeline(temp);

And I figured out that N is around 400 before the Halide compilation crashes. But if I unroll the loop by 2, N can go up to 438. I also tried several different Funcs. All of this seems to imply the limit is related to the number of compute_root/compute_at calls, not the sizes of the buffers. But it is not perfectly linear in the number of compute_root calls, so I cannot model it.

In this dummy test, the code is extremely simple. In the real case, each Func can be very complicated, and I did see some pipelines fail, so I had to manually split them into two pipelines.

Is there a guideline for knowing whether my pipeline is too large? And what is the limit on a pipeline?

Thanks

abadams commented 6 years ago

What do you mean by "crashing" in this context? The thing that'll probably break is running out of stack space inside the simplifier, so you could try compiling your program with a larger stack. But Halide compilation doesn't scale linearly with pipeline size anyway, so very large pipelines should be avoided.
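To make the stack-depth point concrete, here is a toy sketch in plain C++ (my own illustration, not Halide's actual simplifier): a recursive pass over a chain of N stages uses O(N) stack frames, which is why the crash tracks the number of stages rather than the buffer sizes.

```cpp
#include <memory>

// Toy stand-in for an IR node: each stage feeds the next, forming a chain
// like the N-iteration ADD loop above.
struct Stage {
    std::unique_ptr<Stage> input;
};

// A recursive pass (like a simplifier visitor) descends once per stage,
// so stack usage grows linearly with pipeline depth.
int pass_depth(const Stage *s) {
    return s ? 1 + pass_depth(s->input.get()) : 0;
}

// Build a chain of n stages.
std::unique_ptr<Stage> make_chain(int n) {
    std::unique_ptr<Stage> head;
    for (int i = 0; i < n; i++) {
        auto s = std::make_unique<Stage>();
        s->input = std::move(head);
        head = std::move(s);
    }
    return head;
}
```

With real IR nodes each recursion frame is far larger than this toy's, so a few hundred compute_root stages can plausibly exhaust a default-sized thread stack.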

If you're limited by realization overhead, use AOT compilation to avoid it. It doesn't have any of the overhead of the JIT path.


liwchang commented 6 years ago

Hi Andrew,

Let me make sure of one thing about the overhead first. When I said realization overhead, I meant the overhead of the .realize call only; I already call .compile_jit before it, so the overhead doesn't include JIT compilation. In that sense, can AOT compilation still avoid more overhead than JIT?

In the previous dummy example, the overhead is about 3500 ns per pipeline realization. If I use very fine-grained pipelines, this overhead becomes noticeable; we did observe it in real applications. Compared to a fine-grained pipeline, a coarse-grained pipeline is more or less like a coarse-grained function containing inlined fine-grained functions, so I would expect some advantages if we use it properly.
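As a back-of-the-envelope check (using the ~3500 ns figure measured above, which is machine-dependent and not a Halide constant), the fixed cost of splitting work across k separate realizations is simply k times the per-call overhead:

```cpp
// Approximate per-realize overhead measured in this thread (machine-dependent).
constexpr long long kPerRealizeNs = 3500;

// Total fixed overhead per frame of invoking k fine-grained pipelines
// instead of one coarse-grained pipeline.
long long realize_overhead_ns(int k) {
    return static_cast<long long>(k) * kPerRealizeNs;
}
```

For example, 400 stages realized separately would cost about 1.4 ms of pure call overhead per frame, before counting any fusion opportunities lost by splitting.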

As for the crash, that matches my expectation. But a larger stack only solves the current problem temporarily, because the problems I am solving keep getting bigger. I think it is safer to have a guideline to avoid this: if I can estimate how close I am to the limit, I can split the pipeline before it hits the simplifier's limit. Do you have a suggestion for estimating it?

Thanks,

zvookin commented 6 years ago

You can try adding Target::NoAsserts and Target::NoBoundsQuery to the target flags to remove some of the overhead. However, Halide's scheduling advantages decrease for very fine-grained compilation, as there is no opportunity to do fusion or other optimizations across the stages.
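For reference, a minimal sketch of applying those flags (assuming the current Halide API names; not verified against any particular release):

```cpp
#include "Halide.h"

// Sketch: JIT-compile a Func with runtime checks stripped. NoAsserts drops
// runtime assertions; NoBoundsQuery drops the bounds-query call path.
// Both trade safety/flexibility for lower per-call overhead.
void compile_without_checks(Halide::Func f) {
    Halide::Target t = Halide::get_jit_target_from_environment();
    t.set_feature(Halide::Target::NoAsserts);
    t.set_feature(Halide::Target::NoBoundsQuery);
    f.compile_jit(t);
}
```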

-Z-


liwchang commented 6 years ago

Hi Zalman, good to know those flags for AOT. Other than that error checking, AOT should have a similar overhead to JIT, if I compile it with the right target, right?

I have always preferred coarse-grained compilation since it is easier and faster, but it failed from being too coarse-grained (see the discussion above), and I couldn't find a way to detect and avoid that before it happened.

abadams commented 6 years ago

I was indeed talking about the overhead of calling realize. It has to traverse the IR and look for inputs to validate things, which is slow. AOT compilation has close to zero overhead.
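A minimal AOT sketch (assuming the standard Halide AOT entry points; the trivial pipeline and names here are illustrative, not from this thread): compile the pipeline once to a static library, then call the emitted function directly.

```cpp
#include "Halide.h"

// Sketch: ahead-of-time compilation of a trivial pipeline. The emitted
// function is a plain C call taking halide_buffer_t pointers, so invoking
// it skips the realize()/prepare_jit_call_arguments path entirely.
int main() {
    Halide::Var x;
    Halide::ImageParam input(Halide::Int(32), 1);
    Halide::Func brighten;
    brighten(x) = input(x) + 1;
    // Emits brighten.a and brighten.h for the host target.
    brighten.compile_to_static_library("brighten", {input}, "brighten");
    return 0;
}
```

The caller then links brighten.a, includes the generated brighten.h, and calls the emitted function with input and output buffers directly.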

We don't have a good answer for bailing out more gracefully when we run out of stack, unfortunately (e.g. throwing an exception that the program is too large). It's a problem we're acutely aware of though.


liwchang commented 6 years ago

Hi Andrew, Thanks for pointing it out. I will seriously consider AOT again.

Thanks a lot

abadams commented 6 years ago

We should probably also try to reduce the overhead of realize. I think it's all in prepare_jit_call_arguments, and that function looks like it could be optimized. Zalman and/or I will take a look at it.


steven-johnson commented 6 years ago

Just throwing this out there:

could we consider a bold (breaking?) API change that made JIT calls look like AOT calls?

zvookin commented 6 years ago

compile_jit plus ParamMap effectively delivers the two things Steven mentioned. It sounds like there are a couple of efficiency issues. (And there is still an open bug that concurrency seems to cause a slowdown; they may be related.)

We could add a templated method that returns the fully typed function pointer, I suppose. The issue is that it is only useful for JITting if one knows the C++ function signature at C++ compile time. (And I believe we already have an accessor to get the function pointer, so this is already doable.)