Infinoid opened this issue 3 years ago
Have you enabled large_buffers for your target? https://github.com/halide/Halide/search?q=large_buffers&type=code
IIRC that controls memory indices, and enables 64-bit sizes.
Yes. But this bug is about the Buffer class's overloads and accessor methods. The test program demonstrates problems without doing any code generation at all.
It appears that this example triggers an overflow in the stride of the third dimension, because `halide_dimension_t` stores `stride` as an `int32_t`. We likely need better overflow checking (which I will look into), or need to use `int64_t` instead.
@abadams Would it be reasonable to have an optional setting for using `int64_t` for `halide_dimension_t`? I assume that the use of `int32_t` comes from Halide's use of Int32 for indices, and that the lack of overflow checking comes from the assumption that Int32 (and Int64) expressions don't overflow - but this seems dangerous in the case of the `stride` of a dimension.
In the near-term, it would be great if these overflows could be detected before they turn into crashes.
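One possible shape for such a check (a hypothetical sketch; `strides_fit_int32` is not an existing Halide function): compute each dimension's dense stride in 64-bit arithmetic and reject the buffer if any of them would not fit in an `int32_t`.

```cpp
#include <cstdint>

// Sketch of an overflow check for dense (planar) strides. Dimension i's
// dense stride is the product of the extents of dimensions 0..i-1.
bool strides_fit_int32(const int32_t *extents, int dims) {
    int64_t stride = 1;
    for (int i = 0; i < dims; i++) {
        // Check before multiplying: this also guarantees stride <= 2^31
        // at the multiply, so the int64 product itself cannot overflow.
        if (stride > INT32_MAX) return false;
        stride *= extents[i];
    }
    return true;
}
```

With the extents from this issue, `{40960, 81920, 3}` would be rejected (the third dimension's stride is 40960*81920), while `{40960, 20480, 3}` passes.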
My experience with the codebase is somewhat limited, but I don't see an easy fix to this stride issue. Just trying to change the `stride` type to `int64_t` seems non-trivial - the assumption about the `int` types seems to permeate the codebase. For example, after changing the `stride` handling in the runtime code (`HalideBuffer.h`, `HalideRuntime.h`, and `halide_buffer_t.cpp`), I then get:
/home/cimes/casper/Halide-master/src/AddImageChecks.cpp:514:40: error: use of overloaded operator '=' is ambiguous (with operand types 'Halide::Expr' and 'int64_t' (aka 'long'))
stride_constrained = image.dim(i).stride();
~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~
/home/cimes/casper/Halide-master/src/Expr.h:256:8: note: candidate function (the implicit copy assignment operator)
struct Expr : public Internal::IRHandle {
^
I expect this is just the tip of the iceberg.
I've looked at the `int` datatype choice(s) previously as we've experimented with very large data sizes (which already involved some fixes in Halide). For example, even the INT_MAX constraint on a single dimension's size is limiting. I expect a larger conversation is needed about how to support large data sizes in Halide; some fairly invasive changes appear to be needed. This support is definitely of interest to us, though.
The issue here is indeed that halide_dimension_t stores a stride as an int32_t, and changing this struct is nearly impossible. Last time I tried it took me two years, and I could only do it because I could test against all the code in Google. int32s for addressing pervade Halide because they cover most use-cases and have a real performance advantage - address computations for gathers can vectorize twice as wide on SIMD hardware if they can stay in 32-bits.
The only practical way to address this issue is what @cimes-isi says: add assertions in the Buffer constructor that detect when a stride is going to overflow and throw an error instead.
I expect we'll ultimately add a target flag that makes the compiler use a new buffer type with 64-bit fields. This is not a short term thing, but I expect it will happen in the next few years. There are other changes we might consider when doing this as well to make further changes to the buffer type easier.
If this sounds like option creep, consider it as part of a long-term ideal to make the independent parts of the Halide compiler reusable. In that model, we should be able to generate interfaces for different uses more easily -- e.g. different language bindings, etc. If we're already doing that, and the compiler supports full 64-bit addresses and indexes internally, then it's just another wrapper generation with slightly different structures, plus configuring the compilation chain to generate the right code for it.
A possible workaround here is to just move the dimension that has an overflowing stride to be a tuple instead. In this case, `buf` could be 3 separate 2D buffers, passed to the Halide pipeline as a tuple. The first step of the pipeline could be a func like:
Func input;
input(x, y, c) = mux(c, {raw_input(x, y)[0], raw_input(x, y)[1], raw_input(x, y)[2]});
(This might be the same thing as suggested by @youngpm on gitter)
> The issue here is indeed that halide_dimension_t stores a stride as an int32_t, and changing this struct is nearly impossible. Last time I tried it took me two years, and I could only do it because I could test against all the code in Google. int32s for addressing pervade Halide because they cover most use-cases and have a real performance advantage - address computations for gathers can vectorize twice as wide on SIMD hardware if they can stay in 32-bits.
Dear @abadams, this is quite interesting, thanks for sharing. I'm curious whether the challenge in changing `halide_dimension_t` to 64-bit is mostly a performance concern, or whether you encountered other significant practical limitations. If it's predominantly a performance issue, it may be worthwhile to at least offer this as a compile-time choice, in order to experiment with it and gain a better understanding of the severity of the problem.
The design is the way it is for performance, as described above. So changing it by default can't be done. Adding it as a separate version of halide_buffer_t and halide_dimension_t would be viable, but quite a lot of work. As well as making changes inside the compiler, there's a bunch of support code for dealing with buffers that would need to be duplicated or parameterized (e.g. src/runtime/halide_buffer_t.cpp and src/runtime/HalideRuntime.h).
> address computations for gathers can vectorize twice as wide on SIMD hardware if they can stay in 32-bits.
When does this come into play? The histogram example in tutorial 09, perhaps? Or strided vector reads/writes?
Vectorized data is very obvious in Halide, as it's explicit in the scheduling language. But I don't have a clear idea of where or how often vectorized address computations occur.
Could you point to something in the example apps that would be hit particularly hard by this?
apps/local_laplacian and apps/bilateral_grid both do vectorized address computations in the final stage. It comes up quite a lot. Anything with a look-up table also needs it.
Thanks, that helps a lot.
@abadams you wrote:
> The design is the way it is for performance, as described above. So changing it by default can't be done.
If the choice is for performance, then I can understand that most users would want to keep the current behavior. But an optional choice seems sensible, e.g. at build time: basically, the Halide buffer stride type could be a typedef that users can override at build time? It would open Halide to new workloads, at least in a test setup.
Bumping up the issue. Actually, I think it would make sense to separate the stride and index types. This would keep index calculations fast, only slowing down the calculation of offsets.

Since Halide targets mostly image processing, it's rather unlikely that a single extent will exceed 2G - but that's not true for strides. There are use cases in aerial photography, astronomy, microscopy (for medical and materials sciences), and high-resolution scans of works of art, all of which can easily exceed 2 Gpixels. Moving to 3D domains (e.g. for high-resolution CT) also makes 2G look small.

Also, while the interface type for stride could be 64-bit, there could be a flag for enabling large buffers in codegen, kept disabled by default, which raises an error when a >2G stride is encountered. With the flag disabled, the strides could be demoted to 32-bit, keeping performance. This approach avoids duplicating the interface structures and mostly requires backend work.
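A rough sketch of that separation in plain C++ (none of these names exist in Halide; this only illustrates the proposal): per-dimension coordinates stay 32-bit, while strides and the accumulated offset are 64-bit.

```cpp
#include <cstdint>

// Sketch: 32-bit coordinates (cheap to vectorize), 64-bit strides and
// offsets (so a >2G plane stride cannot wrap). Hypothetical helper only.
int64_t flat_offset(const int32_t *coord, const int64_t *stride, int dims) {
    int64_t off = 0;
    for (int i = 0; i < dims; i++) {
        // Widen each coordinate before the multiply so the per-dimension
        // product is computed in 64 bits.
        off += static_cast<int64_t>(coord[i]) * stride[i];
    }
    return off;
}
```

For the buffer in this issue, the offset of element (0,0,1) would come out as 40960*81920 = 3,355,443,200, which is representable in the 64-bit offset even though it overflows `int32_t`.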
If I use a `Buffer<uint8_t>` to store a 40K by 80K color image in planar RGB format, I get weird pointer problems when I try to use it.

Buffer<uint8_t> buf(40960, 81920, 3);

Here are the problems I see:

- The size accessor returns an `int`, and returns a negative number in this case. INT_MAX is `2^31`, and `40960*81920` is greater than that.
- `buf.begin()` and `&buf(0,0,0)` produce 2 different pointers.
- `&buf(0,0,1)` returns a pointer whose address is less than `&buf(0,0,0)`.
- Dereferencing the `&buf(0,0,1)` pointer can segfault.

Here is a test program that demonstrates this: https://gist.github.com/Infinoid/cc3a4f392a563e1d4b3c318718593869

It fails when run with the current Halide master (89f5ee7).

If you change line 11 to `#define Y 20480`, it now fits within 4GB and the test program succeeds.