With the removal of this pass, the optimized vectorized IR for conv2d (see https://github.com/nod-ai/iree-amd-aie/issues/820, file vectorized_input.opt.ll) changes from
%9 = load <32 x bfloat>, ptr getelementptr inbounds ([144 x bfloat], ptr @buff_6, i20 0, i20 8), align 64
to
%9 = load <32 x bfloat>, ptr getelementptr inbounds ([144 x bfloat], ptr @buff_6, i20 0, i20 8), align 2
This latter IR, to my LLVM-novice brain, looks much more sensible (how can the alignment be 64 bytes when the offset from the pointer is 8 bfloats, i.e. 16 bytes?).
Also, this makes conv numerics pass :D
TMI, but this being the bug also explains why the issue seemed to be with the offset in the input image data, not the kernel data. The kernel is read with a stride of 32 bfloats (64 bytes), so the alignment of 64 was correct in that case. The input image data is read with a stride of 8 bfloats (16 bytes), so assuming an alignment of 64 bytes meant reading the wrong data from the image patch. I also observed that the image data was not being offset by enough, which agrees with the bug.
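For anyone who wants to check the alignment arithmetic above, here is a minimal Python sketch. It rests on two assumptions that are not confirmed anywhere in this PR: that @buff_6 itself starts on a 64-byte boundary, and that an over-aligned vector load behaves as if the low address bits were dropped. The helper names (`byte_offset`, `max_provable_alignment`, `round_down`) are made up for illustration.

```python
BF16_BYTES = 2  # a bfloat occupies 2 bytes

# Assumption (not stated in the PR): the buffer base is 64-byte aligned.
ASSUMED_BASE_ALIGN = 64


def byte_offset(elem_index: int, elem_bytes: int = BF16_BYTES) -> int:
    """Byte offset of element `elem_index` from the start of the buffer."""
    return elem_index * elem_bytes


def max_provable_alignment(base_align: int, offset_bytes: int) -> int:
    """Largest power-of-two alignment guaranteed for base + offset.

    `base_align` must be a power of two.
    """
    if offset_bytes == 0:
        return base_align
    lowest_set_bit = offset_bytes & -offset_bytes
    return min(base_align, lowest_set_bit)


def round_down(addr: int, align: int) -> int:
    """Address seen if the low bits are dropped (hypothetical hardware model)."""
    return addr & ~(align - 1)


image_off = byte_offset(8)    # image stride: 8 bfloats
kernel_off = byte_offset(32)  # kernel stride: 32 bfloats

image_align = max_provable_alignment(ASSUMED_BASE_ALIGN, image_off)
kernel_align = max_provable_alignment(ASSUMED_BASE_ALIGN, kernel_off)

# Where each load would land if it were wrongly treated as 64-byte aligned.
image_seen = round_down(image_off, 64)
kernel_seen = round_down(kernel_off, 64)
```

Under these assumptions the image load at byte offset 16 can only be proven 16-byte aligned, so `align 64` overstates it, and rounding down would land it on byte 0, i.e. not offset by enough. The kernel load at byte offset 64 really is 64-byte aligned and is unaffected. `align 2` is the conservative element alignment and is valid for both.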