I420ToBGRARow_NEON EXC_BAD_ACCESS on arm7/iOS

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Call ConvertFromI420
2. ConvertFromI420 calls I420ToBGRARow
3. I420ToBGRARow calls I420ToBGRARow_NEON

What is the expected output? What do you see instead?
GDB lands on asm volatile ( with EXC_BAD_ACCESS

What version of the product are you using? On what operating system?
Latest on iOS 5.1 arm7.

Please provide any additional information below.

y_buf   const uint8 *   0x582d000
*y_buf  unsigned char   94 '^'
u_buf   const uint8 *   0x5857300
v_buf   const uint8 *   0x5861bc0
rgb_buf uint8 * 0x586d000
width   int 480

Original issue reported on code.google.com by J...@junglecat.org on 12 Mar 2012 at 4:53

GoogleCodeExporter commented 9 years ago

Did you really want BGRA?  ARGB is more likely what you want?  - ARGB is a in 
the high byte of a register, or b,g,r,a in memory.

Is this 480x270?  Is the buffer at least 518400 bytes, and stride 1920?

Original comment by fbarch...@google.com on 12 Mar 2012 at 5:54

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I am capturing in 480x360 32BGRA. The length in bytes is 691208.  Stride is 
1920. So the conversion to I420 is OK.

I think that you are correct that I actually want ARGB to render to screen 
however I try that and it crashes in I420ToARGBRow_NEON.

The I420 frame is 259200 bytes in length. 480 x 360 with a stride of 544.

The variables to I420ToARGBRow_NEON are:

y_buf   const uint8 *   0x46e9000
u_buf   const uint8 *   0x4713300
v_buf   const uint8 *   0x471dbc0
rgb_buf uint8 * 0x4729000
width   int 480

Original comment by J...@junglecat.org on 12 Mar 2012 at 5:57

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

If you capture 480x360 I420, you should have 480*360=172800 bytes for the Y 
plane, 43200 bytes for U plane and 43200 bytes for V plane.
0x4713300 - 0x46E9000 = 0x2A300 = 172800.  Looks good.
0x471dbc0 - 0x4713300 = 0xA8C0 = 43200.  Also good.
You should be passing 480 for y_stride and 240 for u_stride and v_stride.
Recent versions take a dst_sample_stride=480*4 = 1920.  Sounds good.
Buffer should be 480 * 360 * 4 = 691200.  You provided 8 extra bytes, but your 
pointer is aligned, so I don't see an issue.

Try the C version instead, to see if its a Neon specific issue?
libyuv::MaskCpuFlags(~libyuv::kCpuHasNEON);

Update to a recent version - 214 preferred.  The old ConvertFromI420 didn't 
support stride.  Or try calling I420ToARGB() directly.

Original comment by fbarch...@google.com on 15 Mar 2012 at 10:11

GoogleCodeExporter commented 9 years ago

I tried the C version and it works. Slow but it works.

Original comment by J...@junglecat.org on 16 Mar 2012 at 12:07

GoogleCodeExporter commented 9 years ago

Do you know what version of libyuv you're on?

Original comment by fbarch...@google.com on 16 Mar 2012 at 2:07

GoogleCodeExporter commented 9 years ago

The latest from SVN.

Original comment by J...@junglecat.org on 17 Mar 2012 at 5:52

GoogleCodeExporter commented 9 years ago

Looking at the code, I don't immediately spot any issues.
At line 24 in 
http://code.google.com/p/libyuv/source/browse/trunk/source/row_neon.cc
It depends on the ability to vector load into the 2nd word of a register.  I'm 
wondering if clang that you use assembles that differently than gcc?  Is there 
a way you can do a gcc build to confirm?

Original comment by fbarch...@google.com on 20 Mar 2012 at 8:54

GoogleCodeExporter commented 9 years ago

Here's the supported compilers as used by Xcode 4.3.1.

(Default) Apple LLVM compiler 3.1
LLVM GCC 4.2

I've done a build using LLVM GCC 4.2 and I get EXC_BAD_ACCESS at line 86 of 
row_neon.cc.

y_buf   uint8 * 0x56db000
u_buf   uint8 * 0x5705300
v_buf   uint8 * 0x570fbc0
rgb_buf uint8 * 0x571b000
width   int 480

Original comment by J...@junglecat.org on 22 Mar 2012 at 11:08

GoogleCodeExporter commented 9 years ago

line 86 is   );
... its just saying its somewhere in that assembly function.  Do you get a 
disassembly of the instruction that crashed?
Is it these ones?
vld1.u32   {d2[0]}, [%1]! 
vld1.u32   {d2[1]}, [%2]!

The idea is to load 4 bytes of U and 4 bytes of V and post increment.
If it thinks its 8 or 16, it might post increment off the end of the buffers.
As a test you could remove the !
The color would be wrong but it would confirm if its reading/incrementing too 
much.

Any Neon / clang experts out there?
This is the function in question:
#define YUVTORGB                                                               \
    "vld1.u8    {d0}, [%0]!                    \n"                             \
    "vld1.u32   {d2[0]}, [%1]!                 \n"                             \
    "vld1.u32   {d2[1]}, [%2]!                 \n"                             \
    "veor.u8    d2, d26                        \n"/*subtract 128 from u and v*/\
    "vmull.s8   q8, d2, d24                    \n"/*  u/v B/R component      */\
    "vmull.s8   q9, d2, d25                    \n"/*  u/v G component        */\
    "vmov.u8    d1, #0                         \n"/*  split odd/even y apart */\
    "vtrn.u8    d0, d1                         \n"                             \
    "vsub.s16   q0, q0, q15                    \n"/*  offset y               */\
    "vmul.s16   q0, q0, q14                    \n"                             \
    "vadd.s16   d18, d19                       \n"                             \
    "vqadd.s16  d20, d0, d16                   \n"                             \
    "vqadd.s16  d21, d1, d16                   \n"                             \
    "vqadd.s16  d22, d0, d17                   \n"                             \
    "vqadd.s16  d23, d1, d17                   \n"                             \
    "vqadd.s16  d16, d0, d18                   \n"                             \
    "vqadd.s16  d17, d1, d18                   \n"                             \
    "vqrshrun.s16 d0, q10, #6                  \n"                             \
    "vqrshrun.s16 d1, q11, #6                  \n"                             \
    "vqrshrun.s16 d2, q8, #6                   \n"                             \
    "vmovl.u8   q10, d0                        \n"/*  set up for reinterleave*/\
    "vmovl.u8   q11, d1                        \n"                             \
    "vmovl.u8   q8, d2                         \n"                             \
    "vtrn.u8    d20, d21                       \n"                             \
    "vtrn.u8    d22, d23                       \n"                             \
    "vtrn.u8    d16, d17                       \n"                             \
void I420ToARGBRow_NEON(const uint8* y_buf,
                        const uint8* u_buf,
                        const uint8* v_buf,
                        uint8* rgb_buf,
                        int width) {
  asm volatile(
    "vld1.u8    {d24}, [%5]                    \n"
    "vld1.u8    {d25}, [%6]                    \n"
    "vmov.u8    d26, #128                      \n"
    "vmov.u16   q14, #74                       \n"
    "vmov.u16   q15, #16                       \n"
  "1:                                          \n"
YUVTORGB
    "vmov.u8    d21, d16                       \n"
    "vmov.u8    d23, #255                      \n"
    "vst4.u8    {d20, d21, d22, d23}, [%3]!    \n"
    "subs       %4, %4, #8                     \n"
    "bgt        1b                             \n"
    : "+r"(y_buf),    // %0
      "+r"(u_buf),    // %1
      "+r"(v_buf),    // %2
      "+r"(rgb_buf),  // %3
      "+r"(width)     // %4
    : "r"(kUVToRB),   // %5
      "r"(kUVToG)     // %6
    : "cc", "memory", "q0", "q1", "q2", "q3", "q8", "q9",
                      "q10", "q11", "q12", "q13", "q14", "q15"
  );
}

Original comment by fbarch...@google.com on 22 Mar 2012 at 11:53

GoogleCodeExporter commented 9 years ago

Would it be possible to get your object file?   (attach a file)
I'd like to check the disassembly (otool -tV) matches the source.

Original comment by fbarch...@google.com on 23 Mar 2012 at 12:09

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Unix Executable File

Original comment by J...@junglecat.org on 23 Mar 2012 at 6:05

Attachments:

Media

GoogleCodeExporter commented 9 years ago

I was unable to obtain a disassembly.

Original comment by jungleca...@gmail.com on 24 Mar 2012 at 12:11

GoogleCodeExporter commented 9 years ago

Disassembling you binary worked, but it contains no Neon?
Are you able to point to the assembly instruction that failed?

Original comment by fbarch...@google.com on 28 Mar 2012 at 11:50

GoogleCodeExporter commented 9 years ago

My apologies but that was an older binary. I've just built this one.

Original comment by J...@junglecat.org on 29 Mar 2012 at 2:20

Attachments:

Media.zip

GoogleCodeExporter commented 9 years ago

The latest binary EXC_BAD_ACCESS lands at row_neon.cc line 64 using the latest 
source from svn. 

y_buf   const uint8 *   0x528a000
u_buf   const uint8 *   0x52a2c00
v_buf   const uint8 *   0x52a8f00
rgb_buf uint8 * 0x52b0000
width   int 352

I was unable to obtain a disassembly.

Original comment by J...@junglecat.org on 29 Mar 2012 at 2:25

GoogleCodeExporter commented 9 years ago

arm7?  Does your iPhone/iPod have NEON?
iPhone3GS or iPad are required.
The same libyuv library will work, but you need to disable Neon.

Original comment by fbarch...@google.com on 29 Mar 2012 at 2:37

GoogleCodeExporter commented 9 years ago

Yes it is ARM7. Yes it is for iPhone 4 and iPad 2. I'm attaching a new binary 
which I've disabled NEON.

  //cpu_info_ = kCpuHasNEON | kCpuInitialized;
  cpu_info_ = kCpuInitialized;

So the last binary is the same as this one except it has NEON support built in. 
The first binary is problematic and the second works.

Original comment by J...@junglecat.org on 29 Mar 2012 at 2:52

Attachments:

Media.zip

GoogleCodeExporter commented 9 years ago

Is there anything I can do to provide more information to help resolve this?

Interestingly ConvertToI420 and I420ToBGRA are the most expensive calls in my 
iOS test app. Oddly VP8(webm) encoding and decoding are less expensive.

I have to assume this is because of the lack of use of NEON.

Original comment by J...@junglecat.org on 2 Apr 2012 at 4:24

Attachments:

[Screen Shot 2012-04-02 at 12.20.25 AM.png](https://storage.googleapis.com/google-code-attachments/libyuv/issue-19/comment-22/Screen Shot 2012-04-02 at 12.20.25 AM.png)

GoogleCodeExporter commented 9 years ago

Hi J... sorry for slow response.

> Re  Is there anything I can do to provide more information to help resolve 
this?

Yes, the exact assembly instruction that fails and a register dump would help.

> Re  Interestingly ConvertToI420 and I420ToBGRA are the most expensive calls 
in my iOS test app. Oddly VP8(webm) encoding and decoding are less expensive.

This isn't entirely unexpected.  In my apps, color conversion can be the single 
most expensive operation.  As a whole, encoding would take more, but when you 
profile, the conversions are the slowest, and least useful functions.  Which is 
why its good to optimize them, but better to avoid conversions entirely.  Stick 
with YUV for capture and renderning if you can.

>Re I have to assume this is because of the lack of use of NEON.

No, but thats why I asked. Any iPad, or iPhone3GS or better has Neon.
Since your CPU has Neon, the problem is elsewhere.

I asked for your binary because you're using a different compiler, and I've 
seen issues with assembly between gcc and clang on a couple low level details:
1. the @ notation for alignment
Solution: I've avoided it
Downside: Potential performance loss - aligned loads may be faster
2. passing of multimedia arrays from C to inline
Solution: typedef for sized vectors 
Downside: none
3. the [] notation for accessing register offsets
unknown

Original comment by fbarch...@google.com on 14 Apr 2012 at 1:01

GoogleCodeExporter commented 9 years ago

Because I've heard rumor of people using libyuv with NEON successfully on iOS 
and that I don't believe maintaining the conversion routines is of any benefit 
I'm altering my code to perform capture in YUV and render in YUV as to keep 
conversion to a minimal. 

This effectively means I will only require libyuv for rotation operations.

I really appreciate your help!

Original comment by J...@junglecat.org on 16 Apr 2012 at 6:34

GoogleCodeExporter commented 9 years ago

Reproduced.
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 7f7f7f7f
Stack frame #00 pc 000cb67c 
/data/data/org.webrtc.videoengineapp/lib/libwebrtc-video-demo-jni.so: Routine 
I420ToARGBRow_NEON in third_party/libyuv/source/row_neon.cc:86
Stack frame #01 pc 000c75e4 
/data/data/org.webrtc.videoengineapp/lib/libwebrtc-video-demo-jni.so: Routine 
I420ToRGB565 in third_party/libyuv/source/convert_from.cc:950

    "vld1.u8    {d24}, [%5]                    \n"
  : "r"(kUVToRB),   // %5

It was passing the vector itself, rather than its address and then tried to 
load from that address.
Also affected rotate.
r246 has a fix.

Original comment by fbarch...@google.com on 21 Apr 2012 at 1:08

GoogleCodeExporter commented 9 years ago

Original comment by fbarch...@google.com on 21 Apr 2012 at 1:18

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

A related issue was found in scale.  vector constants are treated as values, so 
& is needed to get their pointer for a vld.
On x86 "m" is used to refer to constants.  The instructions will operate 
directly on memory.

r247 fixes the scale for Neon issue.

Original comment by fbarch...@google.com on 22 Apr 2012 at 3:06

Changed state: Started

GoogleCodeExporter commented 9 years ago

r247 checked into chromium.

Original comment by fbarch...@google.com on 25 Apr 2012 at 2:59

GoogleCodeExporter commented 9 years ago

Original comment by fbarch...@google.com on 1 May 2012 at 2:58

Changed state: Fixed

iamnpc / libyuv

I420ToBGRARow_NEON EXC_BAD_ACCESS on arm7/iOS #19