Closed GoogleCodeExporter closed 9 years ago
Did you really want BGRA? ARGB is more likely what you want? - ARGB is a in
the high byte of a register, or b,g,r,a in memory.
Is this 480x270? Is the buffer at least 518400 bytes, and stride 1920?
Original comment by fbarch...@google.com
on 12 Mar 2012 at 5:54
[deleted comment]
[deleted comment]
I am capturing in 480x360 32BGRA. The length in bytes is 691208. Stride is
1920. So the conversion to I420 is OK.
I think that you are correct that I actually want ARGB to render to screen
however I try that and it crashes in I420ToARGBRow_NEON.
The I420 frame is 259200 bytes in length. 480 x 360 with a stride of 544.
The variables to I420ToARGBRow_NEON are:
y_buf const uint8 * 0x46e9000
u_buf const uint8 * 0x4713300
v_buf const uint8 * 0x471dbc0
rgb_buf uint8 * 0x4729000
width int 480
Original comment by J...@junglecat.org
on 12 Mar 2012 at 5:57
[deleted comment]
If you capture 480x360 I420, you should have 480*360=172800 bytes for the Y
plane, 43200 bytes for U plane and 43200 bytes for V plane.
0x4713300 - 0x46E9000 = 0x2A300 = 172800. Looks good.
0x471dbc0 - 0x4713300 = 0xA8C0 = 43200. Also good.
You should be passing 480 for y_stride and 240 for u_stride and v_stride.
Recent versions take a dst_sample_stride=480*4 = 1920. Sounds good.
Buffer should be 480 * 360 * 4 = 691200. You provided 8 extra bytes, but your
pointer is aligned, so I don't see an issue.
Try the C version instead, to see if its a Neon specific issue?
libyuv::MaskCpuFlags(~libyuv::kCpuHasNEON);
Update to a recent version - 214 preferred. The old ConvertFromI420 didn't
support stride. Or try calling I420ToARGB() directly.
Original comment by fbarch...@google.com
on 15 Mar 2012 at 10:11
I tried the C version and it works. Slow but it works.
Original comment by J...@junglecat.org
on 16 Mar 2012 at 12:07
Do you know what version of libyuv you're on?
Original comment by fbarch...@google.com
on 16 Mar 2012 at 2:07
The latest from SVN.
Original comment by J...@junglecat.org
on 17 Mar 2012 at 5:52
Looking at the code, I don't immediately spot any issues.
At line 24 in
http://code.google.com/p/libyuv/source/browse/trunk/source/row_neon.cc
It depends on the ability to vector load into the 2nd word of a register. I'm
wondering if clang that you use assembles that differently than gcc? Is there
a way you can do a gcc build to confirm?
Original comment by fbarch...@google.com
on 20 Mar 2012 at 8:54
Here's the supported compilers as used by Xcode 4.3.1.
(Default) Apple LLVM compiler 3.1
LLVM GCC 4.2
I've done a build using LLVM GCC 4.2 and I get EXC_BAD_ACCESS at line 86 of
row_neon.cc.
y_buf uint8 * 0x56db000
u_buf uint8 * 0x5705300
v_buf uint8 * 0x570fbc0
rgb_buf uint8 * 0x571b000
width int 480
Original comment by J...@junglecat.org
on 22 Mar 2012 at 11:08
line 86 is );
... its just saying its somewhere in that assembly function. Do you get a
disassembly of the instruction that crashed?
Is it these ones?
vld1.u32 {d2[0]}, [%1]!
vld1.u32 {d2[1]}, [%2]!
The idea is to load 4 bytes of U and 4 bytes of V and post increment.
If it thinks its 8 or 16, it might post increment off the end of the buffers.
As a test you could remove the !
The color would be wrong but it would confirm if its reading/incrementing too
much.
Any Neon / clang experts out there?
This is the function in question:
#define YUVTORGB \
"vld1.u8 {d0}, [%0]! \n" \
"vld1.u32 {d2[0]}, [%1]! \n" \
"vld1.u32 {d2[1]}, [%2]! \n" \
"veor.u8 d2, d26 \n"/*subtract 128 from u and v*/\
"vmull.s8 q8, d2, d24 \n"/* u/v B/R component */\
"vmull.s8 q9, d2, d25 \n"/* u/v G component */\
"vmov.u8 d1, #0 \n"/* split odd/even y apart */\
"vtrn.u8 d0, d1 \n" \
"vsub.s16 q0, q0, q15 \n"/* offset y */\
"vmul.s16 q0, q0, q14 \n" \
"vadd.s16 d18, d19 \n" \
"vqadd.s16 d20, d0, d16 \n" \
"vqadd.s16 d21, d1, d16 \n" \
"vqadd.s16 d22, d0, d17 \n" \
"vqadd.s16 d23, d1, d17 \n" \
"vqadd.s16 d16, d0, d18 \n" \
"vqadd.s16 d17, d1, d18 \n" \
"vqrshrun.s16 d0, q10, #6 \n" \
"vqrshrun.s16 d1, q11, #6 \n" \
"vqrshrun.s16 d2, q8, #6 \n" \
"vmovl.u8 q10, d0 \n"/* set up for reinterleave*/\
"vmovl.u8 q11, d1 \n" \
"vmovl.u8 q8, d2 \n" \
"vtrn.u8 d20, d21 \n" \
"vtrn.u8 d22, d23 \n" \
"vtrn.u8 d16, d17 \n" \
void I420ToARGBRow_NEON(const uint8* y_buf,
const uint8* u_buf,
const uint8* v_buf,
uint8* rgb_buf,
int width) {
asm volatile(
"vld1.u8 {d24}, [%5] \n"
"vld1.u8 {d25}, [%6] \n"
"vmov.u8 d26, #128 \n"
"vmov.u16 q14, #74 \n"
"vmov.u16 q15, #16 \n"
"1: \n"
YUVTORGB
"vmov.u8 d21, d16 \n"
"vmov.u8 d23, #255 \n"
"vst4.u8 {d20, d21, d22, d23}, [%3]! \n"
"subs %4, %4, #8 \n"
"bgt 1b \n"
: "+r"(y_buf), // %0
"+r"(u_buf), // %1
"+r"(v_buf), // %2
"+r"(rgb_buf), // %3
"+r"(width) // %4
: "r"(kUVToRB), // %5
"r"(kUVToG) // %6
: "cc", "memory", "q0", "q1", "q2", "q3", "q8", "q9",
"q10", "q11", "q12", "q13", "q14", "q15"
);
}
Original comment by fbarch...@google.com
on 22 Mar 2012 at 11:53
Would it be possible to get your object file? (attach a file)
I'd like to check the disassembly (otool -tV) matches the source.
Original comment by fbarch...@google.com
on 23 Mar 2012 at 12:09
[deleted comment]
Unix Executable File
Original comment by J...@junglecat.org
on 23 Mar 2012 at 6:05
Attachments:
I was unable to obtain a disassembly.
Original comment by jungleca...@gmail.com
on 24 Mar 2012 at 12:11
Disassembling you binary worked, but it contains no Neon?
Are you able to point to the assembly instruction that failed?
Original comment by fbarch...@google.com
on 28 Mar 2012 at 11:50
My apologies but that was an older binary. I've just built this one.
Original comment by J...@junglecat.org
on 29 Mar 2012 at 2:20
Attachments:
The latest binary EXC_BAD_ACCESS lands at row_neon.cc line 64 using the latest
source from svn.
y_buf const uint8 * 0x528a000
u_buf const uint8 * 0x52a2c00
v_buf const uint8 * 0x52a8f00
rgb_buf uint8 * 0x52b0000
width int 352
I was unable to obtain a disassembly.
Original comment by J...@junglecat.org
on 29 Mar 2012 at 2:25
arm7? Does your iPhone/iPod have NEON?
iPhone3GS or iPad are required.
The same libyuv library will work, but you need to disable Neon.
Original comment by fbarch...@google.com
on 29 Mar 2012 at 2:37
Yes it is ARM7. Yes it is for iPhone 4 and iPad 2. I'm attaching a new binary
which I've disabled NEON.
//cpu_info_ = kCpuHasNEON | kCpuInitialized;
cpu_info_ = kCpuInitialized;
So the last binary is the same as this one except it has NEON support built in.
The first binary is problematic and the second works.
Original comment by J...@junglecat.org
on 29 Mar 2012 at 2:52
Attachments:
Is there anything I can do to provide more information to help resolve this?
Interestingly ConvertToI420 and I420ToBGRA are the most expensive calls in my
iOS test app. Oddly VP8(webm) encoding and decoding are less expensive.
I have to assume this is because of the lack of use of NEON.
Original comment by J...@junglecat.org
on 2 Apr 2012 at 4:24
Attachments:
Hi J... sorry for slow response.
> Re Is there anything I can do to provide more information to help resolve
this?
Yes, the exact assembly instruction that fails and a register dump would help.
> Re Interestingly ConvertToI420 and I420ToBGRA are the most expensive calls
in my iOS test app. Oddly VP8(webm) encoding and decoding are less expensive.
This isn't entirely unexpected. In my apps, color conversion can be the single
most expensive operation. As a whole, encoding would take more, but when you
profile, the conversions are the slowest, and least useful functions. Which is
why its good to optimize them, but better to avoid conversions entirely. Stick
with YUV for capture and renderning if you can.
>Re I have to assume this is because of the lack of use of NEON.
No, but thats why I asked. Any iPad, or iPhone3GS or better has Neon.
Since your CPU has Neon, the problem is elsewhere.
I asked for your binary because you're using a different compiler, and I've
seen issues with assembly between gcc and clang on a couple low level details:
1. the @ notation for alignment
Solution: I've avoided it
Downside: Potential performance loss - aligned loads may be faster
2. passing of multimedia arrays from C to inline
Solution: typedef for sized vectors
Downside: none
3. the [] notation for accessing register offsets
unknown
Original comment by fbarch...@google.com
on 14 Apr 2012 at 1:01
Because I've heard rumor of people using libyuv with NEON successfully on iOS
and that I don't believe maintaining the conversion routines is of any benefit
I'm altering my code to perform capture in YUV and render in YUV as to keep
conversion to a minimal.
This effectively means I will only require libyuv for rotation operations.
I really appreciate your help!
Original comment by J...@junglecat.org
on 16 Apr 2012 at 6:34
Reproduced.
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 7f7f7f7f
Stack frame #00 pc 000cb67c
/data/data/org.webrtc.videoengineapp/lib/libwebrtc-video-demo-jni.so: Routine
I420ToARGBRow_NEON in third_party/libyuv/source/row_neon.cc:86
Stack frame #01 pc 000c75e4
/data/data/org.webrtc.videoengineapp/lib/libwebrtc-video-demo-jni.so: Routine
I420ToRGB565 in third_party/libyuv/source/convert_from.cc:950
"vld1.u8 {d24}, [%5] \n"
: "r"(kUVToRB), // %5
It was passing the vector itself, rather than its address and then tried to
load from that address.
Also affected rotate.
r246 has a fix.
Original comment by fbarch...@google.com
on 21 Apr 2012 at 1:08
Original comment by fbarch...@google.com
on 21 Apr 2012 at 1:18
A related issue was found in scale. vector constants are treated as values, so
& is needed to get their pointer for a vld.
On x86 "m" is used to refer to constants. The instructions will operate
directly on memory.
r247 fixes the scale for Neon issue.
Original comment by fbarch...@google.com
on 22 Apr 2012 at 3:06
r247 checked into chromium.
Original comment by fbarch...@google.com
on 25 Apr 2012 at 2:59
Original comment by fbarch...@google.com
on 1 May 2012 at 2:58
Original issue reported on code.google.com by
J...@junglecat.org
on 12 Mar 2012 at 4:53