jtanx closed this 9 years ago
Hmm, instead of:
vld3.8 {d0-d2}, [r0]!
vswp d0, d2
vst3.8 {d0-d2}, [r1]!
what about this?
vld3.8 {d0-d2}, [r0]!
vst3.8 {d2,d1,d0}, [r1]!
On an RPi 2, the BGR2RGB routine allows 800x600 processing in real time.
http://pulsar.webshaker.net/ccc/result.php?lng=us
loop:
pld [r0, #192] @Preload 3 cache lines ahead
vld3.8 {d0-d2}, [r0]! @Load 8 pixels
vld3.8 {d3-d5}, [r0]! @Load another 8 pixels
vst3.8 {d2,d1,d0}, [r1]! @Store 8 pixels
vst3.8 {d5,d4,d3}, [r1]! @Store another 8 pixels
subs r2, r2, #1 @Decrement counter
bgt loop @Loop check
bx lr
Nope, can't do that. The vst3 register list has to be in ascending order, so {d2,d1,d0} won't assemble. Must use vswp.
http://stackoverflow.com/questions/11890997/using-arm-neon-intrinsics-to-add-alpha-and-permute
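For anyone following along with intrinsics rather than hand-written assembly, here is a rough C sketch of the same load / swap / store pattern. It is not the routine from this thread: the function name `bgr_to_rgb` and the pixel-count parameter are made up for illustration, and it handles 8 pixels per iteration instead of 16. The NEON path is only compiled when the compiler defines `__ARM_NEON`; everywhere else the scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the thread): swap the first and third
 * channel of `pixels` packed 24-bit pixels, i.e. BGR -> RGB. */
static void bgr_to_rgb(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* 8 pixels per iteration: de-interleaving load, swap two planes,
     * interleaving store. */
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(src + 3 * i);  /* like vld3.8 {d0-d2}, [r0]! */
        uint8x8_t tmp = px.val[0];              /* like vswp d0, d2 */
        px.val[0] = px.val[2];
        px.val[2] = tmp;
        vst3_u8(dst + 3 * i, px);               /* like vst3.8 {d0-d2}, [r1]! */
    }
#endif

    /* Scalar path: handles the tail, or everything on non-NEON targets. */
    for (; i < pixels; i++) {
        uint8_t b = src[3 * i + 0];
        uint8_t g = src[3 * i + 1];
        uint8_t r = src[3 * i + 2];
        dst[3 * i + 0] = r;
        dst[3 * i + 1] = g;
        dst[3 * i + 2] = b;
    }
}
```

Note the swap is done on the loaded planes before the store, same as the vswp version, since the store always writes the planes in register order.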