sayrer / libyuv

Automatically exported from code.google.com/p/libyuv
BSD 3-Clause "New" or "Revised" License

ARGBToRGB565 neon use vsri #571

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
The aarch64 version uses the sri instruction to shift and mask channels
together, saving instructions. Backport this to 32-bit NEON.
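
For reference, a minimal scalar sketch of the RGB565 packing this kernel
vectorizes (a hypothetical helper, not libyuv code): keep the top 5/6/5 bits
of R/G/B and place R in the high bits.

#include <stdint.h>

/* Hypothetical scalar equivalent of the packing: top 5 bits of R, top 6 of G,
   top 5 of B, laid out as RRRRRGGG GGGBBBBB in a 16-bit pixel. */
static uint16_t PackRGB565(uint8_t r, uint8_t g, uint8_t b) {
  return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}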

Original issue reported on code.google.com by fbarch...@google.com on 24 Feb 2016 at 3:00

GoogleCodeExporter commented 8 years ago
This tutorial covers this particular function
https://community.arm.com/groups/processors/blog/2010/09/01/coding-for-neon--part-4-shifting-left-and-right

This is a port of the row_neon64.cc version of this function, so the
differences between the 32-bit and 64-bit code are minor.

This is the existing 64-bit code:
#define ARGBTORGB565                                                       \
    "shll       v0.8h,  v22.8b, #8             \n"  /* R                */ \
    "shll       v21.8h, v21.8b, #8             \n"  /* G                */ \
    "shll       v20.8h, v20.8b, #8             \n"  /* B                */ \
    "sri        v0.8h,  v21.8h, #5             \n"  /* RG               */ \
    "sri        v0.8h,  v20.8h, #11            \n"  /* RGB              */

This is the 32-bit port:
#define ARGBTORGB565                                                       \
    "vshll.u8    q0, d22, #8                   \n"  /* R                */ \
    "vshll.u8    q8, d21, #8                   \n"  /* G                */ \
    "vshll.u8    q9, d20, #8                   \n"  /* B                */ \
    "vsri.16     q0, q8, #5                    \n"  /* RG               */ \
    "vsri.16     q0, q9, #11                   \n"  /* RGB              */

vsri shifts each element of the source register right by an immediate and
inserts the result into the destination, keeping the destination's top bits
(the ones the inserted value does not cover) unchanged.
e.g.
q0 rrrr_rrrr_0000_0000
q8 gggg_gggg_0000_0000

vsri.16     q0, q8, #5
shifts q8 (G) right by 5
q8 0000_0ggg_gggg_g000
then keeps the top 5 bits of q0 and inserts the shifted value
q0 rrrr_rggg_gggg_g000

vsri.16     q0, q9, #11
then takes B
q9 bbbb_bbbb_0000_0000
shifts down by 11
q9 0000_0000_000b_bbbb
and inserts it into q0, keeping q0's top 11 bits
q0 rrrr_rggg_gggb_bbbb
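
A rough C model of this per-lane behaviour (hypothetical helper names, not
libyuv code), assuming 0 < n < 16:

#include <stdint.h>

/* Model of "vsri.16 d, m, #n" for one 16-bit lane: shift m right by n and
   insert it into d, keeping d's top n bits. */
static uint16_t sri16(uint16_t d, uint16_t m, int n) {
  uint16_t keep = (uint16_t)(0xFFFFu << (16 - n));  /* top n bits of d */
  return (uint16_t)((d & keep) | (m >> n));
}

/* The RGB565 sequence above is then roughly:
     uint16_t p = (uint16_t)(r << 8);       // vshll.u8 #8
     p = sri16(p, (uint16_t)(g << 8), 5);   // RG
     p = sri16(p, (uint16_t)(b << 8), 11);  // RGB */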

If ARGB4444 were done the same way, it would take 7 instructions, the same as it does now.

The current code:
#define ARGBTOARGB4444                                                     \
    "vshr.u8    d20, d20, #4                   \n"  /* B                */ \
    "vbic.32    d21, d21, d4                   \n"  /* G                */ \
    "vshr.u8    d22, d22, #4                   \n"  /* R                */ \
    "vbic.32    d23, d23, d4                   \n"  /* A                */ \
    "vorr       d0, d20, d21                   \n"  /* BG               */ \
    "vorr       d1, d22, d23                   \n"  /* RA               */ \
    "vzip.u8    d0, d1                         \n"  /* BGRA             */
If done with vsri:
#define ARGBTOARGB4444                                                     \
    "vshll.u8    q0, d23, #8                   \n"  /* A                */ \
    "vshll.u8    q8, d22, #8                   \n"  /* R                */ \
    "vshll.u8    q9, d21, #8                   \n"  /* G                */ \
    "vshll.u8    q10, d20, #8                  \n"  /* B                */ \
    "vsri.16     q0, q8, #4                    \n"  /* AR               */ \
    "vsri.16     q0, q9, #8                    \n"  /* ARG              */ \
    "vsri.16     q0, q10, #12                  \n"  /* ARGB             */

But it could be done on 8-bit values:
#define ARGBTOARGB4444                                                     \
    "vsri.8      d23, d22, #4                  \n"  /* AR               */ \
    "vsri.8      d21, d20, #4                  \n"  /* GB               */ \
    "vzip.u8     d21, d23                      \n"  /* ARGB             */ \
    "vmov        d0, d21                       \n"                         \
    "vmov        d1, d23                       \n"

Original comment by fbarch...@google.com on 25 Feb 2016 at 1:25