Closed MichaEiler closed 3 years ago
Sorry about the delay - I'll merge that but is that really the most efficient way of dealing with it in arm64-land? zip/unzip made life more plausible in neon-land and whilst they have slightly different src/dest semantics they still do the right sort of thing.
I looked through the reference documentation: https://developer.arm.com/documentation/dui0802/b/A64-SIMD-Vector-Instructions
Unfortunately nothing looked like it could be used to write a cleaner implementation than zip/unzip. Do you have anything particular in your mind? I'm open to suggestions!
Best, Michael
Yeah - OK - some of my comment was due to unfamiliarity with arm64 and misreading your code. I was surprised by the lack of UZPn/ZIPn instructions and hadn't quite worked out you were doing them via memory. Either way, if it works - hurrah!
As promised, here is the remaining function implemented in arm64 assembly: ff_rpi_sand30_lines_to_planar_c16
In my local tests with a 10-bit, 3840x2160p video file I got around 16-18 fps. These results did not change when I used a test video with a resolution of 3836x2160p instead (= incomplete blocks at the row end). I used the same trick I already explained in the last pull request for the luma conversion.
I also found a small issue in the ff_rpi_sand30_lines_to_planar_y16 function, which used the wrong src address register when writing the last few pixels of an image. This could result in garbage being written if the width was not a multiple of 96.
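For context, here is a rough scalar C sketch of the tail case described above. It is not the code from this pull request. It assumes three 10-bit samples packed low bits first into each 32-bit word, and stripes of 32 words (96 pixels), which would explain the multiple-of-96 condition. The helper name unpack_row_y16 and the stripe_words parameter are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 32 words per stripe row, 3 samples per word = 96 pixels. */
enum { PIXELS_PER_STRIPE = 96 };

/* Hypothetical helper: unpack one row of `width` 10-bit samples to 16 bit.
 * `src` points at the row's first word in stripe 0; `stripe_words` is the
 * assumed distance between successive stripes, counted in 32-bit words. */
static void unpack_row_y16(uint16_t *dst, const uint32_t *src,
                           size_t stripe_words, unsigned width)
{
    while (width > 0) {
        /* Always derive the read pointer from the current stripe base,
         * including for the final, partial stripe. */
        const uint32_t *p = src;
        unsigned n = width < PIXELS_PER_STRIPE ? width : PIXELS_PER_STRIPE;
        width -= n;

        /* Full words: three samples each. */
        for (; n >= 3; n -= 3) {
            uint32_t w = *p++;
            *dst++ = w & 0x3ff;
            *dst++ = (w >> 10) & 0x3ff;
            *dst++ = (w >> 20) & 0x3ff;
        }
        /* Tail: one or two samples left in the last word of the row. */
        if (n > 0) {
            uint32_t w = *p;
            *dst++ = w & 0x3ff;
            if (n > 1)
                *dst++ = (w >> 10) & 0x3ff;
        }
        src += stripe_words;
    }
}
```

The point of the sketch is only that the tail reads must come from the same source pointer as the full blocks; reading from a stale or different register at that step is the kind of mistake that produces garbage in the last few pixels.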
Note about the implementation for the chroma conversion: I used the stack to store intermediate results, but in my local tests the performance of the chroma conversion was still about twice as fast as the luma conversion. Therefore I assume that the additional memory operations didn't affect the total performance significantly (maybe some cache is utilised effectively?).
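For readers unfamiliar with the chroma case, the following scalar C sketch shows the de-interleave that an unzip-style operation would perform. It is an illustration under assumptions, not the assembly from this pull request: it assumes U and V samples alternate (U first) within the same three-samples-per-word packing, and it ignores the stripe layout. The name unpack_row_c16 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split `n_samples` interleaved 10-bit chroma samples
 * (assumed order U, V, U, V, ...) into separate 16-bit U and V rows.
 * Assumes three samples per 32-bit word, lowest bits first. */
static void unpack_row_c16(uint16_t *dst_u, uint16_t *dst_v,
                           const uint32_t *src, size_t n_samples)
{
    for (size_t i = 0; i < n_samples; i++) {
        uint32_t word = src[i / 3];
        uint16_t s = (uint16_t)((word >> (10 * (i % 3))) & 0x3ff);
        if (i & 1)
            *dst_v++ = s;   /* odd positions: V */
        else
            *dst_u++ = s;   /* even positions: U */
    }
}
```

Because a word holds an odd number of samples, the U/V pattern only repeats every two words, which is one reason a vectorised version needs some shuffling or intermediate storage before the two planes can be written out.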