The uint8_t to int8_t conversion for CU8 to CS8 with an offset of -127 wraps the raw value 255 around to -128 on the uint/int conversion (raw 254 maps to +127). This wrap-around is undesirable. Measuring the sample bias on one of my RTL dongles gives 127.26, so I take the 127.4 used in the other conversions here to be a good estimate of the mean, and I guess that's where the -127 offset comes from. The bias can't be fully removed in 8-bit arithmetic anyway, and a +0.4 residual bias combined with wrap-around of the maximum positive samples is hardly better than a -0.6 bias.
I also looked at treating the input data as true offset-binary format (the offset is then incidentally -128) and using XOR 0x80 to speed things up.
But both variants get vectorized (SIMD) with -O3 (it's movdqu+paddb vs. movups+xorps), so the code is already optimal as is.