Closed AndersTrier closed 6 months ago
Looks like a nice cleanup! btw, added your repo to the README since I assume they are compatible.
wrt unrolling, yeah I settled on 4x since 8x didn't give a significant speedup and didn't want to choke old platforms with limited registers more than needed.
Looks like a nice cleanup! btw, added your repo to the README since I assume they are compatible.
Thanks!
I would have to test wrt. compatibility, as I don't think high rate
and low rate
are compatible, and sometimes you can choose either.
wrt unrolling, yeah I settled on 4x since 8x didn't give a significant speedup and didn't want to choke old platforms with limited registers more than needed.
Did you consider implementing it with SIMD?
I don't know if it's worth it, but I found some code to get you started if you want to give it a try ;) https://github.com/catid/leopard/blob/master/docs/vector_fwht_4.txt https://github.com/paritytech/reed-solomon-novelpoly/blob/df906e1ca27dc6c0c1b663f1653f57d4620f03dd/reed-solomon-novelpoly/src/field/inc_log_mul.rs#L118 https://stackoverflow.com/a/54133143
Yeah, it could be done... It could actually be rather effective AFAICT. But it's been too long since I looked at this to judge if/when it would provide a speedup.
When reworking my Rust implementation of leopard16, I realized that the last loop in
fwht
is unreachable.On the last iteration of the outermost loop,
dist4 == order
, after which we update the value ofdist
todist4
, making us unable to enter the body of the last loop.order
is always either 256 or 16384, andOn a related note: I tried to optimize
fwht
further by using 8 values at a time (fwht_8
). It was slightly faster on x86, but slower on ARM.fwht_16
was slower on both architectures.