This reduces the number of outer loop iterations from 32 to 4, at the cost of a modest increase in storage, so it should give a good speedup. However, I wonder whether the extra 253 bytes of storage could be a problem for embedded targets. Maybe we need a flag or an overload that selects the other strategy instead?
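To make the flag idea concrete, here is a rough sketch of what a compile-time switch between the two strategies might look like. Everything in it is an assumption for illustration: it uses a bit count over a 32-bit word as a stand-in for the actual operation in this change, the names (`popcount_bitwise`, `popcount_bytewise`, `popcount32`) are hypothetical, and `SMALL_FOOTPRINT` is a made-up macro rather than an existing build option. The table sizes here also will not match the 253-byte figure exactly.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in operation (assumption): count the set bits of a 32-bit word.

// Small-footprint strategy: one bit per iteration, 32 iterations, no table.
unsigned popcount_bitwise(std::uint32_t x)
{
    unsigned count = 0;
    for (int i = 0; i < 32; ++i) {
        count += (x >> i) & 1u;
    }
    return count;
}

// 256-entry lookup table, built once on first use.
// On a real embedded target this would more likely be a const table in ROM.
const std::array<std::uint8_t, 256>& byte_table()
{
    static const std::array<std::uint8_t, 256> table = [] {
        std::array<std::uint8_t, 256> t{};
        for (std::size_t i = 0; i < t.size(); ++i) {
            t[i] = static_cast<std::uint8_t>(
                popcount_bitwise(static_cast<std::uint32_t>(i)));
        }
        return t;
    }();
    return table;
}

// Table strategy: one byte per iteration, 4 iterations, 256 bytes of table.
unsigned popcount_bytewise(std::uint32_t x)
{
    const auto& table = byte_table();
    unsigned count = 0;
    for (int i = 0; i < 4; ++i) {
        count += table[(x >> (8 * i)) & 0xFFu];
    }
    return count;
}

// Hypothetical flag: define SMALL_FOOTPRINT to avoid the table entirely.
unsigned popcount32(std::uint32_t x)
{
#ifdef SMALL_FOOTPRINT
    return popcount_bitwise(x);
#else
    return popcount_bytewise(x);
#endif
}
```

A compile-time flag keeps the table out of the binary altogether when it is not wanted, whereas an overload or a runtime parameter would only help if the linker can discard the unused table.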