As noted in #54, switching to an in-place gradient calculation could improve performance by avoiding array allocations.
Right now, profiling shows that for the rotation of a 62x24 matrix, ~10% of the time is spent on array allocation (in both the gradient calculation and the projection).
But if in the long run you consider switching to a unified in-place `_criterion_andgradient!(grad::Union{AbstractVector, Nothing}, ...)`, which would skip the gradient calculation if `grad === nothing`, this PR would be irrelevant.
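For illustration, a unified in-place function of the kind mentioned here might look roughly like the sketch below. The function name, the `AbstractArray` argument types, and the toy quartic criterion are assumptions made up for this example; they are not the package's actual API or any of its rotation criteria.

```julia
# Hypothetical sketch: the name, signature, and toy criterion are illustrative only.
# Returns the criterion value; fills `grad` in place unless `grad === nothing`.
function criterion_and_gradient!(grad::Union{AbstractArray, Nothing}, Λ::AbstractArray)
    if grad !== nothing
        # Fused broadcast writes into the preallocated buffer, so no new array is allocated.
        @. grad = -Λ^3
    end
    # Toy criterion Q(Λ) = -sum(Λ.^4) / 4, computed without a temporary array.
    return -sum(x -> x^4, Λ) / 4
end

Λ = randn(62, 24)
grad = similar(Λ)                               # allocated once, reused across iterations
q = criterion_and_gradient!(grad, Λ)            # criterion and gradient
q_only = criterion_and_gradient!(nothing, Λ)    # criterion only, gradient skipped
```

With this shape, the caller owns the gradient buffer and can reuse it on every iteration, and passing `nothing` skips the gradient work entirely when only the criterion value is needed.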