Micro-optimizations of array reference access. Such optimizations should be used sparingly, of course, but this loop seemed to be one of the places where they are justified.
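To illustrate the kind of change meant here, below is a minimal sketch of hoisting an array reference out of a hot loop. The class `RowDotProduct`, its methods, and the data are hypothetical and are not taken from the actual code; the JIT compiler may well perform this hoisting on its own, so a change like this is only worth keeping if a benchmark shows a difference.

```java
// Hypothetical example: the names and data are illustrative, not from the real project.
final class RowDotProduct {

    // Straightforward version: weights[row] is dereferenced on every iteration.
    static double dotNaive(double[][] weights, int row, double[] input) {
        double sum = 0.0;
        for (int i = 0; i < input.length; i++) {
            sum += weights[row][i] * input[i];
        }
        return sum;
    }

    // Hoisted version: the row reference and the loop bound are read once
    // into locals, so the inner loop only touches two one-dimensional arrays.
    static double dotHoisted(double[][] weights, int row, double[] input) {
        final double[] w = weights[row];
        final int n = input.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += w[i] * input[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] weights = {{0.5, -1.0, 2.0}, {1.0, 1.0, 1.0}};
        double[] input = {2.0, 3.0, 4.0};
        // Both versions compute the same value; only the access pattern differs.
        System.out.println(dotNaive(weights, 0, input));
        System.out.println(dotHoisted(weights, 0, input));
    }
}
```

Both methods return the same result, so the hoisted variant can be swapped in without changing behavior.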
I noticed that the number of worker threads was set using the "number of processors + 1" rule. In this case I think it's more fitting to set the number of threads equal to the number of CPUs, since the work is CPU-bound: that way we avoid some context switches and keep the data in the CPU caches longer.
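A minimal sketch of sizing a pool this way, assuming the workers run on a standard `ExecutorService`. The class `WorkerPoolSizing` and the toy task are hypothetical. Note that `availableProcessors()` reports logical processors, which may include hyper-threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: the class name and the workload are illustrative only.
final class WorkerPoolSizing {

    // One worker per logical processor, instead of processors + 1.
    static int workerCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Runs a small CPU-bound job on a pool sized with workerCount().
    static long sumOfSquares(int taskCount) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount());
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final long n = i;
                tasks.add(() -> n * n);
            }
            long total = 0;
            // invokeAll blocks until every task has completed.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("workers: " + workerCount());
        System.out.println("sum of squares: " + sumOfSquares(10));
    }
}
```

For workloads that block on I/O, the extra thread from the "+ 1" rule can still help, so this sizing is best treated as a starting point to measure against.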
Replaced Math.tanh() with FastMath.tanh() from Apache Commons Math. This adds a new dependency, but since the library is quite likely to be useful in other places as well, that seems an acceptable cost.
Made three different optimizations: