Closed SkBlaz closed 1 year ago
Wow, quite a catch... well done! And I suspect it might be very beneficial for reducing the hogwild cache-coherence bottleneck. Did you test it on a dataset and compare the output weights, just to check that they are completely identical? That would be a good enough regression test for me to merge this.
@bbenshalom already ran a bunch of benchmarks; I will run the weight-parity test too, good idea.
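A rough sketch of what I have in mind for the parity check (a toy stand-in, not this project's actual training loop or API): run the same seeded training twice, once with the skip-zero-gradient path enabled, and require bit-for-bit identical weights. Note that the random draws happen before the skip, so the RNG state stays in sync between the two runs.

```python
import numpy as np

def toy_train(skip_zero_grad, steps=100, seed=42):
    """Toy seeded SGD loop; `skip_zero_grad` toggles the optimization under test."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((2, 4))
    for _ in range(steps):
        x = rng.standard_normal(4)
        # zero error vectors made artificially common here, mirroring
        # the statistics observed in the grad dumps
        err = np.zeros(2) if rng.random() < 0.4 else rng.standard_normal(2)
        if skip_zero_grad and not err.any():
            continue  # the update would subtract an all-zero gradient anyway
        W -= 0.01 * np.outer(err, x)
    return W

# bit-for-bit equality, not just np.allclose, is the regression criterion
assert np.array_equal(toy_train(False), toy_train(True))
```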
AWESOME!
It turns out the individual neuron updates (the part after the sgemv) are quite computationally heavy. While profiling, I identified a corner case in which quite a few operations are completely redundant: the scenario where the overall gradient is zero. Based on statistics from gradient dumps, this turns out to be surprisingly common, which makes it an interesting optimization opportunity.
By skipping updates where everything would be zeroed out anyway, this yields a consistent 1.43x speedup with no loss in predictive performance for a regular two-hidden-layer architecture (75-150).
In theory this effect should be even more pronounced for deeper nets, which turned out to be the case: for a 150-300 network, the speedup is 2.4x.
To probe the scaling behavior a bit further, I also ran an unrealistically large 300-600 net; the speedup there was 4.1x.
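For context, the skip itself is tiny; a minimal sketch (function and parameter names are illustrative, not the real codebase) of the per-neuron update after the sgemv step, where a row whose error term is exactly zero contributes nothing and can be bypassed:

```python
import numpy as np

def update_neurons(W, delta, activations, lr=0.01):
    """Per-neuron weight update following the sgemv backprop step.

    Rows of `delta` that are exactly zero would subtract an all-zero
    gradient, so they are skipped outright without changing the result.
    """
    for i, d in enumerate(delta):
        if d == 0.0:  # overall gradient is zero -> the update is a no-op
            continue
        W[i] -= lr * d * activations
    return W
```

The skipped path is numerically a no-op (`W[i] - 0.0 == W[i]`), which is why the output weights stay completely identical to the unoptimized version.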