Closed SkBlaz closed 1 year ago
Wow, quite a catch... well done! And I suspect it might be very beneficial for reducing the hogwild cache-coherence bottleneck. Did you test it on a dataset and compare the output weights, just to check that they are completely identical? That would be a good enough regression test for me to merge this.
@bbenshalom already ran a bunch of benchmarks; I will run the weight-parity test too, good idea.
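A rough sketch of what I have in mind for the parity check (a toy stand-in, not this project's actual training loop or API): run the same seeded training twice, once with the skip-zero-gradient path enabled, and require bit-for-bit identical weights. Note that the random draws happen before the skip, so the RNG state stays in sync between the two runs.

```python
import numpy as np

def toy_train(skip_zero_grad, steps=100, seed=42):
    """Toy seeded SGD loop; `skip_zero_grad` toggles the optimization under test."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((2, 4))
    for _ in range(steps):
        x = rng.standard_normal(4)
        # zero error vectors made artificially common here, mirroring
        # the statistics observed in the grad dumps
        err = np.zeros(2) if rng.random() < 0.4 else rng.standard_normal(2)
        if skip_zero_grad and not err.any():
            continue  # the update would subtract an all-zero gradient anyway
        W -= 0.01 * np.outer(err, x)
    return W

# bit-for-bit equality, not just np.allclose, is the regression criterion
assert np.array_equal(toy_train(False), toy_train(True))
```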
AWESOME!
It turns out the individual neuron updates (the part after the sgemv) are quite computationally heavy. While profiling, I identified a corner case in which quite a few operations are completely redundant: the scenario where the overall gradient is zero. Based on statistics from gradient dumps, this turns out to be surprisingly common, which makes it an interesting optimization opportunity.
By skipping updates where everything would be zeroed out anyway, this yields a consistent 1.43x speedup with no loss in predictive performance for a regular two-hidden-layer architecture (75-150).
In theory this effect should be even more pronounced for deeper nets, which turned out to be the case: for a 150-300 network, the speedup is 2.4x.
To probe the scaling behavior a bit further, I also ran an unrealistically large 300-600 net; the speedup there was 4.1x.
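For context, the skip itself is tiny; a minimal sketch (function and parameter names are illustrative, not the real codebase) of the per-neuron update after the sgemv step, where a row whose error term is exactly zero contributes nothing and can be bypassed:

```python
import numpy as np

def update_neurons(W, delta, activations, lr=0.01):
    """Per-neuron weight update following the sgemv backprop step.

    Rows of `delta` that are exactly zero would subtract an all-zero
    gradient, so they are skipped outright without changing the result.
    """
    for i, d in enumerate(delta):
        if d == 0.0:  # overall gradient is zero -> the update is a no-op
            continue
        W[i] -= lr * d * activations
    return W
```

The skipped path is numerically a no-op (`W[i] - 0.0 == W[i]`), which is why the output weights stay completely identical to the unoptimized version.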