tensil-ai / tensil

Open source machine learning accelerators
https://www.tensil.ai
Other
348 stars 28 forks source link

Fix matmul acc delay with lock #55

Closed tdb-alcorn closed 2 years ago

tdb-alcorn commented 2 years ago

This change fixes a bug where matmuls and accs would wait on each other instead of proceeding in parallel. The problem was that the control queues feeding them were implicitly linked, meaning that one could not get too far ahead of the other. We add buffers in these places to cut dependencies between buses fed by the same control inputs. In this case the buffer size was sometimes set to 2x the array size to account for the fact that accumulate operations can't begin until data has propagated through the array.

To prevent concurrency bugs from arising due to out of order execution, we also add a lock pool module that manages a set of mutexes that are shared by a number of actors. In our case, the mutexes are locks tied to memory regions (2 locks total) and the actors are the memory ports. The decoder acquires locks when it sends off memory requests, and the locks are automatically released when the appropriate condition is met - in this case when the memory request that needs to lock is actually passed into the memory module.

This change also required upgrading to Chisel 3.5.3 to get around some odd bugs that seemed to be due to the cloneType methods, which are no longer necessary in this version. To effect the upgrade, we had to fix some deprecation and naming issues, such as removing unnecessary parentheses from .fire, .isLit, or moving the verification statements out of the experimental package. We also had to update tests to reflect the new API of scalatest in the latest version (cosmetic differences only).

The meat of the change is in the LockPool module so please start there when reviewing.

Closing #52 in favour of this new PR

Closes #15

petrohi commented 2 years ago

There is Verilator compilation issue with ChiselTest 0.5.3 preventing proper testing with simulation. We reached to Chisel Gitter for help. In absence of simulation we tested on FPGA and confirmed correctness.

On FPGA we observed significant performance improvement (see line D):

  ResNet20 CIFAR ONNX YOLOv4 Tiny @416 ONNX ResNet50 ImageNet ONNX
A. ZCU104 baseline (ms) 13.099 294.071 514.279
B. A and external AXI4S Width Converter instead of Transmisson 11.168 262.804 473.956
C. B and two-cycle read/write fix 9.661 214.892 415.953
D. C and fix for delay between ops 7.387 120.748 256.987