tensil-ai / tensil

Open source machine learning accelerators
https://www.tensil.ai

TCU threads in compiler and emulator #41

Closed petrohi closed 2 years ago

petrohi commented 2 years ago

This is a very large PR that adds threading support to the compiler and the emulator (formerly the golden processor). The key new components are lir.Parallelizer and lir.Sequencer. The backend uses lir.Parallelizer in the writeSegments phase to interlace LIR from the init-load-save and compute segments. It does this by looking at a moving window of 3 partitions and taking the save segment from the first, the compute segment from the second, and the init-load segment from the third. The emulator uses lir.Sequencer to convert the program LIR into a sequence of LIR events, with one read event and one write event for every instruction. This makes it possible to prove program correctness with respect to overlapping instructions caused by concurrency between threads. Many of the other new components result from an overall backend refactoring and from the introduction of the LIR trait and the LIR parser.
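As a rough illustration of the moving window described above, the following Python sketch slides a window over a list of partitions and picks the save segment from the first, the compute segment from the second, and the init-load segment from the third. The `Partition` type, the `parallel_windows` function, and the string "instructions" are hypothetical names used only for illustration. The actual lir.Parallelizer is part of the Scala backend and also decides how the selected segments are interlaced, which this sketch does not attempt to reproduce. Padding both ends of the list so that every segment is emitted exactly once is likewise an assumption.

```python
from typing import List, NamedTuple, Tuple


class Partition(NamedTuple):
    """Hypothetical stand-in for one partition's three LIR segments."""
    init_load: List[str]
    compute: List[str]
    save: List[str]


Window = Tuple[List[str], List[str], List[str]]  # (save, compute, init_load)


def parallel_windows(partitions: List[Partition]) -> List[Window]:
    """Slide a window of 3 partitions and pick one segment from each.

    For a window starting at index `first`:
      save      comes from partition `first`
      compute   comes from partition `first + 1`
      init-load comes from partition `first + 2`
    Starting at -2 and stopping at n - 1 (an assumption of this sketch)
    ensures every segment of every partition appears exactly once.
    """
    n = len(partitions)
    windows: List[Window] = []
    for first in range(-2, n):
        second, third = first + 1, first + 2
        save = partitions[first].save if 0 <= first < n else []
        compute = partitions[second].compute if 0 <= second < n else []
        init_load = partitions[third].init_load if 0 <= third < n else []
        windows.append((save, compute, init_load))
    return windows


# Three toy partitions whose segments are labelled by partition index.
example = [
    Partition([f"load{i}"], [f"compute{i}"], [f"save{i}"]) for i in range(3)
]
example_windows = parallel_windows(example)
```

In this sketch a partition's init-load, compute, and save segments always land in three consecutive windows, so the per-partition order is preserved while each window can overlap data movement for two partitions with compute for a third.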

This PR also changes the instruction format to include the TID in the header, in a way that keeps all single-threaded and some dual-threaded programs compatible with the non-threaded TCU. The one exception is the Configure instruction, whose opcode has changed, but the compiler does not generate this instruction. Compatibility is achieved by taking the high-order bit of the opcode to represent the TID. In the single-threaded case the TID is 0 bits long, which turns this bit into padding. In the dual-threaded case the TID is 1 bit long, so a program issued entirely for thread 0 remains compatible with the non-threaded TCU.
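The compatibility argument can be sketched in a few lines of Python. The field widths below (an 8-bit header with a 1-bit TID slot, a 3-bit opcode, and 4 flag bits) and the `encode_header` helper are assumptions made for illustration and are not taken from this PR. The point is only that a header encoded for thread 0 in the dual-threaded layout is bit-for-bit identical to the single-threaded header, where the same bit is zero padding.

```python
# Assumed layout for illustration only (most significant bit first):
#   [ TID or padding : 1 bit ][ opcode : 3 bits ][ flags : 4 bits ]
FLAGS_BITS = 4      # assumption
OPCODE_BITS = 3     # assumption: opcode after giving up its high-order bit
TID_SLOT_BITS = 1   # the bit taken from the opcode


def encode_header(opcode: int, flags: int, tid: int = 0, tid_bits: int = 0) -> int:
    """Pack a header byte; `tid_bits` is 0 (single-threaded) or 1 (dual-threaded)."""
    if not 0 <= tid_bits <= TID_SLOT_BITS:
        raise ValueError("tid_bits must be 0 or 1 in this sketch")
    if not 0 <= tid < (1 << tid_bits):
        # With tid_bits == 0 the only representable TID is 0 (pure padding).
        raise ValueError("tid does not fit in tid_bits")
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in OPCODE_BITS")
    if not 0 <= flags < (1 << FLAGS_BITS):
        raise ValueError("flags do not fit in FLAGS_BITS")
    return (tid << (OPCODE_BITS + FLAGS_BITS)) | (opcode << FLAGS_BITS) | flags


# Same instruction encoded for a single-threaded TCU and for thread 0 and
# thread 1 of a dual-threaded TCU.
single = encode_header(opcode=0b010, flags=0b0001)
dual_t0 = encode_header(opcode=0b010, flags=0b0001, tid=0, tid_bits=1)
dual_t1 = encode_header(opcode=0b010, flags=0b0001, tid=1, tid_bits=1)
```

Here `single` and `dual_t0` are the same byte, while `dual_t1` differs only in the top bit, which is why only dual-threaded programs that stay on thread 0 can run unchanged on a non-threaded TCU.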

Additional smaller changes:

The following table summarizes the performance improvement from single- to dual-threaded TCU for YOLOv4 Tiny 416x416 on the standard Ultra96v2 architecture, for a given number of cycles per vector transfer to and from DRAM (DRAM latency):

| DRAM latency | Single-threaded cycles | Dual-threaded cycles | Improvement |
|---|---|---|---|
| 1 | 20190954 | 17313606 | 14.25% |
| 2 | 23447341 | 17891353 | 23.70% |
| 4 | 29960115 | 21174266 | 29.33% |
| 8 | 42985663 | 29530242 | 31.30% |
| 16 | 69036759 | 53555314 | 22.42% |
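The improvement figures are consistent with the relative reduction in cycle count, (single - dual) / single. That formula is inferred from the numbers and is not stated in the PR. A small Python check under that assumption, with the hypothetical names `CYCLES` and `improvement_percent`:

```python
# DRAM latency -> (single-threaded cycles, dual-threaded cycles), from the table.
CYCLES = {
    1: (20190954, 17313606),
    2: (23447341, 17891353),
    4: (29960115, 21174266),
    8: (42985663, 29530242),
    16: (69036759, 53555314),
}


def improvement_percent(single: int, dual: int) -> float:
    """Relative cycle reduction in percent (assumed definition of 'Improvement')."""
    return 100.0 * (single - dual) / single


improvements = {
    latency: f"{improvement_percent(single, dual):.2f}%"
    for latency, (single, dual) in CYCLES.items()
}

for latency, value in improvements.items():
    print(f"DRAM latency {latency:>2}: {value}")
```

The gain peaks at a DRAM latency of 8 and drops again at 16.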

Graph of the cycle cost of every parallelized window for YOLOv4 Tiny 416x416 on the standard Ultra96v2 architecture at a DRAM latency of 8:

image

shortcut-integration[bot] commented 2 years ago

This pull request has been linked to Shortcut Story #441: TCU threads in compiler and emulator.

CLAassistant commented 2 years ago

CLA assistant check
All committers have signed the CLA.