TCU threads in compiler and emulator

petrohi commented 2 years ago

Very big PR that adds threading support to the compiler and the emulator (formerly the golden processor). The key new components are lir.Parallelizer and lir.Sequencer. lir.Parallelizer is used by the backend in the writeSegments phase to interlace LIR from the init-load-save and compute segments. It does this by looking at a moving window of 3 partitions and taking the save segment from the first, the compute segment from the second, and the init-load segment from the third. lir.Sequencer is used by the emulator to convert the program LIR into a sequence of LIR events, with one read event and one write event for every instruction. This makes it possible to prove program correctness with respect to overlapping instructions caused by concurrency between threads. Many of the other new components result from the overall backend refactoring and the introduction of the LIR trait and the LIR parser.
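The moving window described above can be illustrated with a minimal Python sketch. This is not the lir.Parallelizer code: the `Partition` type, the padding of the partition list at both ends, and the simple alternation between a "memory" stream and a "compute" stream are all assumptions made for illustration only.

```python
from itertools import zip_longest
from typing import List, NamedTuple, Optional, Tuple


class Partition(NamedTuple):
    # Hypothetical container: each segment is just a list of instruction names.
    init_load: List[str]
    compute: List[str]
    save: List[str]


Window = Tuple[Optional[Partition], Optional[Partition], Optional[Partition]]


def sliding_windows(parts: List[Partition]) -> List[Window]:
    # Assumption: pad with empty slots so the first init-load and the last
    # save each get a window of their own.
    padded: List[Optional[Partition]] = [None, None, *parts, None, None]
    return [tuple(padded[i:i + 3]) for i in range(len(parts) + 2)]


def interlace(window: Window) -> List[Tuple[str, str]]:
    first, second, third = window
    # Save comes from the first partition, init-load from the third.
    memory = (first.save if first else []) + (third.init_load if third else [])
    # Compute comes from the second partition.
    compute = second.compute if second else []
    out: List[Tuple[str, str]] = []
    # Assumption: alternate one memory instruction with one compute instruction.
    for mem_op, compute_op in zip_longest(memory, compute):
        if mem_op is not None:
            out.append(("memory", mem_op))
        if compute_op is not None:
            out.append(("compute", compute_op))
    return out


def parallelize(parts: List[Partition]) -> List[Tuple[str, str]]:
    # Concatenate the interlaced instructions of every window in order.
    return [op for w in sliding_windows(parts) for op in interlace(w)]
```

With two partitions, the sketch issues the load of the second partition alongside the compute of the first, and the save of the first alongside the compute of the second.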

This PR also changes the instruction format to include the TID in the header, in a way that keeps all single-threaded and some dual-threaded programs compatible with the non-threaded TCU. The one exception is the Configure instruction, whose opcode has changed, but the compiler does not generate this instruction. Compatibility is achieved by taking the high bit of the opcode to represent the TID. In the single-threaded case the TID is 0 bits long, which turns this high bit into padding. In the dual-threaded case the TID is 1 bit long, so a program issued entirely for thread 0 remains compatible with the non-threaded TCU.
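A minimal sketch of this encoding idea follows. The 4-bit width of the original opcode field and the function name are assumptions for illustration; the PR description does not state the actual field widths, and this is not the compiler's encoder.

```python
# Assumed width of the opcode field in the non-threaded format (hypothetical).
OPCODE_FIELD_BITS = 4


def encode_opcode_field(opcode: int, tid: int, tid_bits: int) -> int:
    """Pack a TID into the high bit of the opcode field.

    tid_bits is 0 for single-threaded and 1 for dual-threaded programs.
    """
    if tid_bits not in (0, 1):
        raise ValueError("this sketch only covers 0-bit and 1-bit TIDs")
    # The high bit is reserved either way: padding when tid_bits == 0,
    # TID when tid_bits == 1.
    opcode_bits = OPCODE_FIELD_BITS - 1
    if not 0 <= opcode < (1 << opcode_bits):
        raise ValueError("opcode does not fit below the reserved high bit")
    if not 0 <= tid < (1 << tid_bits):
        raise ValueError("tid does not fit in tid_bits")
    return (tid << opcode_bits) | opcode
```

Under these assumptions, a single-threaded encoding and a dual-threaded encoding for thread 0 produce the same field value, which is the compatibility property described above. An opcode that needs the high bit no longer fits, which matches the need to change an opcode such as Configure's.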

Additional smaller changes:

- Rename golden processor to emulator;
- Fix TF and ONNX frontends to include missing fused layers in dot graphs;
- Introduce lir.StatsGen to estimate multi-threaded LIR and use it to collect separate Stats in the scheduler (for fused layers) and the backend (for the entire program);
- Adjust predefined architectures to include number_of_threads=1 and thread_queue_depth=8;
- Add missing copyright/license comments.

The following is a summary of the performance improvement of the dual-threaded TCU over the single-threaded TCU for YOLOv4 Tiny 416x416 on the standard Ultra96v2 architecture, for a given number of cycles per vector transfer to and from DRAM (DRAM latency):

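| DRAM latency (cycles per vector) | Single-threaded cycles | Dual-threaded cycles | Improvement |
| --- | --- | --- | --- |
| 1 | 20190954 | 17313606 | 14.25% |
| 2 | 23447341 | 17891353 | 23.70% |
| 4 | 29960115 | 21174266 | 29.33% |
| 8 | 42985663 | 29530242 | 31.30% |
| 16 | 69036759 | 53555314 | 22.42% |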
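The Improvement column is consistent with the relative reduction in cycle count, (single - dual) / single. The short Python check below recomputes it from the cycle counts in the table; the function and variable names are illustrative only.

```python
# Cycle counts copied from the table: latency -> (single-threaded, dual-threaded).
CYCLES = {
    1: (20190954, 17313606),
    2: (23447341, 17891353),
    4: (29960115, 21174266),
    8: (42985663, 29530242),
    16: (69036759, 53555314),
}


def improvement_percent(single: int, dual: int) -> float:
    # Relative reduction in cycles, expressed as a percentage.
    return (single - dual) / single * 100.0


for latency, (single, dual) in CYCLES.items():
    print(f"{latency:>2}: {improvement_percent(single, dual):.2f}%")
```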

Graph of the cycle cost of every parallelized window for YOLOv4 Tiny 416x416 on the standard Ultra96v2 architecture with a DRAM latency of 8:

shortcut-integration[bot] commented 2 years ago

This pull request has been linked to Shortcut Story #441: TCU threads in compiler and emulator.

CLAassistant commented 2 years ago

All committers have signed the CLA.

tensil-ai / tensil

TCU threads in compiler and emulator #41