This is a large PR that adds threading support to the compiler and the emulator (formerly the golden processor). The key new components are `lir.Parallelizer` and `lir.Sequencer`.

`lir.Parallelizer` is used by the backend in the `writeSegments` phase to interleave LIR from the init-load-save and compute segments. It does this by looking at a moving window of 3 partitions and taking the save segment from the first, the compute segment from the second, and the init-load segment from the third.

`lir.Sequencer` is used by the emulator to convert the program LIR into a sequence of LIR events, with one read event and one write event for every instruction. This makes it possible to prove program correctness with respect to overlapping instructions caused by concurrency between threads.

Many of the other new components result from an overall backend refactoring and the introduction of the LIR trait and LIR parser.
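The moving window used by the parallelizer can be illustrated with a minimal Python sketch. This is not the code from this PR: `Partition`, `Window`, and `sliding_windows` are hypothetical names, segments are modeled as plain lists of strings, and the handling of the first and last partitions (letting the window run off both ends so that every segment is emitted exactly once) is an assumption. How the segments inside one window are distributed across threads is not modeled.

```python
from typing import List, NamedTuple, Optional


class Partition(NamedTuple):
    """Hypothetical stand-in for one partition's LIR segments."""
    init_load: List[str]
    compute: List[str]
    save: List[str]


class Window(NamedTuple):
    """Segments picked from a window of 3 consecutive partitions."""
    save: Optional[List[str]]       # from the first partition in the window
    compute: Optional[List[str]]    # from the second partition
    init_load: Optional[List[str]]  # from the third partition


def sliding_windows(partitions: List[Partition]) -> List[Window]:
    """Slide a 3-partition window over the program.

    Assumption: the window starts two positions before the first
    partition and ends on the last one, with missing partitions
    yielding None, so each segment appears in exactly one window.
    """
    n = len(partitions)
    if n == 0:
        return []

    def get(i: int) -> Optional[Partition]:
        return partitions[i] if 0 <= i < n else None

    result = []
    for k in range(-2, n):
        first, second, third = get(k), get(k + 1), get(k + 2)
        result.append(Window(
            save=first.save if first else None,
            compute=second.compute if second else None,
            init_load=third.init_load if third else None,
        ))
    return result
```

Under these assumptions, partition `i` has its init-load segment emitted in window `i`, its compute segment in window `i + 1`, and its save segment in window `i + 2`, so the per-partition order init-load, compute, save is preserved while neighboring partitions overlap.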
This PR also changes the instruction format to include the TID in the header, in a way that keeps all single-threaded and some dual-threaded programs compatible with the non-threaded TCU. The exception is the `Configure` instruction, whose opcode has changed, but the compiler does not generate this instruction. Compatibility is achieved by taking the high bit of the opcode to represent the TID. In the single-threaded case the TID is 0 bits long, which turns this high bit into padding. In the dual-threaded case the TID is 1 bit long, so a program issued entirely for thread 0 remains compatible with the non-threaded TCU.
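The compatibility argument can be checked with a small Python sketch. The field widths here are assumptions chosen for illustration (an 8-bit header with a 4-bit legacy opcode and 4 flag bits); the PR text only states that the high opcode bit is repurposed for the TID. `encode_header` and `encode_legacy_header` are hypothetical helpers, not functions from the codebase.

```python
HEADER_BITS = 8          # assumed header width
LEGACY_OPCODE_BITS = 4   # assumed opcode width before this change
FLAGS_BITS = 4           # assumed flags width


def encode_legacy_header(opcode: int, flags: int) -> int:
    """Header as a non-threaded TCU would expect it (assumed layout)."""
    return (opcode << FLAGS_BITS) | flags


def encode_header(opcode: int, flags: int, tid: int, tid_bits: int) -> int:
    """Header with the high opcode bit given up for the TID.

    tid_bits == 0: the high bit is padding and is always 0.
    tid_bits == 1: the high bit carries the thread ID.
    """
    if tid_bits not in (0, 1):
        raise ValueError("this sketch covers 0 or 1 TID bits only")
    opcode_bits = LEGACY_OPCODE_BITS - 1  # one bit fewer for the opcode
    if not 0 <= opcode < (1 << opcode_bits):
        raise ValueError("opcode does not fit in the reduced opcode field")
    if not 0 <= flags < (1 << FLAGS_BITS):
        raise ValueError("flags out of range")
    if not 0 <= tid < (1 << tid_bits):
        raise ValueError("tid out of range for the given tid_bits")
    return (tid << (HEADER_BITS - 1)) | (opcode << FLAGS_BITS) | flags
```

With these widths, `encode_header(0x2, 0x5, tid=0, tid_bits=0)` and `encode_header(0x2, 0x5, tid=0, tid_bits=1)` both equal `encode_legacy_header(0x2, 0x5)`, while thread 1 sets the top bit. Any legacy opcode that already used the top bit would no longer fit in the reduced field, which is one plausible reading of why an opcode had to change, though the PR does not spell this out.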
Additional smaller changes:
- Rename golden processor to emulator;
- Fix the TF and ONNX frontends to include missing fused layers in dot graphs;
- Introduce `lir.StatsGen` to estimate multi-threaded LIR and use it to collect separate `Stats` in the scheduler (for fused layers) and the backend (for the entire program);
- Adjust predefined architectures to include `number_of_threads=1` and `thread_queue_depth=8`;
- Add missing copyright/license comments.
The following table summarizes the performance improvement of the dual-threaded TCU over the single-threaded TCU for YOLOv4 Tiny 416x416 on the standard Ultra96v2 architecture, for a given number of cycles per vector transfer to and from DRAM (DRAM latency):
| DRAM latency | Single-threaded cycles | Dual-threaded cycles | Improvement |
| --- | --- | --- | --- |
| 1 | 20190954 | 17313606 | 14.25% |
| 2 | 23447341 | 17891353 | 23.70% |
| 4 | 29960115 | 21174266 | 29.33% |
| 8 | 42985663 | 29530242 | 31.30% |
| 16 | 69036759 | 53555314 | 22.42% |
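The Improvement column is consistent with the relative reduction in cycle count, (single - dual) / single. That formula is inferred from the numbers rather than stated in the PR; the short Python check below reproduces the column from the cycle counts under that assumption.

```python
# Cycle counts copied from the results above: latency -> (single, dual).
RESULTS = {
    1: (20190954, 17313606),
    2: (23447341, 17891353),
    4: (29960115, 21174266),
    8: (42985663, 29530242),
    16: (69036759, 53555314),
}


def improvement_percent(single: int, dual: int) -> float:
    """Relative cycle reduction of dual- over single-threaded, in percent.

    Assumed definition: (single - dual) / single * 100.
    """
    return (single - dual) / single * 100.0


# Formatted to two decimals, as in the reported column.
improvements = {
    latency: f"{improvement_percent(single, dual):.2f}%"
    for latency, (single, dual) in RESULTS.items()
}

for latency, value in improvements.items():
    print(f"DRAM latency {latency:>2}: {value}")
```

The gain grows with DRAM latency up to 8 cycles per vector and then drops at 16.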
Graph of cycle-cost of every parallelized window for YOLOv4 Tiny 416x416 on standard Ultra96v2 for DRAM latency of 8: