Computer Architecture - Githubissues

[1] Instruction Set Architectures and Compilers

An Instruction Set Architecture (ISA) is an agreement about how software will communicate with the processor. A typical ISA has the following features:

A flat 32-bit address space
A set of registers available to the programmer.
A program counter register through which instructions are fetched, initialized to some documented value.
A set of external objects that can generate interrupts.

A description of the ISA for a processor answers the following questions:

What instructions are available?
What addressing modes are available?
What is the format of data?
How many and what kind of registers are available?
What condition codes, if any, are defined?
How are exceptions handled?
How are interrupts handled?

What are some things not specified by the ISA?

How fast will a particular instruction go?
How is an instruction implemented?
What are procedure calling conventions?
What are cache replacement policies?
What happens on a page fault?

Types of Instruction Sets

There are three main types of instruction sets:

Stack
Accumulator
General-purpose register

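The first two are now largely historical. At a high level of abstraction, the JVM can be viewed as a stack-based ISA; x86 descends from an old instruction set with accumulator characteristics and can be classified as a special-purpose register machine.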

General-purpose register (GPR) architectures are further divided into two main kinds:

Load/Store (or register-register). Only load and store instructions access main memory. The rest of the instructions act only on registers.
Register-memory. Any instruction may access main memory.

So-called Reduced Instruction Set Computing (RISC) architectures are load/store architectures, for example SPARC, MIPS, Alpha AXP, and PowerPC. Complex Instruction Set Computing (CISC) architectures are usually register-memory architectures, for example VAX, x86, and MC68000.
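
To make the three styles concrete, here is a minimal sketch that computes c = a + b on three toy machines. The opcode names (push, pop, load, store, add) and the tuple-based program format are invented for illustration and do not correspond to any real ISA.

```python
# Toy interpreters for the three instruction-set styles.
# All opcode names and the program format are hypothetical.

def run_stack(program, memory):
    """Stack machine: operands are implicit (top of the stack)."""
    stack = []
    for op, *args in program:
        if op == "push":                 # push memory[name]
            stack.append(memory[args[0]])
        elif op == "add":                # pop two values, push their sum
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "pop":                # pop into memory[name]
            memory[args[0]] = stack.pop()
    return memory

def run_accumulator(program, memory):
    """Accumulator machine: one implicit register, one memory operand."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = memory[name]
        elif op == "add":
            acc += memory[name]
        elif op == "store":
            memory[name] = acc
    return memory

def run_load_store(program, memory):
    """Load/store GPR machine: only load and store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "load":                 # load rd, name
            regs[args[0]] = memory[args[1]]
        elif op == "add":                # add rd, rs1, rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":              # store rs, name
            memory[args[1]] = regs[args[0]]
    return memory

stack_prog = [("push", "a"), ("push", "b"), ("add",), ("pop", "c")]
acc_prog = [("load", "a"), ("add", "b"), ("store", "c")]
gpr_prog = [("load", "r1", "a"), ("load", "r2", "b"),
            ("add", "r3", "r1", "r2"), ("store", "r3", "c")]

for run, prog in [(run_stack, stack_prog),
                  (run_accumulator, acc_prog),
                  (run_load_store, gpr_prog)]:
    print(run.__name__, len(prog), "instructions, c =",
          run(prog, {"a": 2, "b": 3})["c"])
```

The load/store version needs the most instructions here, but each instruction is simple and names its operands explicitly, which is what makes it easy to pipeline.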

Memory Addressing

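Most computer systems divide memory into bytes (8 bits). The ISA determines how these bytes are combined into larger structures (such as 32-bit integers). The relevant aspects are: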

Endianness. Memory can be accessed as either Big Endian or Little Endian.
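Alignment. Some ISAs require memory accesses to be aligned. For example, on Alpha, accessing an 8-byte datum at an address that is not a multiple of 8 raises an exception.
Addressing modes. Addressing mode refers to the way in which a machine instruction accesses memory. Examples: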
- PC-relative addressing. An immediate offset in the instruction is added to the program counter register to yield the effective address.
- Displacement addressing. An immediate offset in the instruction is added to a specified register to yield the effective address.
- Immediate addressing. An immediate value is specified.
- Indirect addressing (in combination with other modes). A first effective address is used to fetch a value from memory. That value is used to form a second effective address.
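
The points above can be sketched in a few lines. This is an illustration only: the helper names (is_aligned, pc_relative, displacement, indirect) and the dictionary-based register file and memory are assumptions made for the example, not part of any ISA.

```python
import struct

# Endianness: the same 32-bit value laid out in memory both ways.
value = 0x11223344
big = struct.pack(">I", value)       # most significant byte first
little = struct.pack("<I", value)    # least significant byte first
print(big.hex(), little.hex())       # prints: 11223344 44332211

def is_aligned(address, size):
    """An access of `size` bytes is aligned if the address is a multiple of size."""
    return address % size == 0

def pc_relative(pc, offset):
    """Effective address = program counter + immediate offset."""
    return pc + offset

def displacement(regs, base, offset):
    """Effective address = named base register + immediate offset."""
    return regs[base] + offset

def indirect(memory, first_address):
    """The word stored at the first effective address becomes the second one."""
    return memory[first_address]

regs = {"r1": 0x1000}                # hypothetical register file
memory = {0x1008: 0x2000}            # hypothetical memory: address -> word

ea = displacement(regs, "r1", 8)
print(hex(ea), is_aligned(ea, 8))    # prints: 0x1008 True
print(hex(indirect(memory, ea)))     # prints: 0x2000
print(is_aligned(0x1004, 8))         # prints: False
```

Immediate addressing has no helper because it computes no address: the operand is the constant encoded in the instruction itself.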

Operand Types

An operand is a value that an instruction operates on. Given the instruction type and the addressing mode, we can specify the operands of an instruction. The relevant operand types are:

Integers. Usually 8-bit (characters), 16-bit (words), 32-bit (doubleword), 64-bit (quadword).
Single and double precision floating point numbers, usually 32-bit and 64-bit respectively.
Binary-coded decimal. A single decimal digit occupies one half of a byte. Sometimes called packed decimal because decimal digits are packed together into bytes.
Strings. Some ISAs support variable-length strings of bytes as a primitive data type in memory.
Vectors of primitive types. Examples: CRAY vector processors, MMX extensions to x86.
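
As an illustration of the packed decimal format described above, here is a minimal sketch that packs two decimal digits per byte. It assumes unsigned values and no sign nibble; real packed-decimal formats often add one. The function names are hypothetical.

```python
def to_packed_bcd(n: int) -> bytes:
    """Pack a non-negative integer, two decimal digits per byte."""
    if n < 0:
        raise ValueError("this sketch handles unsigned values only")
    digits = str(n)
    if len(digits) % 2:              # pad to an even number of digits
        digits = "0" + digits
    return bytes((int(digits[i]) << 4) | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

def from_packed_bcd(data: bytes) -> int:
    """Decode packed BCD back to an integer, rejecting nibbles above 9."""
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("not a valid BCD byte: %#04x" % byte)
        value = value * 100 + high * 10 + low
    return value

print(to_packed_bcd(1234).hex())               # prints: 1234
print(to_packed_bcd(905).hex())                # prints: 0905
print(from_packed_bcd(bytes([0x12, 0x34])))    # prints: 1234
```

The hex dump of a packed BCD value reads like the decimal number itself, which is the main attraction of the format.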

Types of Instructions

Data transfer instructions. E.g., load data from memory into registers, or store data from registers into memory.
Arithmetic and logical instructions. Perform arithmetic (e.g. add, subtract, multiply) and logic (e.g. AND, OR, XOR) as well as a combination of both (less than, greater than, compare).
Control transfer instructions. Instructions that affect the value of the program counter register. Examples: unconditional jump, procedure call, return, conditional branch, indirect jump, software interrupt (e.g. trap).
Floating point instructions. Traditionally, instructions that deal with floating point values are given separate treatment.

Instruction Encoding

How are instruction types, operands, addressing modes, etc. communicated to the hardware? The ISA specifies a binary encoding of instructions.

The assembler encodes programs using this encoding, and the microarchitecture reads and executes the encoded program.

Taking the MIPS instruction set as an example:

Every instruction in the MIPS instruction set is 32 bits long.
The first six bits, bits 31-26, specify an opcode giving information about what that instruction is supposed to do.
MIPS registers and addresses are 64-bit in MIPS64 (32-bit in MIPS32).
MIPS is byte-addressable, requires aligned accesses, and can be switched to either Big Endian or Little Endian.

There are three instruction formats:

I-type instructions. Instructions with immediate operands. rt := rs op immediate.

___________________________________________________________________________
|_6-bit opcode_|_5-bit_rs_|_5-bit_rt_|________16-bit_immediate______________|

R-type instructions. Register-register arithmetic and logic instructions.

___________________________________________________________________________
|_6-bit opcode_|_5-bit_rs_|_5-bit_rt_|_5-bit_rd_|_5-bit_shamt_|_6-bit_funct_|

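J-type instructions. Jump (j) and jump-and-link (jal). The 26-bit field is a word index, not a signed PC offset: it is shifted left by two bits and combined with the upper four bits of the address of the following instruction (PC + 4) to form the target. Conditional branches use the I-type format, with a 16-bit PC-relative offset.

___________________________________________________________________________
|_6-bit opcode_|_____________26-bit target (word index)_____________________|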
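
The R-type and I-type layouts above can be exercised with a small encoder and decoder. This is a sketch, not an assembler: the function names are hypothetical, and the register numbers ($t0 = 8, $t1 = 9, $t2 = 10), the add function code (0x20), and the addi opcode (0x08) are taken from the standard MIPS32 encoding.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """Pack the six R-type fields into one 32-bit word."""
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)

def encode_i(opcode, rs, rt, imm):
    """Pack the four I-type fields; the immediate is truncated to 16 bits."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def decode_fields(word):
    """Split a 32-bit word into every field; which ones apply depends on the format."""
    imm = word & 0xFFFF
    if imm & 0x8000:                 # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
        "imm": imm,
    }

T0, T1, T2 = 8, 9, 10

add_word = encode_r(rs=T1, rt=T2, rd=T0, shamt=0, funct=0x20)   # add $t0, $t1, $t2
addi_word = encode_i(opcode=0x08, rs=T1, rt=T0, imm=5)          # addi $t0, $t1, 5

print(f"{add_word:#010x}")    # prints: 0x012a4020
print(f"{addi_word:#010x}")   # prints: 0x21280005
print(decode_fields(add_word)["rd"], decode_fields(addi_word)["imm"])  # prints: 8 5
```

Because the opcode always sits in bits 31-26 and the register fields sit in fixed positions, the hardware can start reading registers before it knows which format the instruction uses.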

Compilers

A compiler translates high-level program code into machine instructions. It usually consists of the following parts:

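Front end. The front end takes the program source and performs lexical analysis and parsing to convert it into an intermediate form. It outputs an intermediate representation (IR) of the program, such as an abstract syntax tree or three-address code.
High-level optimizations. This phase performs high-level, source-oriented optimizations that need almost no knowledge of the ISA, such as: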
- Constant propagation, constant folding
- Redundancy elimination
- Loop transformations
- Procedure integration (automatic inlining)
- Dead and unreachable code elimination
Low-level optimizations. This phase converts the code to a lower-level IR and applies optimizations that depend to some degree on the ISA, such as:
- Strength reduction
- Machine idioms
- Register allocation
- Cache-conscious loop transformations
Code generation. This step converts the IR into assembly code. Possible optimizations include:
- Code placement
- Low-level feedback-directed optimization, e.g. branch hints
- Instruction selection (e.g. mov 0 vs. xor)
- Peephole optimizations
Assembly. This step converts the assembly code into executable machine instructions, possibly inserting any necessary alignment.
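
Two of the optimizations listed above, constant folding (high-level) and strength reduction (low-level), can be sketched on a toy IR. The IR here is an assumption made for the example: an expression is an int constant, a variable name, or a tuple (op, left, right). Real compilers use much richer representations.

```python
import operator

# Toy IR: an expression is an int, a variable name (str),
# or a tuple (op, left, right). This format is hypothetical.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Constant folding: evaluate operators whose operands are both constants."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if op in OPS and isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)

def strength_reduce(expr):
    """Strength reduction: rewrite x * 2**k as the cheaper x << k."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = strength_reduce(left), strength_reduce(right)
    is_power_of_two = (isinstance(right, int) and right > 0
                       and (right & (right - 1)) == 0)
    if op == "*" and is_power_of_two:
        return ("<<", left, right.bit_length() - 1)
    return (op, left, right)

expr = ("*", "x", ("+", 6, 2))       # x * (6 + 2)
folded = fold(expr)
print(folded)                        # prints: ('*', 'x', 8)
print(strength_reduce(folded))       # prints: ('<<', 'x', 3)
```

The order matters: folding first exposes the constant 8, which is what lets the later pass recognize a power of two.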

[2] Pipeline and ILP

Pipelined CPUs

Let's consider a five-stage pipeline for a RISC-V microprocessor.

IF: Instruction fetch. Fetch the instruction from memory at the address held in the program counter (PC), then update the PC.
ID: Decode instructions. Read the register sources mentioned in the instruction from the register file. If the instruction is a jump, add the PC-relative offset (sign-extended) to the program counter.
EX: Execute the ALU instruction, or generate the effective address for a memory operation. Feed the ALU (arithmetic/logic unit) the register operands read in the previous stage and produce a result.
MM: If the instruction is a load or store, access the memory through the effective address generated in the previous stage.
WB: Write register values generated in the EX or MM stages back to the register file.
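
The way instructions overlap in these five stages can be sketched as a timing diagram. The sketch assumes an ideal pipeline with no stalls, where one instruction enters IF per cycle; pipeline_diagram is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]

def pipeline_diagram(n_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = ["  "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name        # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows

for i, row in enumerate(pipeline_diagram(3)):
    print(f"i{i}: " + " ".join(row))
```

With n instructions and 5 stages the ideal pipeline finishes in n + 4 cycles instead of 5n, so for long instruction streams the throughput approaches one instruction per cycle.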

Clock Frequency

The clock. A CPU, like many other kinds of digital circuits, marches through tasks to the beat of the clock. Every time the clock ticks, a new set of events occurs: some results are generated, some values are transmitted across busses.
The clock frequency. The clock frequency tells us how often the clock can tick. The higher the clock frequency, the higher the potential throughput of the microprocessor. Clock frequency is measured in clocks per second (Hz), or these days, billions of clocks per second (GHz).
The clock period. The clock period is simply the inverse of the clock frequency. It's measured in seconds per clock, or, these days, picoseconds per clock. It tells us the maximum gate delay that any pipeline stage may have, including latch delay for the special registers that buffer results from one stage to the next.
Gate depth and delay. The clock frequency depends on the depth of the circuits being clocked. If a signal must propagate serially through many logic gates in a single clock cycle, the clock will be slower than if there are only a few gates in series.
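
The relationship between frequency, period, and gate depth is simple arithmetic. The sketch below assumes every gate has the same delay, which real circuits do not; the function names and the 10 ps gate delay are made up for the example.

```python
def period_ps(frequency_hz):
    """Clock period in picoseconds: the inverse of the frequency."""
    return 1e12 / frequency_hz

def max_frequency_ghz(gate_delay_ps, gates_in_series):
    """Highest clock rate if the deepest path is `gates_in_series` gates long."""
    critical_path_ps = gate_delay_ps * gates_in_series
    return 1e3 / critical_path_ps    # 1 / (1 ps) = 1000 GHz

print(period_ps(4e9))                # 4 GHz clock -> 250.0 ps per cycle
print(max_frequency_ghz(10, 25))     # 25 gates of 10 ps each -> 4.0 GHz
print(max_frequency_ghz(10, 50))     # twice the depth -> 2.0 GHz
```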

Pipelining Increases the Clock Frequency

Imagine the circuitry of a simple processor. It must have gates that do the work of all five stages of the pipeline described above, even if it isn't pipelined. A signal must propagate from the beginning of the circuit through its maximum-depth path before the clock can tick again. If we divide the circuitry into five independent and balanced stages, the length of this path is divided by five, so the clock frequency can be multiplied by five. If we can find a way to divide the work into ten stages, then the clock frequency can be multiplied by ten. This is how it works ideally; in practice the improvement is more modest. Some barriers to this "perfect" clock improvement are:
// In a non-pipelined processor, the signal must pass through every stage in each cycle, even when a stage does no useful work in that cycle, so the finer the pipeline granularity, the higher the clock frequency can be raised.

Finding balance. It's difficult to divide the work of executing instructions into n stages that all have exactly the same gate delay, with that delay being 1/nth that of the original design. The clock frequency is limited by the delay of the deepest stage.
Latch delay. Pipeline implementation includes latches or pipeline registers between each stage that communicate results from one stage to the next. As pipelines become deeper and clock rates increase, the delay of these latches becomes a significant component of the clock period.
Power. As the clock rate increases, the number of switching events per second in the processor increases. Improvements in cooling and power supplies (e.g. batteries) are much slower than improvements in clock rate, so power and energy limit clock rates in today's processors.
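
The first two barriers can be put into numbers. This is a rough model under stated assumptions: the pipelined clock period is the slowest stage plus one latch delay, and the unpipelined period is the sum of all stage delays. The delays used (200 ps stages, 20 ps latch) are invented for illustration.

```python
def pipelined_period_ps(stage_delays_ps, latch_delay_ps):
    """The clock must wait for the slowest stage plus the pipeline latch."""
    return max(stage_delays_ps) + latch_delay_ps

def clock_speedup(stage_delays_ps, latch_delay_ps):
    """Clock-rate gain relative to running all the logic in one long cycle."""
    unpipelined_period = sum(stage_delays_ps)
    return unpipelined_period / pipelined_period_ps(stage_delays_ps, latch_delay_ps)

balanced = [200, 200, 200, 200, 200]        # ideal split of 1000 ps of logic
unbalanced = [150, 250, 200, 200, 200]      # same total, one slow stage

print(clock_speedup(balanced, 0))                  # ideal case: 5.0
print(round(clock_speedup(balanced, 20), 2))       # latch delay only: 4.55
print(round(clock_speedup(unbalanced, 20), 2))     # imbalance as well: 3.7
```

Doubling the number of stages halves the logic per stage but not the latch delay, so the latch takes up a growing share of each cycle as the pipeline gets deeper.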

Why Pipelining Works: Instruction-Level Parallelism

Pipelines work because many instructions executing sequentially are doing independent tasks that can be done in parallel.

This kind of parallelism is called instruction-level parallelism (ILP). In the simple pipeline we have seen, ILP can be hard to come by; however, there are many tricks people have invented for squeezing more ILP out of the instruction stream, like instruction reordering and speculation.

Obstacles to Pipelining: Hazards

We have seen a few physical limitations to pipelining. However, the three main difficulties with pipelining have to do with the nature of the instruction stream being executed. These hazards can prevent a pipeline stage from correctly carrying out its purpose.

Structural hazards. These occur when instructions contend for the same resources in the CPU. For instance, if the register file has only one write port, but for some reason the instruction stream has generated two writes to the register file in a single cycle, one of the offending pipeline stages will have to wait. Structural hazards can often be solved by throwing more hardware at the problem, with the penalty of increased gate count, complexity, and possibly delay.
Data hazards. This happens when an instruction in the pipeline depends on data from another instruction that is also in the pipeline. For instance, consider these two instructions:
  <i>:   add r1, r2, r3 // r1 := r2 + r3
  <i+1>: add r4, r1, r5 // r4 := r1 + r5
There are many techniques for solving this problem. Forwarding (or bypass) is the main technique.
Control hazards. This happens when a control-flow transfer instruction depends on results that are not ready yet. For instance, every conditional branch presents a control flow hazard, since the condition isn't available in time to fetch the next instruction from the right place.

Dependences

As an introduction to data hazards, we will see the different ways that instructions can be dependent on one another.

Note that dependences are a property of programs. Not all dependences will affect the pipeline; we are really interested in dependence in a small window of instructions for pipelines.

However, the compiler uses dependence information over a much larger region to produce more efficient code.

Data dependence. Also called true dependence or flow dependence. This is where data needed by one instruction is produced by a previous instruction, or where data needed by one instruction flows through a chain of dependent instructions from some source.
Name dependences. This type of dependence occurs when two instructions use the same register or memory location, but there is no flow of data between the instructions.
- Anti-dependence.
  <i>:   add r1, r2, r3 // r1 := r2 + r3
  <i+1>: add r3, r4, r5 // r3 := r4 + r5
  There is an anti-dependence between the two instructions because the second instruction writes register r3, which is read by the first instruction. The processor must guarantee that the first instruction reads the correct value before the second instruction overwrites it.
- Output dependence. This occurs when two instructions both write the same register.

When these dependences occur in such a way that they are exposed to the pipeline, three different types of data hazards may occur:

RAW, or read-after-write. An instruction tries to read an operand before a previous instruction has a chance to write it. This is caused by a true dependence.
WAW, or write-after-write. An instruction tries to write an operand before a previous instruction has a chance to write it. This is caused by an output dependence.
WAR, or write-after-read. An instruction writes to an operand before it can be read by a previous instruction, so the previous instruction incorrectly gets the new value. This is caused by an anti-dependence.
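
The mapping from dependences to hazard types can be sketched as a small classifier. It assumes each instruction writes one register and reads a tuple of source registers, as the add examples in these notes do; Instr and potential_hazards are hypothetical names.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    dest: str        # register written
    srcs: tuple      # registers read

def potential_hazards(earlier, later):
    """Name the hazards that can arise between an earlier and a later instruction."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")             # true dependence
    if later.dest in earlier.srcs:
        found.add("WAR")             # anti-dependence
    if later.dest == earlier.dest:
        found.add("WAW")             # output dependence
    return found

i0 = Instr("r1", ("r2", "r3"))                          # add r1, r2, r3
print(potential_hazards(i0, Instr("r4", ("r1", "r5")))) # prints: {'RAW'}
print(potential_hazards(i0, Instr("r3", ("r4", "r5")))) # prints: {'WAR'}
print(potential_hazards(i0, Instr("r1", ("r4", "r5")))) # prints: {'WAW'}
```

Whether a potential hazard becomes a real one depends on the pipeline. In the simple in-order five-stage pipeline, registers are read in ID and written in WB in program order, so only RAW hazards need handling; WAR and WAW matter once instructions can complete out of order.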

Solutions to Data Hazards

Pipeline stall cycles. Can resolve any type of hazard. Freeze the pipeline up to the dependent stage until the hazard is resolved.
Forwarding (bypass). If the data is available elsewhere in the pipeline, then there is no need to stall. When the dependence is detected, the data is forwarded directly to the consuming pipeline stage.
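
How many stall cycles forwarding saves can be estimated for the five-stage pipeline. This is a sketch under textbook assumptions that the notes do not state: without forwarding, the register file is written in the first half of WB and read in the second half of ID; with forwarding, an ALU result is usable the cycle after EX and a loaded value the cycle after MM. stall_cycles is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]

def stall_cycles(distance, producer_is_load, forwarding):
    """Stalls for a RAW dependence; distance=1 means back-to-back instructions.

    The producer enters IF in cycle 0, so it occupies stage s in cycle s.
    The consumer, `distance` instructions later, occupies stage s in
    cycle distance + s if it is not stalled.
    """
    if forwarding:
        produced_in = "MM" if producer_is_load else "EX"
        ready = STAGES.index(produced_in) + 1      # usable the following cycle
        needed = distance + STAGES.index("EX")     # consumed at the ALU input
    else:
        ready = STAGES.index("WB")                 # written early in WB
        needed = distance + STAGES.index("ID")     # read late in ID
    return max(0, ready - needed)

print(stall_cycles(1, producer_is_load=False, forwarding=False))  # prints: 2
print(stall_cycles(1, producer_is_load=False, forwarding=True))   # prints: 0
print(stall_cycles(1, producer_is_load=True, forwarding=True))    # prints: 1
```

Under these assumptions forwarding removes every stall between ALU instructions, but a load followed immediately by a use of the loaded value still costs one cycle, because the data does not exist until the end of MM.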

katsusan / gowiki

Computer Architecture #17