This weekend I took a break from the decompiler/compiler stuff and started looking at VU1 stuff. The bad news is there's a lot of VU micro mode code. We're not at the point where we need this yet, but we should start thinking about it now. I wrote a VU micro mode disassembler, found all the VU programs I could, and disassembled them.
What is VU micro mode?
VU "micro" mode is when a VU runs as a separate processor. Each VU has integer and vector float registers. VU0's vector float registers are the same as the EE vf registers. Each VU has a program memory for storing a microprogram and a data memory that can be read/written by the microprogram. These memories can be filled through DMA (VIF).
VU micro instructions are 64 bits and contain two instructions. One (the upper instruction) is a vector instruction very similar to the VU0 macro instructions, for example "vaddx.yw vf1, vf3, vf5" or "vclip". The other (the lower instruction) is mostly integer or control flow stuff like branches, integer math, and integer loads and stores. There are a few weird lower floating point instructions like divide and square root.
The VUs run at 300MHz, and in the best case, execute a single 64-bit instruction per cycle.
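To make the pairing concrete, here's roughly what a couple of disassembled 64-bit instructions look like (the registers here are made up, not from a real program). In the listings later in this post, the lower instruction is on the left and the upper instruction is on the right:
iaddi vi01, vi01, 1 | add.xyzw vf01, vf02, vf03 ;; an integer add runs alongside a vector add
nop | maxw.xyzw vf04, vf05, vf00 ;; no lower op this cycle; vf00.w is the constant 1.0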
VU0 vs. VU1
The PS2 has two VUs: VU0 and VU1.
VU0
VU0 is rarely used in micro mode. Most games don't use it this way, but of course Jak does. Compared to VU1, it has smaller code and data memories. When VU0 is executing in micro mode, you cannot use any vector instructions on the EE. Also, the vf registers on the EE are the same as the vf registers in VU0 micro mode.
In Jak, it seems like VU0 is always activated directly from EE code, not as part of the main graphics DMA list. The usual use is to run vector floating point code on VU0 while the EE does something else (accessing main memory, etc.). It generally does math (like collision or transforms) rather than graphics work directly.
VU1
VU1 has a few special features. It has an "elementary function unit" that can compute 1/x^2, arctan, sine, 1/sqrt(x), etc. It also has a special connection to the GS. There is a magic "xgkick" instruction that transfers a GS packet from VU memory directly to the GS. The usual usage is to build up some primitives to send to the GS, then run xgkick.
Typical Setup
Typically VU0 has its program/data memory loaded by doing an immediate VIF0 DMA transfer. Then the EE uses the vcallms instruction to start VU0 at a specific point in the VU0 memory. In Jak, it seems like the result is stored in VU0's vf registers, which are shared with the EE.
Typically VU1 has its program/data memory loaded as part of the graphics DMA chain. The program is started by a special VIFcode contained within the DMA list. There is a built-in feature to double buffer data loading. Half of the VU data memory will be updated by DMA (EE -> VU data) while the other half is being used by a currently executing microprogram.
It's also common for the microprogram to double-buffer GS packets with xgkick. While one GS packet is being xgkicked, a second one will be built.
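Putting that together, the skeleton of a typical VU1 microprogram main loop might look something like this. This is a rough sketch, not code from the game; the registers, label, and buffer handling are made up:
L_top:
xtop vi01 | nop ;; vi01 = base of the input buffer the VIF just finished filling
lqi.xyzw vf01, (vi01++) | nop ;; read input data (vertices, colors, etc.) and advance the pointer
;; ... transform vertices and build GIF tags + primitives in the output buffer at vi02 ...
xgkick vi02 | nop ;; hand the finished GS packet to the GS
;; ... toggle vi02 between the two output buffer halves ...
b L_top | nop ;; loop; the VIF fills the other input buffer in the meantime
nop | nop ;; branch delay slot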
How to port?
It's likely that we will want to change how these work. The format used by the GS won't be efficient with a modern graphics pipeline, so we should look for ways to understand and rewrite the renderers.
Unfortunately there are 19 microprograms, and many of them are pretty complicated. Here are a few strategies we could take:
Port to OpenGOAL vector instructions
Some programs use only instructions that are identical to existing OpenGOAL vector instructions. It would be pretty easy to just port these to OpenGOAL. We'd probably need to build a tool to actually do the mapping, but it shouldn't be too hard. There are a few complications:
In the original game, these would run independently from the EE code. There may be synchronization issues that must be solved manually.
In OpenGOAL, vector registers aren't preserved across functions. We will need to manually backup and restore vector registers in places where the game relies on this.
Some instructions may not port exactly, like loading/storing data, checking "flags", or branching.
OpenGOAL's implementation may not be that efficient: x86 has fewer vector registers than the PS2, which can lead to inefficient stack spills.
I think this approach will work for all VU0 programs. These functions do not create GS data directly, so we want to make them behave exactly like the original. They also run directly from the EE code, so it would be easy to have them in OpenGOAL. There are only a handful of these programs and they are pretty short, so it would be reasonable to fix up the small issues mentioned above manually.
Understand and rewrite
This strategy would be to understand what a VU1 program does and reimplement that functionality in C++/shaders as part of the graphics system. When the graphics system hits a VU1 program in the DMA list, it would detect which one it is, then execute some C++ function instead. The C++ function would be designed to eventually draw this data with a glDrawElements or similar, instead of the PS2's GS, so it can bypass the GS format. Ideally most of the floating point math could be done in a shader.
This strategy would be awesome: it would be efficient (a C++ implementation designed for PC graphics) and easy to modify, but it might not be realistic. We might be unable to understand how the VU code works.
Port to C++
This approach would be to build a tool that translates a VU program into (really messy) C++ code. The C++ code would have exactly the same behavior as the original. One advantage of this approach is that you could test it within PCSX2 (with some hacks). But there are a number of disadvantages:
Have to deal with the GS format at the output
Transformations can't be done on the PC GPU
Hard to understand what's going on and modify things
A general VU -> C++ translator is similar in difficulty to a VU interpreter, which is hard.
List of Programs
1 Background VU0 (60 instructions)
This program contains 6 separate functions. All are simple, with no branching. Instructions are add, multiply, load, and store, and a single flag check.
I believe the program is intended to work like this:
Before "background" rendering starts, run the first program. This stores a camera transformation in VU0 data memory.
There are a few programs that simply load the stored transformation (or part of it) back into vf registers, probably for use on the EE.
There are two programs that transform some points in vf registers.
For example, after running the first program, background renderers may do something like:
vcallms 336 to transform the point currently stored in vf2. Some time later, the vf6 register will have the result.
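Just as a guess (I haven't traced the actual code at 336), a point-transform function like this is probably little more than the standard broadcast matrix multiply. The matrix registers below are made up; only vf2 as the input and vf6 as the output come from the description above:
nop | mulax.xyzw ACC, vf08, vf02 ;; ACC = (matrix column 0) * vf2.x
nop | madday.xyzw ACC, vf09, vf02 ;; ACC += (matrix column 1) * vf2.y
nop | maddaz.xyzw ACC, vf10, vf02 ;; ACC += (matrix column 2) * vf2.z
nop | maddw.xyzw vf06, vf11, vf02 ;; vf6 = ACC + (matrix column 3) * vf2.w, with the E bit set to end the program
nop | nop ;; one more instruction pair executes after the E bit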
2 Bones VU0 (63 instructions)
This program contains two functions. Both are very simple, with no branching. Instructions are adds/multiplies/opmula and a single DIV.
3 Ocean VU0 (72 instructions)
This program contains a single function. Simple, with no branching. Has add/mul/sub/min/div.
4 Shadow VU0 (88 instructions)
3 functions. No branching. Add/mul/sub/opmula/div. Also has some move instructions.
5 Collide VU0 (90 instructions)
3 functions. All lower. No branching. add/mul/sub/ftoi/itof.
6 Mercneric VU0 (201 instructions)
Unclear how many functions. 9 e-bit instructions. Uses data memory, integer stuff, and branching.
7 Generic VU0 (295 instructions)
14 functions. No branching. Uses integer stuff and data memory.
8 Sprite Distort VU1 (65 instructions)
Single function. Two nested loops. Inner loop assembles a GS packet. Outer loop does a single XGKICK.
9 Ocean Texture VU1 (152 instructions)
Two functions, start points are near the top with a pattern like this:
;; start at 0 for L1's program
b L1 | nop
nop | nop
;; start here for L2's program
b L2 | nop
Uses integer stuff, data memory, link/return, XGKICK and XTOP.
10 Sky VU1 (215 instructions)
Two functions, but doesn't use the above pattern.
Uses integer stuff, branching, xgkick, etc.
11 Shrub VU1 (681 instructions)
6? functions. Doesn't use the top branch pattern.
This feels like one that's too big to do manually and will definitely need tools.
12 Shadow VU1 (792 instructions)
3 functions. Uses the top branch pattern.
13 Sprite VU1 (898 instructions)
14 Tnear VU1 (957 instructions)
15 Tie VU1 (1037 instructions)
16 Generic VU1 (1178 instructions)
17 Tie Near VU1 (1892 instructions)
18 Tfrag (2008 instructions)
19 Merc (2198 instructions)
What did they do on PS3/VITA?
It seems like the VU1 stuff became shaders and C++ code. They accidentally included some debug output from the shader compiler in the PS3 build. This output has a list of inputs and outputs, then some disassembly of the shaders.
It's interesting to see that most of these shaders are super simple. One of the weirder ones is merc, which became about 150 different shaders. They all have names like Merc_ps3_S1_M1_A1_E1_F0_I0_D0, where the numbers after the letters change depending on the configuration. The "bone array" calculation was done on the GPU for merc, but the rest seem to be really simple. For example, tfrag is just a single transformation + some reasonably simple color interpolation. It seems like most of the work was done on the CPU, which is maybe a sign that these VU programs have stuff that you can't fit into a shader.
Pipeline junk
The VUs are pipelined, meaning that the result of an instruction is not available until several cycles later. Unlike most pipelined CPUs, the exact operation of the pipeline must be understood in order to write a correct program. Sony famously recommended using Microsoft Excel to lay out cycle-by-cycle diagrams for VU programs.
Most of the instructions use the "FMAC pipeline". With this pipeline, the result is available on the 4th cycle after an instruction.
mul vf5, vf10, vf20
nop ;; takes 1 cycle
nop ;; takes 1 cycle
nop ;; takes 1 cycle
mul vf3, vf5, vf20 ;;vf5 is ready
However, if you try to use the result too early, the CPU will stop executing instructions and wait until the result is ready.
mul vf5, vf10, vf20
mul vf3, vf5, vf20 ;; this will stall for 3 additional cycles in order to wait for vf5
The VUs have 4x FMAC units, so you can have multiple muls in flight at the same time:
mul vf1, vf2, vf3
mul vf4, vf5, vf6
mul vf7, vf8, vf9
mul vf10, vf11, vf12
mul vf13, vf1, vf2 ;; vf1 is done, can use it without a stall
mul vf14, vf4, vf3 ;; vf4 is done, can use it without a stall
You can end up doing some funny stuff with instructions that write to a common register.
Here is an example from the tie-near renderer. The clip instruction writes to a clipping flag that can then be read with fcget.
nop | clipw.xyz vf18, vf18
move.xyzw vf18, vf07 | clipw.xyz vf19, vf19
move.xyzw vf19, vf08 | clipw.xyz vf20, vf20
move.xyzw vf20, vf05 | nop
fcget vi10 | addx.xyz vf24, vf11, vf00 ;; vi10 gets the result of clip vf18
fcget vi11 | clipw.xyz vf08, vf08 ;; vi11 gets the result of clip vf19
fcget vi12 | clipw.xyz vf05, vf05 ;; vi12 gets the result of clip vf20
It takes 4 cycles for clip to complete, so the first fcget gets the value of the first clip, the second fcget gets the value of the second clip, and so on. This gets even harder to analyze when you take into account that instructions between the clip and the fcget may stall, so you can't just count instructions to determine how many cycles are in between two instructions. If the addx in the above example were to stall, say because it was waiting on vf11 to be computed, then the behavior would be different.
There are also other pipelines that work very differently. The div pipeline handles div/sqrt/rsqrt. Unlike FMAC, there is only one DIV unit, so you can't have multiple DIVs in flight at a time. Also, unlike FMAC, the CPU will not stall when you try to access a result before it's ready. It will stall if you try to start a second div before the first one is done. It's possible to do stuff like this:
div Q, vf01.x, vf02.x ;; start a division (the operands here are just placeholders)
;; if you read Q here, immediately after the div is issued, it will still have the result of the old division
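This can actually be useful: you can start a division, keep doing FMAC work with the old Q, and only wait when you need the new result. A sketch (registers are made up):
div Q, vf01.x, vf02.x ;; kick off a new division; the result shows up several cycles later
mulq.xyzw vf10, vf11, Q ;; still multiplies by the OLD Q from the previous division
;; ... more FMAC work while the divide unit runs ...
waitq ;; stall here (only if needed) until the new Q is ready
mulq.xyzw vf12, vf13, Q ;; now uses vf01.x / vf02.x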
There are also some undocumented behaviors. For example, the game has a spot where the last sqi instruction in a sequence increments vi08 and an ibeq immediately after reads that value. You are not supposed to do this. I believe the correct behavior is that the ibeq gets the value of vi08 from before the first sqi. Which is crazy.
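The original sequence isn't reproduced here, but a hypothetical example of the same hazard would look something like this (registers and label are made up):
sqi.xyzw vf01, (vi08++) ;; store vf01 at vi08, then increment vi08
sqi.xyzw vf02, (vi08++) ;; store and increment again
sqi.xyzw vf03, (vi08++) ;; the last sqi, incrementing vi08 a third time
ibeq vi08, vi09, L_done ;; reads vi08 right away; per the note above, it apparently sees vi08 from before the first sqi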