VRCP/VRSQ pseudocode descriptions wrong

I was trying to understand how the RSP divide instructions vrcp and vrsq work, based on the documentation at doc/rsp.md. It appears the pseudocode describing them is incorrect. I did this analysis based on vrsq but it appears the same issue exists in the vrcp documentation as well.

First of all--

scale_out = highest_set_bit(x)
scale_in = 32 - scale_out
17_bit_output = 1 || RSQ_ROM[scale_in(0) || x(scale_in-1..scale_in-8)]

This can't possibly be correct. The bits being looked at to address the ROM must be the bits immediately below the highest set bit. For example, for an input 7FFE.0000, bit 30 is the highest set bit and bits 29-22 (here all 1s) determine the mantissa of the output. According to the code as written, scale_out = 30, scale_in = 2, and the bits being looked at are bits 1, 0, and then off the right side of the number. The state of bits 1 and 0 in this number obviously should not affect the rsqrt result at all. Instead the code should be

scale_in = highest_set_bit(x)
17_bit_output = 1 || RSQ_ROM[scale_in(0) || x(scale_in-1..scale_in-8)]

Now, for the output scale, I can't 100% confirm or prove where the bits should be as I'm not (yet) testing any of this on actual hardware. But in order to match the statement in the SGI docs about where the radix point goes, which also happens to be the choice that produces the largest range of output values that have "full precision", the code would have to be:

shift_out = scale_in >> 1
result(31..0) = (17_bit_output || 15 zeros) >> shift_out

Or to write it more like the current pseudocode, though I think this is more confusing:

shift_out = 31 - (scale_in / 2)
result(shift_out:shift_out-16) = 17_bit_output

Finally, I'd suggest putting some additional explanatory text and examples:

If the input number is S15.16 format, the output is S7.24 format. Alternatively, if the input number is in "0.S31" format (i.e. 80000000 represents -0.5 and 7FFFFFFF represents about 0.49999, which is what you get if you use single precision with vmulf etc. and then use $v0 as the least significant bits), the output is in S15.16 format, which is very convenient for further manipulation. Some examples:

40000000
si = 30, table index = 000, result = 1FFFF, right shift 15
final result: 0001FFFF

which is approximately 00020000. So this can be rsqrt(4000.0000) = 00.020000, or rsqrt(0.40000000) = 0002.0000 (with these all in hex).

10000000, si = 28, table index = 000, result = 1FFFF, right shift 14, final result 0003FFFE or about 00040000

20000000, si = 29, table index = 100, result = 16A09, right shift 14, final result 0002D412
calculator says 1/sqrt(0x0.2) = 2.D413CCCFE, so this is correct

There's also a typo where rcp is used at the bottom of rsq, though that's pretty easy to not get caught up on.

I was trying to understand how the RSP divide instructions vrcp and vrsq work, based on the documentation at doc/rsp.md. It appears the pseudocode describing them is incorrect.

That can be. What is surely correct, instead, is both the code in this repo (see vrcp.rs) and the code in Ares (https://github.com/ares-emulator/ares/blob/98bd18757b0b5085a7c16a238aa89a0328e31a69/ares/n64/rsp/interpreter-vpu.cpp#L1255-L1276) as both underwent bruteforce testing against real hardware. So when in doubt, you can fully trust either implementation.

This can't possibly be correct. The bits being looked at to address the ROM must be the bits immediately below the highest set bit. For example, for an input 7FFE.0000, bit 30 is the highest set bit and bits 29-22 (here all 1s) determine the mantissa of the output.

Yes, that is surely a typo. Basically, scale_in and scale_out are swapped. I have pushed a fixed version.

Finally, I'd suggest putting some additional explanatory text and examples:

Thanks for these. I'm slowly moving the RSP documentation to https://n64brew.dev/wiki/Reality_Signal_Processor/CPU_Core as I want to eventually archive this version. In that page, the structure includes coding examples and suggestions for homebrew developers to really understand how each instruction should be used. This document was born more for emulator authors that don't really need to understand what "VMADN" is and how it can be used in the context of a T&L pipeline, as long as they can simulate the bit exact behavior.

Thanks for your reply!

I'm not familiar with Ares or r64emu; I found your doc through it being posted in the armips thread about undocumented behaviors of the memory instructions, which I was trying to use. Then when I needed to use vrsq and didn't find enough info in the official SGI docs, I came back to the same doc.

It seems that perhaps the issue arose from highest_set_bit and clz being opposite operations.

Basically, scale_in and scale_out are swapped. I have pushed a fixed version.

The fixed version is still not correct:

scale_in = highest_set_bit(x)
scale_out = 32 - scale_in
scale_in = scale_in / 2

It is scale_out which has the / 2 applied to it, but that is before the 32 -.

I come from the modding side of the N64 community, which is generally completely oblivious to the homebrew side of the community. So, I was not aware that other RSP documentation was available. In fact, an RSP microcode disassembler was only just developed a year or so ago (and it has problems with segmented addresses, which had to be fixed by hand), so modding existing microcodes could not really have been done before then. Most modders won't even consider using modded microcodes, cause that will break the default HLE plugins in popular emulators and thus cut out 90% of the audience for their mods.

Anyway, thanks for the link to the N64brew wiki! It looks like there's much more info on the doc here at least for now though.

I'm not familiar with Ares or r64emu;

r64emu is abandoned, it's not archived because of this doc. Ares is active: https://ares-emu.net It is currently one of the most accurate n64 emulators, including virtually perfect RSP emulation.

The fixed version is still not correct:
scale_in = highest_set_bit(x)
scale_out = 32 - scale_in
scale_in = scale_in / 2
It is scale_out which has the / 2 applied to it, but that is before the 32 -.

It's true that it's scale_out, but it should be after the 32 - I think. Why you think otherwise?

I come from the modding side of the N64 community

Thanks for the background. Yeah the two communities are mostly separated for now though there are some crosstalks.

Let me know if you find something else, thank you for the review.

Sorry for the delay in following up on this.

Comparing to: https://github.com/rasky/r64emu/blob/master/src/sp/vrcp.rs#L11

The line in your pseudocode version doing 32 - scale_in is not equivalent to the Rust line rshift = 32 - lshift. The pseudocode does not need this "inversion" of the shift quantity, because it is using the function highest_set_bit whereas the Rust version is using leading_zeros (which are opposites).

The pseudocode line which should be doing 32 - scale_out happens after the halving of the shift amount (rshift >>= 1), in the line ((0x10000 | (rom as u32)) << 14) >> rshift. In the pseudocode, this is selecting which bit of the output variable to start at, which is effectively a left shift operation. That's why the value must be the opposite of the right shift value from the Rust version.

I would suggest either completely changing the pseudocode to match the Rust way the algorithm is written, or carefully double-checking that the pseudocode version is correct.

rasky / r64emu

VRCP/VRSQ pseudocode descriptions wrong #13