hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org
Other
11.25k stars 2.17k forks source link

Improving accuracy of vfpu_sin #16946

Open fp64 opened 1 year ago

fp64 commented 1 year ago

What should happen

This is a work-in-progress/research topic, I just thought I'd put it there for people to see.

This concerns implementation of y=vfpu_sin(x), which computes an approximation of $\sin(\frac{\pi}{2} x)$.

Technical: I'm sometimes using hex-floats notation for clarity.

Looking at the table the following things appear to be true:

  1. For $|x|>2^{32}$ the output seems garbage (small non-zero values), despite x always being even integer in this case. How does one even get this wrong? Then again, on x87 fsin instruction returns x for $|x|>2^{64}$. Hopefully, games do not rely on this.
  2. For inf/nan the output is specific nan (0x7F800001 or 0xFF800001, the same sign bit as x). Note, that this appears to be a signaling nan, though apparently some processors treat it backwards.
  3. The floating point output y always has two lowest bits as 0 (next one may be 0 or 1) .
  4. The usual symmetries are obeyed, which allows to safely reduce x to $[0;1]$. Note: result is -0 when x is in [...,-6, -2, -0, +2, +6,...].
  5. The output is uniquely defined by trunc(x*0x1p+23f) - i.e. input is effectively ??:23 fixed point.
  6. Output is always multiple of $2^{-28}$, i.e. y*0x1p+28f is integer.

The above two points mean that the problem is reduced to essentially integer: a -> b, where a=int(0x1p+23f*x), and b=int(0x1p+28f*y). This gives us a more manageable table of $2^{23}+1$ entries. Attached below: table.bin.zip The contents (in binary) are raw values of b as 32bit LE integers (i.e. 4 bytes at offset 4*a are the value of b that corresponds to a). Where the value is unknown (absent in original table) 0xFFFFFFFF is used (not possible to mistake for a real output, since it is larger than $2^{28}$). The table is dense for x>=0.25, but has only 60 values below:

Show ```c // a, b { { 0, 0}, { 1, 50}, { 2, 100}, { 3, 150}, { 4, 201}, { 5, 251}, { 6, 301}, { 7, 351}, { 8, 402}, { 9, 452}, { 10, 502}, { 11, 552}, { 12, 603}, { 13, 653}, { 14, 703}, { 15, 753}, { 16, 804}, { 32, 1608}, { 63, 3166}, { 64, 3216}, { 129, 6484}, { 255, 12817}, { 256, 12867}, { 258, 12968}, { 516, 25937}, { 1023, 51422}, { 1024, 51472}, { 1033, 51924}, { 2066, 103849}, { 4095, 205839}, { 4096, 205889}, { 4132, 207699}, { 4608, 231625}, { 8264, 415397}, { 16383, 823501}, { 16384, 823552}, { 16529, 830840}, { 22528, 1132380}, { 33059, 1661715}, { 65535, 3294065}, { 65536, 3294116}, { 66118, 3323368}, { 132237, 6646276}, { 149130, 7495120}, { 262143, 13171452}, { 262144, 13171504}, { 264474, 13288480}, { 298260, 14984388}, { 528948, 26544376}, { 596520, 29922048}, { 1048575, 52369104}, { 1048576, 52369152}, { 1057896, 52828544}, { 1168434, 58264416}, { 1193040, 59471152}, { 1310720, 65224496}, { 1435360, 71283488}, { 1572864, 77922688}, { 1714176, 84691872}, { 1917732, 94337248}, }; ```

Note, that for $|x| < 10^{-3}$ we have $|\sin(\frac{\pi}{2} x)-(\frac{\pi}{2} x)| < 2^{-28}$.

Assuming we can satisfactorily patch the blank spots in the table, this may turn into a compression problem, i.e. even without knowing what HW actually does we just need a way to recreate the exact table. It is, in principle, possible to bite the bullet and just (optionally) ship the table with PPSSPP (not that I actually advocate that) - it is 4MB zipped, and 32MB in mem, which while not nice, might be tolerable.

Please do take this with a grain of salt, it is possible I missed (or messed up) something.

Who would this benefit

If this does indeed improve emulation accuracy, it may make more games correctly playable, to benefit of everyone concerned.

Platform (if relevant)

None

Games this would be useful in

Games that rely on specific values produced by vfpu, not sure which

Other emulators or software with a similar feature

No response

Checklist

hrydgard commented 1 year ago

Cool. So we should have the PSP generate a new table from 0.23 fixpoint, from 0 to 1 would be enough (or even 0 to 0.5).

Then I guess theoretically we can try to compute first, second and third derivatives of that sequence. If those series becomes a series of straight lines or plateaus at some derivative, we'll know we're dealing with polynomial interpolation of a much smaller table, which I think might be likely. Though I don't know if precision is enough for those derivatives to make sense...

Storing the full table will hopefully not be necessary, though by taking the derivative before compressing, the result should be really compressible... (actually that might be nonsense given how sin and cos are each other's derivative, sometimes with flipped signs, but we have to be dealing with an approximation here...)

fp64 commented 1 year ago

from 0 to 1 would be enough (or even 0 to 0.5).

To be clear

The table is dense for x>=0.25

means, that all values for [0.25;1] are already known. Only [0;0.25] remains.

Storing the full table will hopefully not be necessary

For one thing 32MB might be larger than L3 cache.

though by taking the derivative before compressing, the result should be really compressible...

Tried it:

 33554436  table.bin
  3430864  table.bin.zip
  4645219  table.bin.zst
 33554436  delta1.bin
   171407  delta1.bin.zip
   238125  delta1.bin.zst
 33554436  delta2.bin
   175539  delta2.bin.zip
   220832  delta2.bin.zst

delta1 and delta2 are first and second finite differences. Both zip and zstd were invoked with -9.

hrydgard commented 1 year ago

Oh yeah, didn't read carefully enough, so a smaller range to dump.

Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P

By the way, 4 way symmetry is good, but 8 way symmetry might be possible with some flipping around of signs and offsetting. Or hm, maybe not...

fp64 commented 1 year ago

Note: result is -0 when x is in [...,-6, -2, -0, +2, +6,...].

Grr, no, it's [...,-8,-4,-0,+2,+6,...]: both sources of signflip (signbit(x) and (x mod 4==2)) just XOR with each other.

unknownbrackets commented 1 year ago

Oh, hm. Maybe I overwrote that file I uploaded before with a smaller subset. It took a lot of time to generate, I thought I saved it... anyway, I had dumped some of it as binary. Let me just regenerate, at least for exponents between 0x60 - 0x7F....

-[Unknown]

unknownbrackets commented 1 year ago

https://forums.ppsspp.org/uploads/sin_60-7F.zip https://forums.ppsspp.org/uploads/cos_60-7F.zip

Here's all the exponents from 60-7F, each as a separate file. 60-67 aren't that interesting, just included to represent. The data is simply input, output repeated in 4 byte pairs.

-[Unknown]

fp64 commented 1 year ago

Here's all the exponents from 60-7F, each as a separate file.

Thanks!

Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P

Well, after the first delta we already get long repeats of the same value, e.g. 50,50,50,50,50,49,49,49,49,49. Second delta gets us long repeats of zeroes, e.g.: 0,0,0,0,0,-1,0,0,0,0 which isn't much better for compression, I think (e.g. for LZ the first example is something like LITERAL(50),COPY(-1,4),LITERAL(49),COPY(-1,4), and the second is LITERAL(0),COPY(-1,4),LITERAL(-1),LITERAL(0),COPY(-1,3) - so one command more).

fp64 commented 1 year ago

Converted the provided files. So, now we have a complete fixed23->fixed28 table for sin: sin.bin.zip Again, just uint32 outputs (input is index, i.e. offset in file divided by 4), LE, in binary.

Haven't yet verified that cos uses the same table (reversed).

Again, thanks for actually dumping the data!

fp64 commented 1 year ago

The draft implementation of the reduction:

static uint32_t vfpu_sin_table[(1<<23)+1]={
#include "sin.bin.h" // Or #embed, or whatever.
};

static inline uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Table-based for now.
    return vfpu_sin_table[arg];
}

// For |x|<2^32 should produce exactly the
// same value (bitwise) as PSP. For finite
// x>=2^32 produces 0, whereas PSP sometimes
// produces junk (which looks like small
// non-zero values). Hopefully games do not rely
// on this. For inf/nan produces qNaN, but
// with a different payload than PSP (PSP produces
// sNaN).
static inline float vfpu_sin(float x)
{
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return copysignf(x-x,x); // Beware -ffast-math.
    // Reduce to [-2;+2].
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff. Also, -0 remains -0.
    // Flip sign according to half-period.
    if(n&1) x=-x;
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(fabsf(x)*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) arg=(1u<<24)-arg;
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed(arg)))*0x1p-28f*copysignf(1.0f,x);
}

// WARNING: not tested, this is just a guess.
static inline float vfpu_cos(float x)
{
    x=fabsf(x);
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return 1.0f+(x-x);
    // Reduce to [-2;+2].
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff.
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(x*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) {arg=(1u<<24)-arg; n^=1;}
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed((1u<<23)-arg)))*0x1p-28f*(n&1?-1.0f:+1.0f);
}

Edit: minor fixes.

Not ready for deployment, but may be useful as-is to test if it unbreaks some picky games.

fp64 commented 1 year ago

I'm using fixed 0.28 since it can represent all output values, but that might not be what is going on internally. As I already mentioned, the 2 lower bits of the bitwise representation of floating point output are always 0, meaning output has at most 22 (binary) significant digits. In other words, the stepping of the output in fixed 0.28 is

uint32_t quantum(uint32_t x)
{
    return x<1u<<22?
        1u:
        1u<<(32-22-count_leading_zeroes(x));
}

One idea I had is that we need an approximation (treated as real-valued here) f(i) that passes each region table[i]<=f(i)<table[i]+quantum(table[i]) (this may be whole input range, or piecewise), and then simply truncate it (to quantum). This is similar to what RLIBM does. This does require a bit fancy solver though.

Maybe something simpler suffices.

@unknownbrackets, you mentioned that you don't think this is CORDIC. Can you elaborate?

hrydgard commented 1 year ago

I don't believe it's CORDIC either because CORDIC is iterative, so it takes a variable-amount of time depending on the angle, and I don't think we've observed that. I would guess it's an interpolated table lookup from a smaller table, but I could be wrong of course.

fp64 commented 1 year ago

Hm, got runtime size down to 2MB: table of quadratic coefficients + table of exceptions. It's ugly, but may be workable. Currently searches exceptions via binary search, but it can be improved. In case you want it: vfpu_sin_fixed.zip

This implements extern uint32_t vfpu_sin_fixed(uint32_t arg), which you can plug into the reduction code above to give you actual vfpu_sin and vfpu_cos.

It is verified to produce the exactly the same result as vfpu_sin_table. The only advantage is that now memory for lookup table is down to 2MB from 32MB.

fp64 commented 1 year ago

The code itself (so you don't need to download to take a look):

vfpu_sin_fixed.cpp ```c++ #include "stdint.h" #include "quadratic_coefs.h" #include "exceptions.h" static inline uint32_t quantum(uint32_t x) { /* return x<1u<<22? 1u: 1u<<(32-22-__builtin_clz(x)); */ int i=0; while((x>>i)>=0x00400000) ++i; return uint32_t(1)<>7),j=int32_t(arg&0x7F); const int32_t *coef=quadratic_coefs[i]; int64_t A=coef[0],B=coef[1],C=coef[2]; // Compute approximation. Is off by at most 1 quantum. const int PA=30,PB=25,PC=3; uint32_t v=uint32_t(((((A*j>>(PA-PB))+B)*j>>(PB-PC))+C)>>PC); v=truncate_bits(v); // Look up exceptions. Binary search for now, but this // can be done better. unsigned lo=0,hi=sizeof(exceptions)/sizeof(exceptions[0]); while(lo>31?~b:b); if(e==arg) { v+=quantum(v)*(b>>31?-1u:+1u); break; } else if(e
hrydgard commented 1 year ago

That's cool, nicely done. Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware. Getting to the point where it could be worth it if it unbreaks stuff in games (that Ridge Racer replay, or whatever it was, maybe?).

fp64 commented 1 year ago

Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware.

Yeah, e.g. this mentions sizes around 10KB.

fp64 commented 1 year ago

Haven't yet verified that cos uses the same table (reversed).

and done.

fp64 commented 1 year ago

Made it uglier, but tables fit in 720 KB. Not sure if it's worth it.

Code:

vfpu_sin_fixed.cpp ```c++ #include "stdint.h" #include "vfpu_sin_lut.h" static inline uint32_t quantum(uint32_t x) { //return x<1u<<22? // 1u: // 1u<<(32-22-__builtin_clz(x)); int i=0; while((x>>i)>=0x00400000) ++i; return uint32_t(1)<>28); } // Crude approximation, may be off by around 5 quantums. static inline uint32_t vfpu_sin_approx(uint32_t x) { if((int32_t)x<0) return -vfpu_sin_approx(-x); x&=0x00FFFFFFu; if(x>=0x00800000u) x=0x01000000u-x; x*=(1u<<(28-23))/2; // Coefficients are for sinpi, so we divide by 2. uint32_t x2=mul28(x,x); return mul28(x, 843314880u -mul28(x2,1387197952u -mul28(x2, 684549312u -mul28(x2, 160694304u -mul28(x2, 21002648u))))); } uint32_t vfpu_sin_fixed(uint32_t arg) { // Handle endpoints. if(arg==0u) return 0u; if(arg==0x00800000) return 0x10000000; // Get cubic coefficients. // Note: coef stores deltas from crude approximation. const signed char *coef=vfpu_sin_lut_coefs[arg>>8]; int64_t P[4]; for(unsigned i=0;i<4;++i) { P[i]=int32_t(vfpu_sin_approx((arg&-256u)+256u*(i-1))); int32_t q=(int32_t)quantum(uint32_t(P[i]<0?-P[i]:P[i])); P[i]=8192LL*P[i]+512LL*coef[i]*q; } // Compute approximation. Is off by at most 1 quantum. uint32_t v=uint32_t(icerp(P[0],P[1],P[2],P[3],arg&255,256)/8192); v=truncate_bits(v); // Look up exceptions via binary search. // Note: vfpu_sin_lut_intervals stores // deltas from interval estimation. int32_t lo=int32_t(459376LL*( arg &-128u)/N)+16384 +vfpu_sin_lut_intervals[ arg >>7]; int32_t hi=int32_t(459376LL*((arg+128u)&-128u)/N)+16384 +vfpu_sin_lut_intervals[(arg+128u)>>7]; while(lo

vfpu_sin_fixed.zip

There should be a better way, but well...

Not sure if it's good enough to consider making pull request.

fp64 commented 1 year ago

It's also probably somewhat slower than before.

Also, amusingly, the table is not monotonic:

TABLE[5242176]==223176256
TABLE[5242177]==223176192
fp64 commented 1 year ago

In case anyone is curious, the junk for $|x|>2^{32}$ looks like this:

Table Input | (as float) | Output | (as float) ---------|------------------|----------|------------------ 7F800000 | inf| 7F800001 | nan FF800000 | -inf| FF800001 | -nan 7E800000 | 8.50705917e+37| 00000000 | 0 7F000000 | 1.70141183e+38| 00000000 | 0 7EC00000 | 1.27605888e+38| 00000000 | 0 7F400000 | 2.55211775e+38| 00000000 | 0 4FFFFFFF | 8.58993408e+09| 00000000 | 0 4F812345 | 4.33314458e+09| 00000000 | 0 CF812345 | -4.33314458e+09| 80000000 | -0 4FFFFFFF | 8.58993408e+09| 00000000 | 0 50012345 | 8.66628915e+09| BCE4BBA0 | -0.0279214978 D0012345 | -8.66628915e+09| 3CE4BBA0 | 0.0279214978 50FFFFFF | 3.43597363e+10| B5490000 | -7.4878335e-07 50812345 | 1.73325783e+10| 3D64A4C4 | 0.0558211952 D0812345 | -1.73325783e+10| BD64A4C4 | -0.0558211952 50FFFFFF | 3.43597363e+10| B5490000 | -7.4878335e-07 51012345 | 3.46651566e+10| 3DE44980 | 0.111468315 D1012345 | -3.46651566e+10| BDE44980 | -0.111468315 51FFFFFF | 1.37438945e+11| B6490000 | -2.9951334e-06 51812345 | 6.93303132e+10| 3E62DD4C | 0.221547306 D1812345 | -6.93303132e+10| BE62DD4C | -0.221547306 51FFFFFF | 1.37438945e+11| B6490000 | -2.9951334e-06 52012345 | 1.38660626e+11| 3EDD3A0C | 0.432083488 D2012345 | -1.38660626e+11| BEDD3A0C | -0.432083488 52FFFFFF | 5.49755781e+11| B7490000 | -1.19805336e-05 52812345 | 2.77321253e+11| 3F47827C | 0.779334784 D2812345 | -2.77321253e+11| BF47827C | -0.779334784 52FFFFFF | 5.49755781e+11| B7490000 | -1.19805336e-05 53012345 | 5.54642506e+11| 3F7A0754 | 0.976674318 D3012345 | -5.54642506e+11| BF7A0754 | -0.976674318 53FFFFFF | 2.19902312e+12| B8490C00 | -4.79333103e-05 53812345 | 1.10928501e+12| BED6C01C | -0.419434428 D3812345 | -1.10928501e+12| 3ED6C01C | 0.419434428 53FFFFFF | 2.19902312e+12| B8490C00 | -4.79333103e-05 54012345 | 2.21857002e+12| 3F42F284 | 0.761512995 D4012345 | -2.21857002e+12| BF42F284 | -0.761512995 54FFFFFF | 8.7960925e+12| B9491000 | -0.000191748142 54812345 | 4.43714005e+12| 3F7CB5C4 | 0.987148523 D4812345 | -4.43714005e+12| BF7CB5C4 | -0.987148523 54FFFFFF | 8.7960925e+12| B9491000 | -0.000191748142 55012345 | 8.87428009e+12| BEA18974 | -0.315501809 D5012345 | -8.87428009e+12| 3EA18974 | 0.315501809 55FFFFFF | 3.518437e+13| BA491040 | -0.000766996294 55812345 | 1.77485602e+13| 3F194954 | 0.598775148 D5812345 | -1.77485602e+13| BF194954 | -0.598775148 55FFFFFF | 3.518437e+13| BA491040 | -0.000766996294 56012345 | 3.54971204e+13| 3F758A18 | 0.959138393 D6012345 | -3.54971204e+13| BF758A18 | -0.959138393 56FFFFFF | 1.4073748e+14| BB491000 | -0.00306797028 56812345 | 7.09942407e+13| 3F0AF1B4 | 0.542750597 D6812345 | -7.09942407e+13| BF0AF1B4 | -0.542750597 56FFFFFF | 1.4073748e+14| BB491000 | -0.00306797028 57012345 | 1.41988481e+14| BF696590 | -0.911705971 D7012345 | -1.41988481e+14| 3F696590 | 0.911705971 57FFFFFF | 5.6294992e+14| BC490E90 | -0.0122715384 57812345 | 2.83976963e+14| BF3FC764 | -0.749136209 D7812345 | -2.83976963e+14| 3F3FC764 | 0.749136209 57FFFFFF | 5.6294992e+14| BC490E90 | -0.0122715384 58012345 | 5.67953926e+14| 3F7E1324 | 0.992479563 D8012345 | -5.67953926e+14| BF7E1324 | -0.992479563 58FFFFFF | 2.25179968e+15| BD48FB30 | -0.0490676761 58812345 | 1.13590785e+15| BE78CFCC | -0.242980182 D8812345 | -1.13590785e+15| 3E78CFCC | 0.242980182 58FFFFFF | 2.25179968e+15| BD48FB30 | -0.0490676761 59012345 | 2.2718157e+15| 3EF15AE8 | 0.471396685 D9012345 | -2.2718157e+15| BEF15AE8 | -0.471396685 59FFFFFF | 9.00719872e+15| BE47C5C0 | -0.195090294 59812345 | 4.54363141e+15| 3F54DB30 | 0.831469536 D9812345 | -4.54363141e+15| BF54DB30 | -0.831469536 59FFFFFF | 9.00719872e+15| BE47C5C0 | -0.195090294 5A012345 | 9.08726281e+15| 3F6C835C | 0.923879385 DA012345 | -9.08726281e+15| BF6C835C | -0.923879385 5AFFFFFF | 3.60287949e+16| BF3504F0 | -0.70710659 5A812345 | 1.81745256e+16| BF3504F0 | -0.70710659 DA812345 | -1.81745256e+16| 3F3504F0 | 0.70710659 5AFFFFFF | 3.60287949e+16| BF3504F0 | -0.70710659 5B012345 | 3.63490513e+16| 3F800000 | 1 DB012345 | -3.63490513e+16| BF800000 | -1 5BFFFFFF | 1.44115179e+17| 80000000 | -0 5B812345 | 7.26981025e+16| 80000000 | -0 DB812345 | -7.26981025e+16| 00000000 | 0 5BFFFFFF | 1.44115179e+17| 80000000 | -0 5C012345 | 1.45396205e+17| 00000000 | 0 DC012345 | -1.45396205e+17| 80000000 | -0 5CFFFFFF | 5.76460718e+17| 00000000 | 0 5C812345 | 2.9079241e+17| 00000000 | 0 DC812345 | -2.9079241e+17| 80000000 | -0 5CFFFFFF | 5.76460718e+17| 00000000 | 0 5D012345 | 5.8158482e+17| 00000000 | 0 DD012345 | -5.8158482e+17| 80000000 | -0 5DFFFFFF | 2.30584287e+18| 00000000 | 0 5D812345 | 1.16316964e+18| 00000000 | 0 DD812345 | -1.16316964e+18| 80000000 | -0 5DFFFFFF | 2.30584287e+18| 00000000 | 0 5E012345 | 2.32633928e+18| 00000000 | 0 DE012345 | -2.32633928e+18| 80000000 | -0 5EFFFFFF | 9.22337149e+18| 00000000 | 0 5E812345 | 4.65267856e+18| 00000000 | 0 DE812345 | -4.65267856e+18| 80000000 | -0 5EFFFFFF | 9.22337149e+18| 00000000 | 0 5F012345 | 9.30535712e+18| 00000000 | 0 DF012345 | -9.30535712e+18| 80000000 | -0 5FFFFFFF | 3.68934859e+19| 00000000 | 0 5F812345 | 1.86107142e+19| 00000000 | 0 DF812345 | -1.86107142e+19| 80000000 | -0 5FFFFFFF | 3.68934859e+19| 00000000 | 0 60012345 | 3.72214285e+19| BCE4BBA0 | -0.0279214978 E0012345 | -3.72214285e+19| 3CE4BBA0 | 0.0279214978 60FFFFFF | 1.47573944e+20| B5490000 | -7.4878335e-07 60812345 | 7.4442857e+19| 3D64A4C4 | 0.0558211952 E0812345 | -7.4442857e+19| BD64A4C4 | -0.0558211952 60FFFFFF | 1.47573944e+20| B5490000 | -7.4878335e-07 61012345 | 1.48885714e+20| 3DE44980 | 0.111468315 E1012345 | -1.48885714e+20| BDE44980 | -0.111468315 61FFFFFF | 5.90295775e+20| B6490000 | -2.9951334e-06 61812345 | 2.97771428e+20| 3E62DD4C | 0.221547306 E1812345 | -2.97771428e+20| BE62DD4C | -0.221547306 61FFFFFF | 5.90295775e+20| B6490000 | -2.9951334e-06 62012345 | 5.95542856e+20| 3EDD3A0C | 0.432083488 E2012345 | -5.95542856e+20| BEDD3A0C | -0.432083488 62FFFFFF | 2.3611831e+21| B7490000 | -1.19805336e-05 62812345 | 1.19108571e+21| 3F47827C | 0.779334784 E2812345 | -1.19108571e+21| BF47827C | -0.779334784 62FFFFFF | 2.3611831e+21| B7490000 | -1.19805336e-05 63012345 | 2.38217142e+21| 3F7A0754 | 0.976674318 E3012345 | -2.38217142e+21| BF7A0754 | -0.976674318 63FFFFFF | 9.4447324e+21| B8490C00 | -4.79333103e-05 63812345 | 4.76434285e+21| BED6C01C | -0.419434428 E3812345 | -4.76434285e+21| 3ED6C01C | 0.419434428 63FFFFFF | 9.4447324e+21| B8490C00 | -4.79333103e-05 64012345 | 9.52868569e+21| 3F42F284 | 0.761512995 E4012345 | -9.52868569e+21| BF42F284 | -0.761512995 64FFFFFF | 3.77789296e+22| B9491000 | -0.000191748142 64812345 | 1.90573714e+22| 3F7CB5C4 | 0.987148523 E4812345 | -1.90573714e+22| BF7CB5C4 | -0.987148523 64FFFFFF | 3.77789296e+22| B9491000 | -0.000191748142 65012345 | 3.81147428e+22| BEA18974 | -0.315501809 E5012345 | -3.81147428e+22| 3EA18974 | 0.315501809 65FFFFFF | 1.51115718e+23| BA491040 | -0.000766996294 65812345 | 7.62294855e+22| 3F194954 | 0.598775148 E5812345 | -7.62294855e+22| BF194954 | -0.598775148 65FFFFFF | 1.51115718e+23| BA491040 | -0.000766996294 66012345 | 1.52458971e+23| 3F758A18 | 0.959138393 E6012345 | -1.52458971e+23| BF758A18 | -0.959138393 66FFFFFF | 6.04462874e+23| BB491000 | -0.00306797028 66812345 | 3.04917942e+23| 3F0AF1B4 | 0.542750597 E6812345 | -3.04917942e+23| BF0AF1B4 | -0.542750597 66FFFFFF | 6.04462874e+23| BB491000 | -0.00306797028 67012345 | 6.09835884e+23| BF696590 | -0.911705971 E7012345 | -6.09835884e+23| 3F696590 | 0.911705971 67FFFFFF | 2.4178515e+24| BC490E90 | -0.0122715384 67812345 | 1.21967177e+24| BF3FC764 | -0.749136209 E7812345 | -1.21967177e+24| 3F3FC764 | 0.749136209 67FFFFFF | 2.4178515e+24| BC490E90 | -0.0122715384 68012345 | 2.43934354e+24| 3F7E1324 | 0.992479563 E8012345 | -2.43934354e+24| BF7E1324 | -0.992479563 68FFFFFF | 9.67140598e+24| BD48FB30 | -0.0490676761 68812345 | 4.87868707e+24| BE78CFCC | -0.242980182 E8812345 | -4.87868707e+24| 3E78CFCC | 0.242980182 68FFFFFF | 9.67140598e+24| BD48FB30 | -0.0490676761 69012345 | 9.75737415e+24| 3EF15AE8 | 0.471396685 E9012345 | -9.75737415e+24| BEF15AE8 | -0.471396685 69FFFFFF | 3.86856239e+25| BE47C5C0 | -0.195090294 69812345 | 1.95147483e+25| 3F54DB30 | 0.831469536 E9812345 | -1.95147483e+25| BF54DB30 | -0.831469536 69FFFFFF | 3.86856239e+25| BE47C5C0 | -0.195090294 6A012345 | 3.90294966e+25| 3F6C835C | 0.923879385 EA012345 | -3.90294966e+25| BF6C835C | -0.923879385 6AFFFFFF | 1.54742496e+26| BF3504F0 | -0.70710659 6A812345 | 7.80589932e+25| BF3504F0 | -0.70710659 EA812345 | -7.80589932e+25| 3F3504F0 | 0.70710659 6AFFFFFF | 1.54742496e+26| BF3504F0 | -0.70710659 6B012345 | 1.56117986e+26| 3F800000 | 1 EB012345 | -1.56117986e+26| BF800000 | -1 6BFFFFFF | 6.18969983e+26| 80000000 | -0 6B812345 | 3.12235973e+26| 80000000 | -0 EB812345 | -3.12235973e+26| 00000000 | 0 6BFFFFFF | 6.18969983e+26| 80000000 | -0 6C012345 | 6.24471946e+26| 00000000 | 0 EC012345 | -6.24471946e+26| 80000000 | -0 6CFFFFFF | 2.47587993e+27| 00000000 | 0 6C812345 | 1.24894389e+27| 00000000 | 0 EC812345 | -1.24894389e+27| 80000000 | -0 6CFFFFFF | 2.47587993e+27| 00000000 | 0 6D012345 | 2.49788778e+27| 00000000 | 0 ED012345 | -2.49788778e+27| 80000000 | -0 6DFFFFFF | 9.90351972e+27| 00000000 | 0 6D812345 | 4.99577556e+27| 00000000 | 0 ED812345 | -4.99577556e+27| 80000000 | -0 6DFFFFFF | 9.90351972e+27| 00000000 | 0 6E012345 | 9.99155113e+27| 00000000 | 0 EE012345 | -9.99155113e+27| 80000000 | -0 6EFFFFFF | 3.96140789e+28| 00000000 | 0 6E812345 | 1.99831023e+28| 00000000 | 0 EE812345 | -1.99831023e+28| 80000000 | -0 6EFFFFFF | 3.96140789e+28| 00000000 | 0 6F012345 | 3.99662045e+28| 00000000 | 0 EF012345 | -3.99662045e+28| 80000000 | -0 6FFFFFFF | 1.58456316e+29| 00000000 | 0 6F812345 | 7.9932409e+28| 00000000 | 0 EF812345 | -7.9932409e+28| 80000000 | -0 6FFFFFFF | 1.58456316e+29| 00000000 | 0 70012345 | 1.59864818e+29| BCE4BBA0 | -0.0279214978 F0012345 | -1.59864818e+29| 3CE4BBA0 | 0.0279214978 70FFFFFF | 6.33825262e+29| B5490000 | -7.4878335e-07 70812345 | 3.19729636e+29| 3D64A4C4 | 0.0558211952 F0812345 | -3.19729636e+29| BD64A4C4 | -0.0558211952 70FFFFFF | 6.33825262e+29| B5490000 | -7.4878335e-07 71012345 | 6.39459272e+29| 3DE44980 | 0.111468315 F1012345 | -6.39459272e+29| BDE44980 | -0.111468315 71FFFFFF | 2.53530105e+30| B6490000 | -2.9951334e-06 71812345 | 1.27891854e+30| 3E62DD4C | 0.221547306 F1812345 | -1.27891854e+30| BE62DD4C | -0.221547306 71FFFFFF | 2.53530105e+30| B6490000 | -2.9951334e-06 72012345 | 2.55783709e+30| 3EDD3A0C | 0.432083488 F2012345 | -2.55783709e+30| BEDD3A0C | -0.432083488 72FFFFFF | 1.01412042e+31| B7490000 | -1.19805336e-05 72812345 | 5.11567418e+30| 3F47827C | 0.779334784 F2812345 | -5.11567418e+30| BF47827C | -0.779334784 72FFFFFF | 1.01412042e+31| B7490000 | -1.19805336e-05 73012345 | 1.02313484e+31| 3F7A0754 | 0.976674318 F3012345 | -1.02313484e+31| BF7A0754 | -0.976674318 73FFFFFF | 4.05648168e+31| B8490C00 | -4.79333103e-05 73812345 | 2.04626967e+31| BED6C01C | -0.419434428 F3812345 | -2.04626967e+31| 3ED6C01C | 0.419434428 73FFFFFF | 4.05648168e+31| B8490C00 | -4.79333103e-05 74012345 | 4.09253934e+31| 3F42F284 | 0.761512995 F4012345 | -4.09253934e+31| BF42F284 | -0.761512995 74FFFFFF | 1.62259267e+32| B9491000 | -0.000191748142 74812345 | 8.18507868e+31| 3F7CB5C4 | 0.987148523 F4812345 | -8.18507868e+31| BF7CB5C4 | -0.987148523 74FFFFFF | 1.62259267e+32| B9491000 | -0.000191748142 75012345 | 1.63701574e+32| BEA18974 | -0.315501809 F5012345 | -1.63701574e+32| 3EA18974 | 0.315501809 75FFFFFF | 6.49037069e+32| BA491040 | -0.000766996294 75812345 | 3.27403147e+32| 3F194954 | 0.598775148 F5812345 | -3.27403147e+32| BF194954 | -0.598775148 75FFFFFF | 6.49037069e+32| BA491040 | -0.000766996294 76012345 | 6.54806295e+32| 3F758A18 | 0.959138393 F6012345 | -6.54806295e+32| BF758A18 | -0.959138393 76FFFFFF | 2.59614827e+33| BB491000 | -0.00306797028 76812345 | 1.30961259e+33| 3F0AF1B4 | 0.542750597 F6812345 | -1.30961259e+33| BF0AF1B4 | -0.542750597 76FFFFFF | 2.59614827e+33| BB491000 | -0.00306797028 77012345 | 2.61922518e+33| BF696590 | -0.911705971 F7012345 | -2.61922518e+33| 3F696590 | 0.911705971 77FFFFFF | 1.03845931e+34| BC490E90 | -0.0122715384 77812345 | 5.23845036e+33| BF3FC764 | -0.749136209 F7812345 | -5.23845036e+33| 3F3FC764 | 0.749136209 77FFFFFF | 1.03845931e+34| BC490E90 | -0.0122715384 78012345 | 1.04769007e+34| 3F7E1324 | 0.992479563 F8012345 | -1.04769007e+34| BF7E1324 | -0.992479563 78FFFFFF | 4.15383724e+34| BD48FB30 | -0.0490676761 78812345 | 2.09538014e+34| BE78CFCC | -0.242980182 F8812345 | -2.09538014e+34| 3E78CFCC | 0.242980182 78FFFFFF | 4.15383724e+34| BD48FB30 | -0.0490676761 79012345 | 4.19076029e+34| 3EF15AE8 | 0.471396685 F9012345 | -4.19076029e+34| BEF15AE8 | -0.471396685 79FFFFFF | 1.6615349e+35| BE47C5C0 | -0.195090294 79812345 | 8.38152057e+34| 3F54DB30 | 0.831469536 F9812345 | -8.38152057e+34| BF54DB30 | -0.831469536 79FFFFFF | 1.6615349e+35| BE47C5C0 | -0.195090294 7A012345 | 1.67630411e+35| 3F6C835C | 0.923879385 FA012345 | -1.67630411e+35| BF6C835C | -0.923879385 7AFFFFFF | 6.64613958e+35| BF3504F0 | -0.70710659 7A812345 | 3.35260823e+35| BF3504F0 | -0.70710659 FA812345 | -3.35260823e+35| 3F3504F0 | 0.70710659 7AFFFFFF | 6.64613958e+35| BF3504F0 | -0.70710659 7B012345 | 6.70521646e+35| 3F800000 | 1 FB012345 | -6.70521646e+35| BF800000 | -1 7BFFFFFF | 2.65845583e+36| 80000000 | -0 7B812345 | 1.34104329e+36| 80000000 | -0 FB812345 | -1.34104329e+36| 00000000 | 0 7BFFFFFF | 2.65845583e+36| 80000000 | -0 7C012345 | 2.68208658e+36| 00000000 | 0 FC012345 | -2.68208658e+36| 80000000 | -0 7CFFFFFF | 1.06338233e+37| 00000000 | 0 7C812345 | 5.36417317e+36| 00000000 | 0 FC812345 | -5.36417317e+36| 80000000 | -0 7CFFFFFF | 1.06338233e+37| 00000000 | 0 7D012345 | 1.07283463e+37| 00000000 | 0 FD012345 | -1.07283463e+37| 80000000 | -0 7DFFFFFF | 4.25352933e+37| 00000000 | 0 7D812345 | 2.14566927e+37| 00000000 | 0 FD812345 | -2.14566927e+37| 80000000 | -0 7DFFFFFF | 4.25352933e+37| 00000000 | 0 7E012345 | 4.29133853e+37| 00000000 | 0 FE012345 | -4.29133853e+37| 80000000 | -0 7EFFFFFF | 1.70141173e+38| 00000000 | 0 7E812345 | 8.58267707e+37| 00000000 | 0 FE812345 | -8.58267707e+37| 80000000 | -0 7EFFFFFF | 1.70141173e+38| 00000000 | 0 7F012345 | 1.71653541e+38| 00000000 | 0 FF012345 | -1.71653541e+38| 80000000 | -0
hrydgard commented 1 year ago

Wow, this is some wacky stuff. Non-monotonicity within the wrapped range is really surprising! Might possibly be some kind of hint to what's really going on, or it's just another indication that we won't be able to find it, because it's so weird :P

unknownbrackets commented 1 year ago

I think I had theorized that only some of the exponent bits are respected, causing this pattern. It seemed to mostly work. See:

    if (k > 0x80) {
        const uint8_t over = k & 0x1F;
        mantissa = (mantissa << over) & 0x00FFFFFF;
        k = 0x80;
    }

-[Unknown]

fp64 commented 1 year ago

const uint8_t over = k & 0x1F;

When I tried it before, it didn't seem to work. But I must have botched something, because now it does get mostly correct results:

https://godbolt.org/z/YbzhbshPd

Thanks!

hrydgard commented 1 year ago

A common way to "linearize" floats between 0 and 1 before doing lookups into a table is to simply add 1.0, after that, the mantissa represents the linear position between 1 and 2 and you can just drop the exponent and use the mantissa from there on as your table lookup index. But this is from a software perspective. That the exponent seems to wrap like that makes me think that there's a table lookup involving that too, or some specialized hardware that effectively does the same as adding 1.0 but cheaper and thus limited in the range of exponents it can process...

fp64 commented 1 year ago

It just looks like a shift nobody bothered to mask (same as x86 shr,shl instructions).

It might be even cyclic shift (so no need to branch on sign of exponent-0x7F) with some additional masking.

unknownbrackets commented 1 year ago

I think a table that's down to 720 KB is interesting. Not that terrible RAM usage if it fixes things.

Games I know that are or could be affected by sin/cos accuracy:

I wonder how much slower this variant is than what we do now, though... we'd probably end up not wanting to add it everywhere.

-[Unknown]

hrydgard commented 1 year ago

sin and cos are generally not the most frequently used mathematical operations, since they're normally used for things like setting up factors that are then multiplied with a lot of other things, typical example being a rotation matrix so a not-gigantic slowdown in these might not be so bad...

Though, given they are pretty fast on the hardware, maybe some games over-use them for other stuff..

And I agree that 720kb is getting palatable..

fp64 commented 1 year ago

Ok, I think this matches all available data:

static float vfpu_sin(float x)
{
    uint32_t bits;
    memcpy(&bits,&x,sizeof(x));
    uint32_t sign=bits&0x80000000u;
    uint32_t exponent=(bits>>23)&0xFFu;
    uint32_t significand=(bits&0x007FFFFFu)|0x00800000u;
    if(exponent==0xFFu)
    {
        // NOTE: this bitpattern is a signaling
        // NaN on x86, so maybe just return
        // a normal qNaN?
        float y;
        bits=sign^0x7F800001u;
        memcpy(&y,&bits,sizeof(y));
        return y;
    }
    if(exponent<0x7Fu)
    {
        if(exponent<0x7Fu-23u) significand=0u;
        else significand>>=(0x7F-exponent);
    }
    else if(exponent>0x7Fu)
    {
        // There is weirdness for large exponents.
        if(exponent-0x7Fu>=25u&&exponent-0x7Fu<32u) significand=0u;
        else if((exponent&0x9Fu)==0x9Fu) significand=0u;
        else significand<<=(exponent-0x7Fu);
    }
    sign^=((significand<<7)&0x80000000u);
    significand&=0x00FFFFFFu;
    if(significand>0x00800000u) significand=0x01000000u-significand;
    uint32_t ret=vfpu_sin_fixed(significand);
    return (sign?-1.0f:+1.0f)*float(int32_t(ret))*3.7252903e-09f; // 0x1p-28f
}

Notes:

hrydgard commented 1 year ago

Cool! So the remaining piece is reducing vfpu_sin_fix further if possible.

memcpy is perfectly endian-safe for something like this, as I haven't heard of an architecture that has different endian-ness for its floating point. In addition, we only support little-endian architectures in practice anyway.

fp64 commented 1 year ago

Actually, I still do not have the data to test reduction for cos (for larger values). It's easy to make a guess, but confirmation would be nice (e.g. does the sign of NaN match the input, or fabsf(input)?).

Should I make a PR? Not to be merged as-is, but just to have an artifact to test things with? If so:

Speed, as measured on my machine: Method Time
sinf(x*float(M_PI_2)) 1.0x
(float)sin(x*M_PI_2) 1.6x
32MB table, lin. access 1.83x
32MB table, rnd. access 3.6x
720KB table,vfpu_sin_fixed 12.0x

Aside: pedantry compels me to use 'significand' instead of 'manitissa', but that's just me.

fp64 commented 1 year ago

Does not seem to fix #2990 (the behavior is still the same as the 1st video). This much was already suspected.

Incidentally, when I got the sign of cos wrong (didn't do the signflip for [1;2)), the camera started behaving weirdly, but the path of the car itself seemed the same as before. Maybe sin/cos aren't used, or maybe angles are always in [-1;+1].

fp64 commented 1 year ago

Made it simpler (linear interpolation), smaller (combined LUTs add up to 484256 bytes), and faster (2.5x slower than sinf(C*x)).

#include "stdint.h"

#include "vfpu_sin_lut.h"

static inline uint32_t quantum(uint32_t x)
{
    //return x<1u<<22?
    //    1u:
    //    1u<<(32-22-__builtin_clz(x));
    int i=0;
    while((x>>i)>=0x00400000) ++i;
    return uint32_t(1)<<i;
}

static inline uint32_t truncate_bits(uint32_t x)
{
    return x&-quantum(x);
}

uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Handle endpoints.
    if(arg==0u) return 0u;
    if(arg==0x00800000) return 0x10000000;
    // Get endpoints for 8192-wide interval.
    uint32_t L=vfpu_sin_lut8192[(arg>>13)+0];
    uint32_t H=vfpu_sin_lut8192[(arg>>13)+1];
    // Approximate endpoints for 64-wide interval via lerp.
    uint32_t A=L+(((H-L)*(((arg>>6)&127)+0))>>7);
    uint32_t B=L+(((H-L)*(((arg>>6)&127)+1))>>7);
    // Adjust endpoints from deltas, and increase working precision.
    uint64_t a=(uint64_t(A)<<5)+uint64_t(vfpu_sin_lut_delta[arg>>6][0])*quantum(A);
    uint64_t b=(uint64_t(B)<<5)+uint64_t(vfpu_sin_lut_delta[arg>>6][1])*quantum(B);
    // Compute approximation via lerp. Is off by at most 1 quantum.
    uint32_t v=uint32_t(((a*(64-(arg&63))+b*(arg&63))>>6)>>5);
    v=truncate_bits(v);
    // Look up exceptions via binary search.
    // Note: vfpu_sin_lut_interval_delta stores
    // deltas from interval estimation.
    uint32_t lo=((169u*((arg>>7)+0))>>7)+uint32_t(vfpu_sin_lut_interval_delta[(arg>>7)+0])+16384u;
    uint32_t hi=((169u*((arg>>7)+1))>>7)+uint32_t(vfpu_sin_lut_interval_delta[(arg>>7)+1])+16384u;
    while(lo<hi)
    {
        uint32_t m=(lo+hi)/2;
        // Note: vfpu_sin_lut_exceptions stores
        // index&127 (for each initial interval the
        // upper bits of index are the same, namely
        // arg&-128), plus direction (0 for +1, and
        // 128 for -1).
        uint32_t b=vfpu_sin_lut_exceptions[m];
        uint32_t e=(arg&-128u)+(b&127u);
        if(e==arg)
        {
            v+=quantum(v)*(b>>7?-1u:+1u);
            break;
        }
        else if(e<arg) lo=m+1;
        else           hi=m;
    }
    return v;
}

vfpu_sin_fixed.zip

I think I'm generally ok with this implementation.

fp64 commented 1 year ago

Just to collect some issues that mention sin/cos:

2921

2990

6710

12900

13605

13671

13705

fp64 commented 1 year ago

If anyone wants to just test if this fixes something, there are some artifacts in my fork: https://github.com/fp64/ppsspp/actions/runs/4211662398#artifacts

sum2012 commented 1 year ago

13671 and #13705 still work with @fp64 build

fp64 commented 1 year ago

Thanks! Any observable slowdown?

Note: I simply changed the implementation in one file. Just to be sure: currently everything gets the same sin/cos functions, right? No compat setting (DoublePrecisionSinCos introduced in #13706 is removed), no jit/interpreter differences, all is routed to vfpu_sin, vfpu_cos, vfpu_sincos?

sum2012 commented 1 year ago

I am using AMD FX 8350 cpu so that no observable slowdown

unknownbrackets commented 1 year ago

I think 2.5x is probably a best case and won't cause much slowdown. I recall that Crisis Core uses vrot (to calculate sin/cos) a good bit and was sped up when we routed it through jit, so that might be a good test case. Probably it will be unnoticeable on any desktop, though.

// Copyright (c) 2012- PPSSPP Project. in the include file should probably be 2023. I think this would be good to include, even if there's no game we're (currently) certain it fixes.

I'm pretty sure everything should already be using the vfpu_{sin,cos,sincos} paths. Even jit just calls that, so it should all be through that.

I'll generate some data for cos() higher values, including inf/nan.

How interested would you be in trying to figure out more things like vrsq, vsqrt (both of which could, potentially, help Ridge Racer), vrcp, or vasin? I guess we should also look at Dissidia and see what instructions could have an impact there. I am worried that it's rounding, though, which would be a pain...

-[Unknown]

unknownbrackets commented 1 year ago

Here's a smattering of additional cos values: cos_others.zip

As we know the universe demands at least a modicum of symmetry, here's sin for the same: sin_others.zip

-[Unknown]

fp64 commented 1 year ago

Thanks!

// Copyright (c) 2012- PPSSPP Project. in the include file should probably be 2023.

I see, ok.

cos_others.zip

Oh, yes. vfpu_cos(1.0f) is -0.0f. Because of course it is. I think this was in previously available data, too. Thankfully, the fix is trivial. After the fix, except sNaNs turning into qNaNs - all passes.

sin_others.zip

All passes (again, ignoring NaN thing).

How interested would you be in trying to figure out more things like vrsq, vsqrt (both of which could, potentially, help Ridge Racer), vrcp, or vasin?

Yes, I'm interested to take a look, as long as the data is provided.

unknownbrackets commented 1 year ago

Yes, I'm interested to take a look, as long as the data is provided.

Ask, and you shall receive. Here's the same inputs for asin (organized the same way to avoid confusion): https://forums.ppsspp.org/uploads/asin_60-7F.zip https://forums.ppsspp.org/uploads/asin_others.zip

Also, here's just 7F and "others" if you want to look at any of exp2/log2/rexp2/rsq/sqrt (obviously these all have a larger domain): https://forums.ppsspp.org/uploads/exp_log_sqrt_7F_others.zip

I looked at sqrt/rsq a bit ago but haven't really checked the others. We might be fine on exp2/log2/etc. or maybe there are weird differences.

Our current asin is asinf(x) / M_PI_2.

-[Unknown]

fp64 commented 1 year ago

Thanks, downloaded.

"Reasonable" interval for sqrt/rsqrt is [1;4), since $\sqrt{4 x}=2 \sqrt{x}$, so that's our reduction (assuming HW isn't being extra weird).

hrydgard commented 1 year ago

For rsqrt, it could be interesting (for inspiration of things to try) to look at how Dolphin implements rsqrte :

https://github.com/dolphin-emu/dolphin/blob/ffdc8538a162b1ca56e3943c5a78c02358053b2a/Source/Core/Common/FloatUtils.cpp#L87

The PowerPC only has two approximate math functions though, frsqrte and frcp.

unknownbrackets commented 1 year ago

Right, that's fair. Hopefully it just scales the exponent in that way.

For good measure, I included from 0x60 - 0x81: https://forums.ppsspp.org/uploads/sqrt_60-81.zip https://forums.ppsspp.org/uploads/sqrt_others.zip

https://forums.ppsspp.org/uploads/rsqrt_60-81.zip https://forums.ppsspp.org/uploads/rsqrt_others.zip

-[Unknown]

fp64 commented 1 year ago

Looking at sqrt on [1;4).

3F8920FE[      1.07131934] -> 3F847C60[      1.03504562]
3F892100[      1.07131958] -> 3F847C5C[      1.03504515]

3F8F6BFE[      1.12048316] -> 3F877DE0[       1.0585289]
3F8F6C00[       1.1204834] -> 3F877DDC[      1.05852842]

3F93F37E[      1.15586829] -> 3F899D50[       1.0751133]
3F93F380[      1.15586853] -> 3F899D4C[      1.07511282]

3F9DB8FE[      1.23220801] -> 3F8E1614[      1.11004877]
3F9DB900[      1.23220825] -> 3F8E1610[      1.11004829]

3F9F4B7E[      1.24449134] -> 3F8ECAEC[      1.11556768]
3F9F4B80[      1.24449158] -> 3F8ECAE8[      1.11556721]

3FA1ACFE[       1.2630918] -> 3F8FDB18[      1.12387371]
3FA1AD00[      1.26309204] -> 3F8FDB14[      1.12387323]

3FA3C0FE[      1.27932715] -> 3F90C704[      1.13107347]
3FA3C100[      1.27932739] -> 3F90C700[        1.131073]

3FA3E77E[      1.28050208] -> 3F90D808[      1.13159275]
3FA3E780[      1.28050232] -> 3F90D804[      1.13159227]

3FA3EAFE[      1.28060889] -> 3F90D994[      1.13163996]
3FA3EB00[      1.28060913] -> 3F90D990[      1.13163948]

3FA5B57E[       1.2946012] -> 3F91A39C[      1.13780546]
3FA5B580[      1.29460144] -> 3F91A398[      1.13780499]

3FAD447E[      1.35365272] -> 3F94EC74[      1.16346598]
3FAD4480[      1.35365295] -> 3F94EC70[       1.1634655]

3FAD587E[      1.35426307] -> 3F94F50C[      1.16372824]
3FAD5880[      1.35426331] -> 3F94F508[      1.16372776]

3FADCD7E[      1.35783362] -> 3F952748[      1.16526127]
3FADCD80[      1.35783386] -> 3F952744[      1.16526079]

3FADD87E[      1.35816932] -> 3F952C00[      1.16540527]
3FADD880[      1.35816956] -> 3F952BFC[       1.1654048]

3FAFD67E[      1.37373328] -> 3F960630[      1.17206383]
3FAFD680[      1.37373352] -> 3F96062C[      1.17206335]

3FAFF2FE[      1.37460303] -> 3F961258[      1.17243481]
3FAFF300[      1.37460327] -> 3F961254[      1.17243433]

3FB38FFE[      1.40283179] -> 3F979AD0[        1.184412]
3FB39000[      1.40283203] -> 3F979ACC[      1.18441153]

3FB3B57E[       1.4039762] -> 3F97AAA4[      1.18489504]
3FB3B580[      1.40397644] -> 3F97AAA0[      1.18489456]

3FB7A67E[      1.43476844] -> 3F99521C[      1.19781828]
3FB7A680[      1.43476868] -> 3F995218[       1.1978178]

3FB7D7FE[      1.43627906] -> 3F9966C4[      1.19844866]
3FB7D800[       1.4362793] -> 3F9966C0[      1.19844818]

3FB949FE[      1.44757056] -> 3F9A00D4[      1.20315027]
3FB94A00[       1.4475708] -> 3F9A00D0[       1.2031498]

3FB9CCFE[      1.45156837] -> 3F9A373C[      1.20481062]
3FB9CD00[       1.4515686] -> 3F9A3738[      1.20481014]

3FBB007E[      1.46095252] -> 3F9AB6A4[      1.20869875]
3FBB0080[      1.46095276] -> 3F9AB6A0[      1.20869827]

3FBD2A7E[      1.47785926] -> 3F9B9B28[      1.21567249]
3FBD2A80[       1.4778595] -> 3F9B9B24[      1.21567202]

3FBDBEFE[      1.48239112] -> 3F9BD830[      1.21753502]
3FBDBF00[      1.48239136] -> 3F9BD82C[      1.21753454]

3FBDF57E[      1.48405433] -> 3F9BEE90[      1.21821785]
3FBDF580[      1.48405457] -> 3F9BEE8C[      1.21821737]

3FC1CDFE[      1.51409888] -> 3F9D809C[      1.23048735]
3FC1CE00[      1.51409912] -> 3F9D8098[      1.23048687]

3FC1D27E[      1.51423621] -> 3F9D8270[      1.23054314]
3FC1D280[      1.51423645] -> 3F9D826C[      1.23054266]

3FC3B57E[       1.5289762] -> 3F9E4638[      1.23651791]
3FC3B580[      1.52897644] -> 3F9E4634[      1.23651743]

3FC5A9FE[      1.54425025] -> 3F9F1018[      1.24267864]
3FC5AA00[      1.54425049] -> 3F9F1014[      1.24267817]

3FC5C97E[      1.54521155] -> 3F9F1CC4[      1.24306536]
3FC5C980[      1.54521179] -> 3F9F1CC0[      1.24306488]

3FC73DFE[      1.55657935] -> 3F9FB254[      1.24762964]
3FC73E00[      1.55657959] -> 3F9FB250[      1.24762917]

3FC7D27E[      1.56111121] -> 3F9FEDCC[      1.24944448]
3FC7D280[      1.56111145] -> 3F9FEDC8[      1.24944401]

3FC7E47E[      1.56166053] -> 3F9FF500[      1.24966431]
3FC7E480[      1.56166077] -> 3F9FF4FC[      1.24966383]

3FC97BFE[      1.57409644] -> 3FA097B8[      1.25463009]
3FC97C00[      1.57409668] -> 3FA097B4[      1.25462961]

3FCB007E[      1.58595252] -> 3FA13240[      1.25934601]
3FCB0080[      1.58595276] -> 3FA1323C[      1.25934553]

3FCBAF7E[       1.5912931] -> 3FA177AC[       1.2614646]
3FCBAF80[      1.59129333] -> 3FA177A8[      1.26146412]

3FCBEF7E[      1.59324622] -> 3FA19108[       1.2622385]
3FCBEF80[      1.59324646] -> 3FA19104[      1.26223803]

3FCD407E[      1.60353065] -> 3FA21650[      1.26630592]
3FCD4080[      1.60353088] -> 3FA2164C[      1.26630545]

3FCDA97E[      1.60673499] -> 3FA23FC0[       1.2675705]
3FCDA980[      1.60673523] -> 3FA23FBC[      1.26757002]

3FCDF7FE[      1.60913062] -> 3FA25EB4[      1.26851511]
3FCDF800[      1.60913086] -> 3FA25EB0[      1.26851463]

3FD1A1FE[      1.63775611] -> 3FA3CECC[      1.27974844]
3FD1A200[      1.63775635] -> 3FA3CEC8[      1.27974796]

3FD1D0FE[      1.63919044] -> 3FA3E128[      1.28030872]
3FD1D100[      1.63919067] -> 3FA3E124[      1.28030825]

3FD3007E[      1.64845252] -> 3FA45784[      1.28392076]
3FD30080[      1.64845276] -> 3FA45780[      1.28392029]

3FD341FE[      1.65045142] -> 3FA47104[      1.28469896]
3FD34200[      1.65045166] -> 3FA47100[      1.28469849]

3FD72EFE[      1.68112159] -> 3FA5F65C[      1.29658079]
3FD72F00[      1.68112183] -> 3FA5F658[      1.29658031]

3FD99AFE[      1.70004249] -> 3FA6E4C8[      1.30385685]
3FD99B00[      1.70004272] -> 3FA6E4C4[      1.30385637]

3FDBC77E[      1.71702552] -> 3FA7B9A8[      1.31035328]
3FDBC780[      1.71702576] -> 3FA7B9A4[       1.3103528]

3FDDA77E[      1.73167396] -> 3FA8706C[      1.31593084]
3FDDA780[      1.73167419] -> 3FA87068[      1.31593037]

3FDDDDFE[      1.73333716] -> 3FA88520[      1.31656265]
3FDDDE00[       1.7333374] -> 3FA8851C[      1.31656218]

3FDF447E[      1.74427772] -> 3FA90D10[      1.32071114]
3FDF4480[      1.74427795] -> 3FA90D0C[      1.32071066]

3FDFC17E[      1.74809241] -> 3FA93C5C[      1.32215452]
3FDFC180[      1.74809265] -> 3FA93C58[      1.32215405]

3FE353FE[      1.77600074] -> 3FAA94D4[      1.33266687]
3FE35400[      1.77600098] -> 3FAA94D0[       1.3326664]

3FE3BB7E[      1.77915931] -> 3FAABBA4[      1.33385134]
3FE3BB80[      1.77915955] -> 3FAABBA0[      1.33385086]

3FE3F6FE[       1.7809751] -> 3FAAD1F0[      1.33453178]
3FE3F700[      1.78097534] -> 3FAAD1EC[      1.33453131]

3FE5007E[      1.78907752] -> 3FAB354C[      1.33756399]
3FE50080[      1.78907776] -> 3FAB3548[      1.33756351]

3FE531FE[      1.79058814] -> 3FAB47CC[      1.33812857]
3FE53200[      1.79058838] -> 3FAB47C8[      1.33812809]

3FE78DFE[      1.80902076] -> 3FAC28E8[      1.34499836]
3FE78E00[        1.809021] -> 3FAC28E4[      1.34499788]

3FE9C8FE[      1.82644629] -> 3FACFCAC[      1.35146093]
3FE9C900[      1.82644653] -> 3FACFCA8[      1.35146046]

3FE9F0FE[        1.827667] -> 3FAD0B78[       1.3519125]
3FE9F100[      1.82766724] -> 3FAD0B74[      1.35191202]

3FED707E[      1.85499549] -> 3FAE5570[      1.36198235]
3FED7080[      1.85499573] -> 3FAE556C[      1.36198187]

3FED8DFE[      1.85589576] -> 3FAE6044[      1.36231279]
3FED8E00[        1.855896] -> 3FAE6040[      1.36231232]

3FEFAD7E[      1.87248206] -> 3FAF274C[      1.36838675]
3FEFAD80[       1.8724823] -> 3FAF2748[      1.36838627]

3FF18A7E[      1.88703895] -> 3FAFD540[      1.37369537]
3FF18A80[      1.88703918] -> 3FAFD53C[       1.3736949]

3FF19CFE[      1.88760352] -> 3FAFDBFC[      1.37390089]
3FF19D00[      1.88760376] -> 3FAFDBF8[      1.37390041]

3FF1EF7E[      1.89012122] -> 3FAFFA00[      1.37481689]
3FF1EF80[      1.89012146] -> 3FAFF9FC[      1.37481642]

3FF39EFE[      1.90328956] -> 3FB096A8[      1.37959766]
3FF39F00[      1.90328979] -> 3FB096A4[      1.37959719]

3FF3CEFE[       1.9047544] -> 3FB0A80C[      1.38012838]
3FF3CF00[      1.90475464] -> 3FB0A808[      1.38012791]

3FF57AFE[      1.91781592] -> 3FB142D8[      1.38485241]
3FF57B00[      1.91781616] -> 3FB142D4[      1.38485193]

3FF58DFE[      1.91839576] -> 3FB149B4[      1.38506174]
3FF58E00[        1.918396] -> 3FB149B0[      1.38506126]

3FF7E3FE[      1.93664527] -> 3FB22110[      1.39163399]
3FF7E400[      1.93664551] -> 3FB2210C[      1.39163351]

3FF7E9FE[      1.93682837] -> 3FB22338[      1.39169979]
3FF7EA00[      1.93682861] -> 3FB22334[      1.39169931]

3FF9807E[      1.94923377] -> 3FB2B508[      1.39614964]
3FF98080[      1.94923401] -> 3FB2B504[      1.39614916]

3FF9E9FE[      1.95245337] -> 3FB2DACC[      1.39730215]
3FF9EA00[      1.95245361] -> 3FB2DAC8[      1.39730167]

3FF9EF7E[      1.95262122] -> 3FB2DCC4[      1.39736223]
3FF9EF80[      1.95262146] -> 3FB2DCC0[      1.39736176]

3FFB8A7E[      1.96516395] -> 3FB36F98[      1.40184307]
3FFB8A80[      1.96516418] -> 3FB36F94[      1.40184259]

3FFDD0FE[      1.98294044] -> 3FB43EE4[      1.40816927]
3FFDD100[      1.98294067] -> 3FB43EE0[      1.40816879]

3FFDE9FE[      1.98370337] -> 3FB447C4[      1.40844011]
3FFDEA00[      1.98370361] -> 3FB447C0[      1.40843964]

3FFDEF7E[      1.98387122] -> 3FB449B8[      1.40849972]
3FFDEF80[      1.98387146] -> 3FB449B4[      1.40849924]

3FFF93FE[      1.99670386] -> 3FB4DEC0[      1.41304779]
3FFF9400[       1.9967041] -> 3FB4DEBC[      1.41304731]

3FFFE3FE[      1.99914527] -> 3FB4FB0C[      1.41391134]
3FFFE400[      1.99914551] -> 3FB4FB08[      1.41391087]

3FFFF57E[      1.99967933] -> 3FB5013C[      1.41410017]
3FFFF580[      1.99967957] -> 3FB50138[      1.41409969]

This is a whole barrel of fun.

I might as well add, that IEEE754 actually mandates that sqrt is correctly rounded (same as +,-,*,/,fma). PSP does not seem to care, maybe does not consider this instruction real sqrt.

fp64 commented 1 year ago

The answer is never more than 4 away from bitcast<uint32_t>(sqrtf(input))&-4u, regardless of rounding direction of sqrtf (though number of not matching outputs is different: around 1M for FE_TOWARDZERO to 2M for FE_UPWARD, with FE_TONEAREST in between).

All of $2^{21}$ outputs (since bottom 2 bits are 0) in [1;2) are present.

fp64 commented 1 year ago

The current implementation is also within 4ULP of output (with 1.2M mismatches).

fp64 commented 1 year ago

Ok, assuming we do not find anything better, I think sqrt on [1;4) can be compressed into 256KB table (lerp segments, no exceptions).

fp64 commented 1 year ago

Hopefully it just scales the exponent in that way.

Verified on all known data (sans denormals).

Reduction:

float vfpu_sqrt(float x)
{
    uint32_t bits;
    memcpy(&bits,&x,sizeof(bits));
    if((bits&0x7FFFFFFFu)<=0x007FFFFFu)
    {
        // Denormals (and zeroes) get +0, regardless
        // of sign.
        return +0.0f;
    }
    if(bits>>31)
    {
        // Other negatives get NaN.
        bits=0x7F800001u;
        memcpy(&x,&bits,sizeof(x));
        return x;
    }
    if((bits>>23)==255u)
    {
        // Inf/NaN gets Inf/NaN.
        bits=0x7F800000u+((bits&0x007FFFFFu)!=0u);
        memcpy(&x,&bits,sizeof(bits));
        return x;
    }
    int32_t exponent=int32_t(bits>>23)-127;
    // Bottom bit of exponent (inverted) + significand (except bottom bit).
    uint32_t index=((bits+0x00800000u)>>1)&0x007FFFFFu;
    bits=vfpu_sqrt_fixed(index);
    bits+=uint32_t(exponent>>1)<<23;
    memcpy(&x,&bits,sizeof(bits));
    return x;
}
hrydgard commented 1 year ago

That's cool. I considered that one possibility of how it computes sqrt internally might be 1/rsqrt (since rsqrt is actually easier to compute with newton-raphson than sqrt), but seems a bit too good unless the rcp (1/x) approximation is really good too..

fp64 commented 1 year ago

Table-based vfpu_sqrt_fixed (256 KB):

From vfpu_sqrt.cpp ```c++ #include #include #include #include #include "vfpu_sqrt_lut.h" // Integer square root of 2^23*x (rounded to zero). // Input is in 2^23 <= x < 2^25, and representable in float. static inline uint32_t isqrt23(uint32_t x) { #if 0 // Reference version. int dir=fesetround(FE_TOWARDZERO); uint32_t ret=uint32_t(int32_t(sqrtf(float(int32_t(x))*8388608.0f))); fesetround(dir); return ret; #elif 1 // Double version. // Verified to produce correct result for all valid inputs, // in all rounding modes, both in double and double-extended (x87) // precision. // Requires correctly-rounded sqrt (which on any IEEE-754 system // it should be). return uint32_t(int32_t(sqrt(double(x)*8388608.0))); #else // Pure integer version, if you don't like floating point. // Based on code from Hacker's Delight. See isqrt4 in // https://github.com/hcs0/Hackers-Delight/blob/master/isqrt.c.txt // Relatively slow. uint64_t t=uint64_t(x)<<23,m,y,b; m=0x4000000000000000ull; y=0; while(m!=0) // Do 32 times. { b=y|m; y=y>>1; if(t>=b) { t=t-b; y=y|m; } m=m>>2; } return y; #endif } // Returns floating-point bitpattern. static inline uint32_t vfpu_sqrt_fixed(uint32_t x) { // Endpoints of input. uint32_t lo=(x+ 0u)&-64u; uint32_t hi=(x+64u)&-64u; // Convert input to 9.23 fixed-point. lo=(lo>=0x00400000u?4u*lo:0x00800000u+2u*lo); hi=(hi>=0x00400000u?4u*hi:0x00800000u+2u*hi); // Estimate endpoints of output. uint32_t A=0x3F000000u+(isqrt23(lo)); uint32_t B=0x3F000000u+(isqrt23(hi)); // Apply deltas, and increase the working precision. uint64_t a=(uint64_t(A)<<4)+uint64_t(vfpu_sqrt_lut[x>>6][0]); uint64_t b=(uint64_t(B)<<4)+uint64_t(vfpu_sqrt_lut[x>>6][1]); uint32_t ret=uint32_t((a+(((b-a)*(x&63))>>6))>>4); // Truncate lower 2 bits. ret&=-4u; return ret; } ```

vfpu_sqrt.zip