fp64 opened 1 year ago
Cool. So we should have the PSP generate a new table from 0.23 fixpoint, from 0 to 1 would be enough (or even 0 to 0.5).
Then I guess theoretically we can try to compute the first, second and third derivatives of that sequence. If the sequence becomes a set of straight lines or plateaus at some derivative, we'll know we're dealing with polynomial interpolation of a much smaller table, which I think is likely. Though I don't know if the precision is enough for those derivatives to make sense...
Storing the full table will hopefully not be necessary, though by taking the derivative before compressing, the result should be really compressible... (actually that might be nonsense given how sin and cos are each other's derivative, sometimes with flipped signs, but we have to be dealing with an approximation here...)
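The derivative idea can be sketched with plain finite differences (a toy helper of mine, not anything from the PSP dumps): if the table sampled a degree-2 polynomial, the second difference would plateau at a constant and the third would vanish.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Replace adjacent elements by their difference, in place; after d such
// passes the array holds the d-th finite difference of the original
// sequence (valid entries shrink by one per pass).
static void delta_pass(int64_t *v, size_t n)
{
    for(size_t i=0;i+1<n;i++) v[i]=v[i+1]-v[i];
}
```

Running this on a synthetic quadratic table (e.g. `3*i*i+7*i+11`) gives a constant second difference of 6 and an identically zero third difference, which is the signature one would look for in the dumped data.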
> from 0 to 1 would be enough (or even 0 to 0.5).

To be clear,

> The table is dense for x>=0.25

means that all values for [0.25;1] are already known; only [0;0.25] remains.
> Storing the full table will hopefully not be necessary

For one thing, 32MB might be larger than L3 cache.

> though by taking the derivative before compressing, the result should be really compressible...
Tried it:
```
33554436 table.bin
 3430864 table.bin.zip
 4645219 table.bin.zst
33554436 delta1.bin
  171407 delta1.bin.zip
  238125 delta1.bin.zst
33554436 delta2.bin
  175539 delta2.bin.zip
  220832 delta2.bin.zst
```
`delta1` and `delta2` are first and second finite differences. Both zip and zstd were invoked with `-9`.
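For reference, the delta transform used here is exactly invertible in unsigned arithmetic, which is what makes it safe to apply before zipping; a minimal sketch (helper names are mine):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Forward transform: keep v[0], store wrapping first differences after it.
// Unsigned wraparound makes this exactly invertible, so the transformed
// stream can be compressed and later reconstructed bit-for-bit.
static void delta_encode(uint32_t *v, size_t n)
{
    for(size_t i=n;i-->1;) v[i]-=v[i-1];
}

// Inverse transform: running (wrapping) prefix sum.
static void delta_decode(uint32_t *v, size_t n)
{
    for(size_t i=1;i<n;i++) v[i]+=v[i-1];
}
```

Applying `delta_encode` twice gives the second difference (`delta2` above), and two `delta_decode` passes recover the original table exactly.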
Oh yeah, didn't read carefully enough, so a smaller range to dump.
Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P
By the way, 4 way symmetry is good, but 8 way symmetry might be possible with some flipping around of signs and offsetting. Or hm, maybe not...
> Note: result is -0 when x is in [...,-6, -2, -0, +2, +6,...].

Grr, no, it's [...,-8,-4,-0,+2,+6,...]: both sources of signflip (`signbit(x)` and `(x mod 4==2)`) just XOR with each other.
Oh, hm. Maybe I overwrote that file I uploaded before with a smaller subset. It took a lot of time to generate, I thought I saved it... anyway, I had dumped some of it as binary. Let me just regenerate, at least for exponents between 0x60 - 0x7F....
-[Unknown]
https://forums.ppsspp.org/uploads/sin_60-7F.zip https://forums.ppsspp.org/uploads/cos_60-7F.zip
Here's all the exponents from 60-7F, each as a separate file. 60-67 aren't that interesting, just included to represent. The data is simply input, output repeated in 4 byte pairs.
-[Unknown]
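A small sketch of reading that dump format (helper names are mine; assumes the stated layout of little-endian `uint32` (input, output) pairs):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

// Decode one little-endian uint32 from a byte buffer.
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1]<<8)
         | ((uint32_t)p[2]<<16) | ((uint32_t)p[3]<<24);
}

// The dumps are (input, output) pairs of 4-byte LE words; pair i starts
// at byte offset 8*i.
static void load_pair(const uint8_t *buf, size_t i,
                      uint32_t *input, uint32_t *output)
{
    *input  = load_le32(buf + 8*i);
    *output = load_le32(buf + 8*i + 4);
}
```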
> Here's all the exponents from 60-7F, each as a separate file.
Thanks!
> Yeah, that got really compressible - but that it didn't get even more compressible at the second derivative .. well, not sure what it means to be honest :P
Well, after the first delta we already get long repeats of the same value, e.g. `50,50,50,50,50,49,49,49,49,49`. The second delta gets us long repeats of zeroes, e.g. `0,0,0,0,0,-1,0,0,0,0`, which isn't much better for compression, I think (e.g. for LZ the first example is something like `LITERAL(50),COPY(-1,4),LITERAL(49),COPY(-1,4)`, and the second is `LITERAL(0),COPY(-1,4),LITERAL(-1),LITERAL(0),COPY(-1,3)` - so one command more).
Converted the provided files. So, now we have a complete fixed23->fixed28 table for `sin`: sin.bin.zip

Again, just `uint32` outputs (input is index, i.e. offset in file divided by 4), LE, in binary. Haven't yet verified that `cos` uses the same table (reversed).
Again, thanks for actually dumping the data!
The draft implementation of the reduction:
```c
static uint32_t vfpu_sin_table[(1<<23)+1]={
#include "sin.bin.h" // Or #embed, or whatever.
};

static inline uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Table-based for now.
    return vfpu_sin_table[arg];
}

// For |x|<2^32 should produce exactly the
// same value (bitwise) as PSP. For finite
// x>=2^32 produces 0, whereas PSP sometimes
// produces junk (which looks like small
// non-zero values). Hopefully games do not rely
// on this. For inf/nan produces qNaN, but
// with a different payload than PSP (PSP produces
// sNaN).
static inline float vfpu_sin(float x)
{
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return copysignf(x-x,x); // Beware -ffast-math.
    // Reduce to [-2;+2].
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff. Also, -0 remains -0.
    // Flip sign according to half-period.
    if(n&1) x=-x;
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(fabsf(x)*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) arg=(1u<<24)-arg;
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed(arg)))*0x1p-28f*copysignf(1.0f,x);
}

// WARNING: not tested, this is just a guess.
static inline float vfpu_cos(float x)
{
    x=fabsf(x);
    // Handle large/non-finite values.
    if(!(fabsf(x)<0x1p25f)) return 1.0f+(x-x);
    // Reduce to [-2;+2].
    int32_t n=int32_t(0.5f*x); // Rounds to 0.
    x=x-2.0f*float(n); // Exact, no roundoff.
    // Convert to fixed 1.23.
    uint32_t arg=uint32_t(int32_t(x*0x1p23f));
    // Reduce to [0;1] (i.e. [0;2^23]).
    if(arg>=1u<<23) {arg=(1u<<24)-arg; n^=1;}
    // Convert from fixed 1.28, and apply sign.
    return float(int32_t(vfpu_sin_fixed((1u<<23)-arg)))*0x1p-28f*(n&1?-1.0f:+1.0f);
}
```
Edit: minor fixes.
Not ready for deployment, but may be useful as-is to test if it unbreaks some picky games.
I'm using fixed 0.28 since it can represent all output values, but that might not be what is going on internally. As I already mentioned, the 2 lower bits of the bitwise representation of floating point output are always 0, meaning output has at most 22 (binary) significant digits. In other words, the stepping of the output in fixed 0.28 is
```c
uint32_t quantum(uint32_t x)
{
    return x<1u<<22?
           1u:
           1u<<(32-22-count_leading_zeroes(x));
}
```
One idea I had is that we need an approximation (treated as real-valued here) `f(i)` that passes through each region `table[i]<=f(i)<table[i]+quantum(table[i])` (this may be over the whole input range, or piecewise), and then simply truncate it (to the quantum). This is similar to what RLIBM does. This does require a somewhat fancy solver though.
Maybe something simpler suffices.
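The "pass through each region" condition is just a feasibility check on a candidate approximation; a toy sketch with synthetic data (names and numbers are mine, and a real RLIBM-style solver would search for feasible coefficients rather than merely verify one candidate):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

// Does f(i)=a*i+b land in the half-open target region
// [table[i]; table[i]+quantum[i]) of every entry? If yes, truncating
// f(i) reproduces the table exactly.
static bool line_fits(double a, double b,
                      const uint32_t *table, const uint32_t *quantum,
                      size_t n)
{
    for(size_t i=0;i<n;i++)
    {
        double f = a*(double)i + b;
        if(!(f >= table[i] && f < table[i] + (double)quantum[i]))
            return false;
    }
    return true;
}
```

The same check generalizes to quadratic pieces; the solver part is finding coefficients for which it returns true on each piece.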
@unknownbrackets, you mentioned that you don't think this is CORDIC. Can you elaborate?
I don't believe it's CORDIC either, because CORDIC is iterative, so it takes a variable amount of time depending on the angle, and I don't think we've observed that. I would guess it's an interpolated table lookup from a smaller table, but I could be wrong of course.
Hm, got runtime size down to 2MB: table of quadratic coefficients + table of exceptions. It's ugly, but may be workable. Currently searches exceptions via binary search, but it can be improved. In case you want it: vfpu_sin_fixed.zip
This implements `extern uint32_t vfpu_sin_fixed(uint32_t arg)`, which you can plug into the reduction code above to give you the actual `vfpu_sin` and `vfpu_cos`. It is verified to produce exactly the same result as `vfpu_sin_table`. The only advantage is that memory for the lookup table is now down to 2MB from 32MB.
The code itself (so you don't need to download to take a look):
That's cool, nicely done. Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware. Getting to the point where it could be worth it if it unbreaks stuff in games (that Ridge Racer replay, or whatever it was, maybe?).
> Though 2MB is still of course far larger than the real table (or whatever it is) is likely to be in the actual hardware.
Yeah, e.g. this mentions sizes around 10KB.
> Haven't yet verified that cos uses the same table (reversed).
and done.
Made it uglier, but tables fit in 720 KB. Not sure if it's worth it.
Code:
There should be a better way, but well...
Not sure if it's good enough to consider making pull request.
It's also probably somewhat slower than before.
Also, amusingly, the table is not monotonic:
```
TABLE[5242176]==223176256
TABLE[5242177]==223176192
```
In case anyone is curious, the junk for $|x|>2^{32}$ looks like this:
Wow, this is some wacky stuff. Non-monotonicity within the wrapped range is really surprising! Might possibly be some kind of hint to what's really going on, or it's just another indication that we won't be able to find it, because it's so weird :P
I think I had theorized that only some of the exponent bits are respected, causing this pattern. It seemed to mostly work. See:
```c
if (k > 0x80) {
    const uint8_t over = k & 0x1F;
    mantissa = (mantissa << over) & 0x00FFFFFF;
    k = 0x80;
}
```
-[Unknown]
> const uint8_t over = k & 0x1F;
When I tried it before, it didn't seem to work. But I must have botched something, because now it does get mostly correct results:
https://godbolt.org/z/YbzhbshPd
Thanks!
A common way to "linearize" floats between 0 and 1 before doing lookups into a table is to simply add 1.0, after that, the mantissa represents the linear position between 1 and 2 and you can just drop the exponent and use the mantissa from there on as your table lookup index. But this is from a software perspective. That the exponent seems to wrap like that makes me think that there's a table lookup involving that too, or some specialized hardware that effectively does the same as adding 1.0 but cheaper and thus limited in the range of exponents it can process...
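A sketch of the add-1.0 trick described above (assuming IEEE-754 binary32; helper name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Map x in [0,1) to a 23-bit table index by adding 1.0: the sum lies in
// [1,2), where the exponent is constant and the mantissa bits encode
// the linear position within the interval.
static uint32_t index_from_unit_float(float x)
{
    float y = x + 1.0f;          // y in [1,2)
    uint32_t bits;
    memcpy(&bits, &y, sizeof(bits));
    return bits & 0x007FFFFFu;   // drop sign and exponent
}
```

For example, 0.5 maps to mantissa 0x400000 (halfway through the 23-bit range), so the index is linear in x with no per-exponent normalization.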
It just looks like a shift nobody bothered to mask (same as the x86 `shr`/`shl` instructions). It might even be a cyclic shift (so no need to branch on the sign of `exponent-0x7F`) with some additional masking.
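To illustrate the unmasked-shift theory: x86 SHL/SHR use only the low 5 bits of the count for 32-bit operands, while in C a shift by 32 or more is undefined, so an emulator has to spell the mask out. A sketch (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>

// Emulate a 32-bit shifter that ignores all but the low 5 bits of the
// count, like x86 SHL with a 32-bit operand. In C, shifting by >=32 is
// undefined behavior, so the mask must be written explicitly.
static uint32_t shl_mod32(uint32_t v, uint32_t count)
{
    return v << (count & 31u);
}
```

With such a shifter, an exponent excess of 33 behaves exactly like an excess of 1, which is the kind of wraparound seen in the dumped data.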
I think a table that's down to 720 KB is interesting. Not that terrible RAM usage if it fixes things.
Games I know that are or could be affected by sin/cos accuracy:
I wonder how much slower this variant is than what we do now, though... we'd probably end up not wanting to add it everywhere.
-[Unknown]
sin and cos are generally not the most frequently used mathematical operations, since they're normally used for things like setting up factors that are then multiplied with a lot of other things (a typical example being a rotation matrix), so a not-gigantic slowdown in these might not be so bad...
Though, given they are pretty fast on the hardware, maybe some games over-use them for other stuff..
And I agree that 720kb is getting palatable..
Ok, I think this matches all available data:
```c
static float vfpu_sin(float x)
{
    uint32_t bits;
    memcpy(&bits,&x,sizeof(x));
    uint32_t sign=bits&0x80000000u;
    uint32_t exponent=(bits>>23)&0xFFu;
    uint32_t significand=(bits&0x007FFFFFu)|0x00800000u;
    if(exponent==0xFFu)
    {
        // NOTE: this bitpattern is a signaling
        // NaN on x86, so maybe just return
        // a normal qNaN?
        float y;
        bits=sign^0x7F800001u;
        memcpy(&y,&bits,sizeof(y));
        return y;
    }
    if(exponent<0x7Fu)
    {
        if(exponent<0x7Fu-23u) significand=0u;
        else significand>>=(0x7F-exponent);
    }
    else if(exponent>0x7Fu)
    {
        // There is weirdness for large exponents.
        if(exponent-0x7Fu>=25u&&exponent-0x7Fu<32u) significand=0u;
        else if((exponent&0x9Fu)==0x9Fu) significand=0u;
        else significand<<=(exponent-0x7Fu);
    }
    sign^=((significand<<7)&0x80000000u);
    significand&=0x00FFFFFFu;
    if(significand>0x00800000u) significand=0x01000000u-significand;
    uint32_t ret=vfpu_sin_fixed(significand);
    return (sign?-1.0f:+1.0f)*float(int32_t(ret))*3.7252903e-09f; // 0x1p-28f
}
```
Notes:

- With `-mfpmath=387` on 32-bit, sNaN tends to get turned into qNaN.
- `memcpy` is not endian-safe; this needs a proper `float_as_bits` (can be concocted with `frexpf`).

Cool! So the remaining piece is reducing `vfpu_sin_fixed` further if possible.
memcpy is perfectly endian-safe for something like this, as I haven't heard of an architecture that has different endian-ness for its floating point. In addition, we only support little-endian architectures in practice anyway.
Actually, I still do not have the data to test the reduction for `cos` (for larger values). It's easy to make a guess, but confirmation would be nice (e.g. does the sign of the NaN match the input, or `fabsf(input)`?).
Should I make a PR? Not to be merged as-is, but just to have an artifact to test things with? If so:

- Where should `vfpu_sin_lut.h` go (to be `#include`'d in MIPSVFPUUtils.cpp only)?
- Does it need to be added to `CMakeLists.txt` somewhere?

Speed, as measured on my machine:

| Method | Time |
|---|---|
| `sinf(x*float(M_PI_2))` | 1.0x |
| `(float)sin(x*M_PI_2)` | 1.6x |
| 32MB table, lin. access | 1.83x |
| 32MB table, rnd. access | 3.6x |
| 720KB table, `vfpu_sin_fixed` | 12.0x |
Aside: pedantry compels me to use 'significand' instead of 'mantissa', but that's just me.
Does not seem to fix #2990 (the behavior is still the same as the 1st video). This much was already suspected.
Incidentally, when I got the sign of cos wrong (didn't do the signflip for [1;2)), the camera started behaving weirdly, but the path of the car itself seemed the same as before. Maybe sin/cos aren't used, or maybe angles are always in [-1;+1].
Made it simpler (linear interpolation), smaller (combined LUTs add up to 484256 bytes), and faster (2.5x slower than `sinf(C*x)`).
```c
#include "stdint.h"
#include "vfpu_sin_lut.h"

static inline uint32_t quantum(uint32_t x)
{
    //return x<1u<<22?
    //       1u:
    //       1u<<(32-22-__builtin_clz(x));
    int i=0;
    while((x>>i)>=0x00400000) ++i;
    return uint32_t(1)<<i;
}

static inline uint32_t truncate_bits(uint32_t x)
{
    return x&-quantum(x);
}

uint32_t vfpu_sin_fixed(uint32_t arg)
{
    // Handle endpoints.
    if(arg==0u) return 0u;
    if(arg==0x00800000) return 0x10000000;
    // Get endpoints for 8192-wide interval.
    uint32_t L=vfpu_sin_lut8192[(arg>>13)+0];
    uint32_t H=vfpu_sin_lut8192[(arg>>13)+1];
    // Approximate endpoints for 64-wide interval via lerp.
    uint32_t A=L+(((H-L)*(((arg>>6)&127)+0))>>7);
    uint32_t B=L+(((H-L)*(((arg>>6)&127)+1))>>7);
    // Adjust endpoints from deltas, and increase working precision.
    uint64_t a=(uint64_t(A)<<5)+uint64_t(vfpu_sin_lut_delta[arg>>6][0])*quantum(A);
    uint64_t b=(uint64_t(B)<<5)+uint64_t(vfpu_sin_lut_delta[arg>>6][1])*quantum(B);
    // Compute approximation via lerp. Is off by at most 1 quantum.
    uint32_t v=uint32_t(((a*(64-(arg&63))+b*(arg&63))>>6)>>5);
    v=truncate_bits(v);
    // Look up exceptions via binary search.
    // Note: vfpu_sin_lut_interval_delta stores
    // deltas from interval estimation.
    uint32_t lo=((169u*((arg>>7)+0))>>7)+uint32_t(vfpu_sin_lut_interval_delta[(arg>>7)+0])+16384u;
    uint32_t hi=((169u*((arg>>7)+1))>>7)+uint32_t(vfpu_sin_lut_interval_delta[(arg>>7)+1])+16384u;
    while(lo<hi)
    {
        uint32_t m=(lo+hi)/2;
        // Note: vfpu_sin_lut_exceptions stores
        // index&127 (for each initial interval the
        // upper bits of index are the same, namely
        // arg&-128), plus direction (0 for +1, and
        // 128 for -1).
        uint32_t b=vfpu_sin_lut_exceptions[m];
        uint32_t e=(arg&-128u)+(b&127u);
        if(e==arg)
        {
            v+=quantum(v)*(b>>7?-1u:+1u);
            break;
        }
        else if(e<arg) lo=m+1;
        else hi=m;
    }
    return v;
}
```
I think I'm generally ok with this implementation.
Just to collect some issues that mention sin/cos:
If anyone wants to just test if this fixes something, there are some artifacts in my fork: https://github.com/fp64/ppsspp/actions/runs/4211662398#artifacts
Thanks! Any observable slowdown?
Note: I simply changed the implementation in one file. Just to be sure: currently everything gets the same sin/cos functions, right? No compat setting (`DoublePrecisionSinCos` introduced in #13706 is removed), no jit/interpreter differences, all is routed to `vfpu_sin`, `vfpu_cos`, `vfpu_sincos`?
I am using an AMD FX 8350 CPU, and there is no observable slowdown.
I think 2.5x is probably a best case and won't cause much slowdown. I recall that Crisis Core uses vrot (to calculate sin/cos) a good bit and was sped up when we routed it through jit, so that might be a good test case. Probably it will be unnoticeable on any desktop, though.
The `// Copyright (c) 2012- PPSSPP Project.` in the include file should probably be 2023. I think this would be good to include, even if there's no game we're (currently) certain it fixes.
I'm pretty sure everything should already be using the vfpu_{sin,cos,sincos} paths. Even jit just calls that, so it should all be through that.
I'll generate some data for cos() higher values, including inf/nan.
How interested would you be in trying to figure out more things like vrsq, vsqrt (both of which could, potentially, help Ridge Racer), vrcp, or vasin? I guess we should also look at Dissidia and see what instructions could have an impact there. I am worried that it's rounding, though, which would be a pain...
-[Unknown]
Here's a smattering of additional cos values: cos_others.zip
As we know the universe demands at least a modicum of symmetry, here's sin for the same: sin_others.zip
-[Unknown]
Thanks!
> The `// Copyright (c) 2012- PPSSPP Project.` in the include file should probably be 2023.
I see, ok.
> cos_others.zip

Oh, yes. `vfpu_cos(1.0f)` is `-0.0f`. Because of course it is. I think this was in the previously available data, too. Thankfully, the fix is trivial. After the fix, all passes, except sNaNs turning into qNaNs.

> sin_others.zip

All passes (again, ignoring the NaN thing).
> How interested would you be in trying to figure out more things like vrsq, vsqrt (both of which could, potentially, help Ridge Racer), vrcp, or vasin?

Yes, I'm interested to take a look, as long as the data is provided.

> Yes, I'm interested to take a look, as long as the data is provided.
Ask, and you shall receive. Here's the same inputs for asin (organized the same way to avoid confusion): https://forums.ppsspp.org/uploads/asin_60-7F.zip https://forums.ppsspp.org/uploads/asin_others.zip
Also, here's just 7F and "others" if you want to look at any of exp2/log2/rexp2/rsq/sqrt (obviously these all have a larger domain): https://forums.ppsspp.org/uploads/exp_log_sqrt_7F_others.zip
I looked at sqrt/rsq a bit ago but haven't really checked the others. We might be fine on exp2/log2/etc. or maybe there are weird differences.
Our current asin is `asinf(x) / M_PI_2`.
-[Unknown]
Thanks, downloaded.
"Reasonable" interval for sqrt/rsqrt is [1;4), since $\sqrt{4 x}=2 \sqrt{x}$, so that's our reduction (assuming HW isn't being extra weird).
For rsqrt, it could be interesting (for inspiration of things to try) to look at how Dolphin implements rsqrte:

> The PowerPC only has two approximate math functions though, frsqrte and frcp.
Right, that's fair. Hopefully it just scales the exponent in that way.
For good measure, I included from 0x60 - 0x81: https://forums.ppsspp.org/uploads/sqrt_60-81.zip https://forums.ppsspp.org/uploads/sqrt_others.zip
https://forums.ppsspp.org/uploads/rsqrt_60-81.zip https://forums.ppsspp.org/uploads/rsqrt_others.zip
-[Unknown]
Looking at sqrt on [1;4): it appears to be `sqrtf` in round-to-nearest mode with the 2 lowest bits zeroed, apart from these mismatches:

3F8920FE[ 1.07131934] -> 3F847C60[ 1.03504562]
3F892100[ 1.07131958] -> 3F847C5C[ 1.03504515]
3F8F6BFE[ 1.12048316] -> 3F877DE0[ 1.0585289]
3F8F6C00[ 1.1204834] -> 3F877DDC[ 1.05852842]
3F93F37E[ 1.15586829] -> 3F899D50[ 1.0751133]
3F93F380[ 1.15586853] -> 3F899D4C[ 1.07511282]
3F9DB8FE[ 1.23220801] -> 3F8E1614[ 1.11004877]
3F9DB900[ 1.23220825] -> 3F8E1610[ 1.11004829]
3F9F4B7E[ 1.24449134] -> 3F8ECAEC[ 1.11556768]
3F9F4B80[ 1.24449158] -> 3F8ECAE8[ 1.11556721]
3FA1ACFE[ 1.2630918] -> 3F8FDB18[ 1.12387371]
3FA1AD00[ 1.26309204] -> 3F8FDB14[ 1.12387323]
3FA3C0FE[ 1.27932715] -> 3F90C704[ 1.13107347]
3FA3C100[ 1.27932739] -> 3F90C700[ 1.131073]
3FA3E77E[ 1.28050208] -> 3F90D808[ 1.13159275]
3FA3E780[ 1.28050232] -> 3F90D804[ 1.13159227]
3FA3EAFE[ 1.28060889] -> 3F90D994[ 1.13163996]
3FA3EB00[ 1.28060913] -> 3F90D990[ 1.13163948]
3FA5B57E[ 1.2946012] -> 3F91A39C[ 1.13780546]
3FA5B580[ 1.29460144] -> 3F91A398[ 1.13780499]
3FAD447E[ 1.35365272] -> 3F94EC74[ 1.16346598]
3FAD4480[ 1.35365295] -> 3F94EC70[ 1.1634655]
3FAD587E[ 1.35426307] -> 3F94F50C[ 1.16372824]
3FAD5880[ 1.35426331] -> 3F94F508[ 1.16372776]
3FADCD7E[ 1.35783362] -> 3F952748[ 1.16526127]
3FADCD80[ 1.35783386] -> 3F952744[ 1.16526079]
3FADD87E[ 1.35816932] -> 3F952C00[ 1.16540527]
3FADD880[ 1.35816956] -> 3F952BFC[ 1.1654048]
3FAFD67E[ 1.37373328] -> 3F960630[ 1.17206383]
3FAFD680[ 1.37373352] -> 3F96062C[ 1.17206335]
3FAFF2FE[ 1.37460303] -> 3F961258[ 1.17243481]
3FAFF300[ 1.37460327] -> 3F961254[ 1.17243433]
3FB38FFE[ 1.40283179] -> 3F979AD0[ 1.184412]
3FB39000[ 1.40283203] -> 3F979ACC[ 1.18441153]
3FB3B57E[ 1.4039762] -> 3F97AAA4[ 1.18489504]
3FB3B580[ 1.40397644] -> 3F97AAA0[ 1.18489456]
3FB7A67E[ 1.43476844] -> 3F99521C[ 1.19781828]
3FB7A680[ 1.43476868] -> 3F995218[ 1.1978178]
3FB7D7FE[ 1.43627906] -> 3F9966C4[ 1.19844866]
3FB7D800[ 1.4362793] -> 3F9966C0[ 1.19844818]
3FB949FE[ 1.44757056] -> 3F9A00D4[ 1.20315027]
3FB94A00[ 1.4475708] -> 3F9A00D0[ 1.2031498]
3FB9CCFE[ 1.45156837] -> 3F9A373C[ 1.20481062]
3FB9CD00[ 1.4515686] -> 3F9A3738[ 1.20481014]
3FBB007E[ 1.46095252] -> 3F9AB6A4[ 1.20869875]
3FBB0080[ 1.46095276] -> 3F9AB6A0[ 1.20869827]
3FBD2A7E[ 1.47785926] -> 3F9B9B28[ 1.21567249]
3FBD2A80[ 1.4778595] -> 3F9B9B24[ 1.21567202]
3FBDBEFE[ 1.48239112] -> 3F9BD830[ 1.21753502]
3FBDBF00[ 1.48239136] -> 3F9BD82C[ 1.21753454]
3FBDF57E[ 1.48405433] -> 3F9BEE90[ 1.21821785]
3FBDF580[ 1.48405457] -> 3F9BEE8C[ 1.21821737]
3FC1CDFE[ 1.51409888] -> 3F9D809C[ 1.23048735]
3FC1CE00[ 1.51409912] -> 3F9D8098[ 1.23048687]
3FC1D27E[ 1.51423621] -> 3F9D8270[ 1.23054314]
3FC1D280[ 1.51423645] -> 3F9D826C[ 1.23054266]
3FC3B57E[ 1.5289762] -> 3F9E4638[ 1.23651791]
3FC3B580[ 1.52897644] -> 3F9E4634[ 1.23651743]
3FC5A9FE[ 1.54425025] -> 3F9F1018[ 1.24267864]
3FC5AA00[ 1.54425049] -> 3F9F1014[ 1.24267817]
3FC5C97E[ 1.54521155] -> 3F9F1CC4[ 1.24306536]
3FC5C980[ 1.54521179] -> 3F9F1CC0[ 1.24306488]
3FC73DFE[ 1.55657935] -> 3F9FB254[ 1.24762964]
3FC73E00[ 1.55657959] -> 3F9FB250[ 1.24762917]
3FC7D27E[ 1.56111121] -> 3F9FEDCC[ 1.24944448]
3FC7D280[ 1.56111145] -> 3F9FEDC8[ 1.24944401]
3FC7E47E[ 1.56166053] -> 3F9FF500[ 1.24966431]
3FC7E480[ 1.56166077] -> 3F9FF4FC[ 1.24966383]
3FC97BFE[ 1.57409644] -> 3FA097B8[ 1.25463009]
3FC97C00[ 1.57409668] -> 3FA097B4[ 1.25462961]
3FCB007E[ 1.58595252] -> 3FA13240[ 1.25934601]
3FCB0080[ 1.58595276] -> 3FA1323C[ 1.25934553]
3FCBAF7E[ 1.5912931] -> 3FA177AC[ 1.2614646]
3FCBAF80[ 1.59129333] -> 3FA177A8[ 1.26146412]
3FCBEF7E[ 1.59324622] -> 3FA19108[ 1.2622385]
3FCBEF80[ 1.59324646] -> 3FA19104[ 1.26223803]
3FCD407E[ 1.60353065] -> 3FA21650[ 1.26630592]
3FCD4080[ 1.60353088] -> 3FA2164C[ 1.26630545]
3FCDA97E[ 1.60673499] -> 3FA23FC0[ 1.2675705]
3FCDA980[ 1.60673523] -> 3FA23FBC[ 1.26757002]
3FCDF7FE[ 1.60913062] -> 3FA25EB4[ 1.26851511]
3FCDF800[ 1.60913086] -> 3FA25EB0[ 1.26851463]
3FD1A1FE[ 1.63775611] -> 3FA3CECC[ 1.27974844]
3FD1A200[ 1.63775635] -> 3FA3CEC8[ 1.27974796]
3FD1D0FE[ 1.63919044] -> 3FA3E128[ 1.28030872]
3FD1D100[ 1.63919067] -> 3FA3E124[ 1.28030825]
3FD3007E[ 1.64845252] -> 3FA45784[ 1.28392076]
3FD30080[ 1.64845276] -> 3FA45780[ 1.28392029]
3FD341FE[ 1.65045142] -> 3FA47104[ 1.28469896]
3FD34200[ 1.65045166] -> 3FA47100[ 1.28469849]
3FD72EFE[ 1.68112159] -> 3FA5F65C[ 1.29658079]
3FD72F00[ 1.68112183] -> 3FA5F658[ 1.29658031]
3FD99AFE[ 1.70004249] -> 3FA6E4C8[ 1.30385685]
3FD99B00[ 1.70004272] -> 3FA6E4C4[ 1.30385637]
3FDBC77E[ 1.71702552] -> 3FA7B9A8[ 1.31035328]
3FDBC780[ 1.71702576] -> 3FA7B9A4[ 1.3103528]
3FDDA77E[ 1.73167396] -> 3FA8706C[ 1.31593084]
3FDDA780[ 1.73167419] -> 3FA87068[ 1.31593037]
3FDDDDFE[ 1.73333716] -> 3FA88520[ 1.31656265]
3FDDDE00[ 1.7333374] -> 3FA8851C[ 1.31656218]
3FDF447E[ 1.74427772] -> 3FA90D10[ 1.32071114]
3FDF4480[ 1.74427795] -> 3FA90D0C[ 1.32071066]
3FDFC17E[ 1.74809241] -> 3FA93C5C[ 1.32215452]
3FDFC180[ 1.74809265] -> 3FA93C58[ 1.32215405]
3FE353FE[ 1.77600074] -> 3FAA94D4[ 1.33266687]
3FE35400[ 1.77600098] -> 3FAA94D0[ 1.3326664]
3FE3BB7E[ 1.77915931] -> 3FAABBA4[ 1.33385134]
3FE3BB80[ 1.77915955] -> 3FAABBA0[ 1.33385086]
3FE3F6FE[ 1.7809751] -> 3FAAD1F0[ 1.33453178]
3FE3F700[ 1.78097534] -> 3FAAD1EC[ 1.33453131]
3FE5007E[ 1.78907752] -> 3FAB354C[ 1.33756399]
3FE50080[ 1.78907776] -> 3FAB3548[ 1.33756351]
3FE531FE[ 1.79058814] -> 3FAB47CC[ 1.33812857]
3FE53200[ 1.79058838] -> 3FAB47C8[ 1.33812809]
3FE78DFE[ 1.80902076] -> 3FAC28E8[ 1.34499836]
3FE78E00[ 1.809021] -> 3FAC28E4[ 1.34499788]
3FE9C8FE[ 1.82644629] -> 3FACFCAC[ 1.35146093]
3FE9C900[ 1.82644653] -> 3FACFCA8[ 1.35146046]
3FE9F0FE[ 1.827667] -> 3FAD0B78[ 1.3519125]
3FE9F100[ 1.82766724] -> 3FAD0B74[ 1.35191202]
3FED707E[ 1.85499549] -> 3FAE5570[ 1.36198235]
3FED7080[ 1.85499573] -> 3FAE556C[ 1.36198187]
3FED8DFE[ 1.85589576] -> 3FAE6044[ 1.36231279]
3FED8E00[ 1.855896] -> 3FAE6040[ 1.36231232]
3FEFAD7E[ 1.87248206] -> 3FAF274C[ 1.36838675]
3FEFAD80[ 1.8724823] -> 3FAF2748[ 1.36838627]
3FF18A7E[ 1.88703895] -> 3FAFD540[ 1.37369537]
3FF18A80[ 1.88703918] -> 3FAFD53C[ 1.3736949]
3FF19CFE[ 1.88760352] -> 3FAFDBFC[ 1.37390089]
3FF19D00[ 1.88760376] -> 3FAFDBF8[ 1.37390041]
3FF1EF7E[ 1.89012122] -> 3FAFFA00[ 1.37481689]
3FF1EF80[ 1.89012146] -> 3FAFF9FC[ 1.37481642]
3FF39EFE[ 1.90328956] -> 3FB096A8[ 1.37959766]
3FF39F00[ 1.90328979] -> 3FB096A4[ 1.37959719]
3FF3CEFE[ 1.9047544] -> 3FB0A80C[ 1.38012838]
3FF3CF00[ 1.90475464] -> 3FB0A808[ 1.38012791]
3FF57AFE[ 1.91781592] -> 3FB142D8[ 1.38485241]
3FF57B00[ 1.91781616] -> 3FB142D4[ 1.38485193]
3FF58DFE[ 1.91839576] -> 3FB149B4[ 1.38506174]
3FF58E00[ 1.918396] -> 3FB149B0[ 1.38506126]
3FF7E3FE[ 1.93664527] -> 3FB22110[ 1.39163399]
3FF7E400[ 1.93664551] -> 3FB2210C[ 1.39163351]
3FF7E9FE[ 1.93682837] -> 3FB22338[ 1.39169979]
3FF7EA00[ 1.93682861] -> 3FB22334[ 1.39169931]
3FF9807E[ 1.94923377] -> 3FB2B508[ 1.39614964]
3FF98080[ 1.94923401] -> 3FB2B504[ 1.39614916]
3FF9E9FE[ 1.95245337] -> 3FB2DACC[ 1.39730215]
3FF9EA00[ 1.95245361] -> 3FB2DAC8[ 1.39730167]
3FF9EF7E[ 1.95262122] -> 3FB2DCC4[ 1.39736223]
3FF9EF80[ 1.95262146] -> 3FB2DCC0[ 1.39736176]
3FFB8A7E[ 1.96516395] -> 3FB36F98[ 1.40184307]
3FFB8A80[ 1.96516418] -> 3FB36F94[ 1.40184259]
3FFDD0FE[ 1.98294044] -> 3FB43EE4[ 1.40816927]
3FFDD100[ 1.98294067] -> 3FB43EE0[ 1.40816879]
3FFDE9FE[ 1.98370337] -> 3FB447C4[ 1.40844011]
3FFDEA00[ 1.98370361] -> 3FB447C0[ 1.40843964]
3FFDEF7E[ 1.98387122] -> 3FB449B8[ 1.40849972]
3FFDEF80[ 1.98387146] -> 3FB449B4[ 1.40849924]
3FFF93FE[ 1.99670386] -> 3FB4DEC0[ 1.41304779]
3FFF9400[ 1.9967041] -> 3FB4DEBC[ 1.41304731]
3FFFE3FE[ 1.99914527] -> 3FB4FB0C[ 1.41391134]
3FFFE400[ 1.99914551] -> 3FB4FB08[ 1.41391087]
3FFFF57E[ 1.99967933] -> 3FB5013C[ 1.41410017]
3FFFF580[ 1.99967957] -> 3FB50138[ 1.41409969]
This is a whole barrel of fun.
I might as well add that IEEE 754 actually mandates that `sqrt` be correctly rounded (same as `+`, `-`, `*`, `/`, `fma`). The PSP does not seem to care; maybe it does not consider this instruction a real sqrt.

The answer is never more than 4 away from `bitcast<uint32_t>(sqrtf(input))&-4u`, regardless of the rounding direction of `sqrtf` (though the number of non-matching outputs differs: around 1M for `FE_TOWARDZERO` to 2M for `FE_UPWARD`, with `FE_TONEAREST` in between).
All of $2^{21}$ outputs (since bottom 2 bits are 0) in [1;2) are present.
The current implementation is also within 4ULP of output (with 1.2M mismatches).
Ok, assuming we do not find anything better, I think sqrt on [1;4) can be compressed into 256KB table (lerp segments, no exceptions).
> Hopefully it just scales the exponent in that way.

Verified on all known data (sans denormals).
Reduction:
```c
float vfpu_sqrt(float x)
{
    uint32_t bits;
    memcpy(&bits,&x,sizeof(bits));
    if((bits&0x7FFFFFFFu)<=0x007FFFFFu)
    {
        // Denormals (and zeroes) get +0, regardless
        // of sign.
        return +0.0f;
    }
    if(bits>>31)
    {
        // Other negatives get NaN.
        bits=0x7F800001u;
        memcpy(&x,&bits,sizeof(x));
        return x;
    }
    if((bits>>23)==255u)
    {
        // Inf/NaN gets Inf/NaN.
        bits=0x7F800000u+((bits&0x007FFFFFu)!=0u);
        memcpy(&x,&bits,sizeof(bits));
        return x;
    }
    int32_t exponent=int32_t(bits>>23)-127;
    // Bottom bit of exponent (inverted) + significand (except bottom bit).
    uint32_t index=((bits+0x00800000u)>>1)&0x007FFFFFu;
    bits=vfpu_sqrt_fixed(index);
    bits+=uint32_t(exponent>>1)<<23;
    memcpy(&x,&bits,sizeof(bits));
    return x;
}
```
That's cool. I considered that one possibility of how it computes sqrt internally might be 1/rsqrt (since rsqrt is actually easier to compute with newton-raphson than sqrt), but seems a bit too good unless the rcp (1/x) approximation is really good too..
Table-based `vfpu_sqrt_fixed` (256 KB):
What should happen
This is a work-in-progress/research topic, I just thought I'd put it there for people to see.
This concerns the implementation of `y=vfpu_sin(x)`, which computes an approximation of $\sin(\frac{\pi}{2} x)$.

Technical: I'm sometimes using hex-float notation for clarity.
Looking at the table the following things appear to be true:

- For large inputs the output is junk, despite `x` always being an even integer in this case. How does one even get this wrong? Then again, on x87 the `fsin` instruction returns `x` for $|x|>2^{64}$. Hopefully, games do not rely on this.
- For `inf`/`nan` the output is a specific `nan` (`0x7F800001` or `0xFF800001`, with the same sign bit as `x`). Note, that this appears to be a signaling nan, though apparently some processors treat it backwards.
- The bit pattern of `y` always has the two lowest bits as 0 (the next one may be 0 or 1).
- Symmetry allows reducing `x` to $[0;1]$. Note: result is `-0` when `x` is in [...,-6, -2, -0, +2, +6,...].
- The input is effectively truncated: `trunc(x*0x1p+23f)` - i.e. input is effectively `??:23` fixed point.
- `y*0x1p+28f` is integer.

The above two points mean that the problem is reduced to an essentially integer mapping `a -> b`, where `a=int(0x1p+23f*x)` and `b=int(0x1p+28f*y)`. This gives us a more manageable table of $2^{23}+1$ entries. Attached below: table.bin.zip The contents (in binary) are raw values of `b` as 32-bit LE integers (i.e. the 4 bytes at offset `4*a` are the value of `b` that corresponds to `a`). Where the value is unknown (absent in the original table) `0xFFFFFFFF` is used (not possible to mistake for a real output, since it is larger than $2^{28}$). The table is dense for `x>=0.25`, but has only 60 values below:
```c
// a, b
{ 0, 0}, { 1, 50}, { 2, 100}, { 3, 150},
{ 4, 201}, { 5, 251}, { 6, 301}, { 7, 351},
{ 8, 402}, { 9, 452}, { 10, 502}, { 11, 552},
{ 12, 603}, { 13, 653}, { 14, 703}, { 15, 753},
{ 16, 804}, { 32, 1608}, { 63, 3166}, { 64, 3216},
{ 129, 6484}, { 255, 12817}, { 256, 12867}, { 258, 12968},
{ 516, 25937}, { 1023, 51422}, { 1024, 51472}, { 1033, 51924},
{ 2066, 103849}, { 4095, 205839}, { 4096, 205889}, { 4132, 207699},
{ 4608, 231625}, { 8264, 415397}, { 16383, 823501}, { 16384, 823552},
{ 16529, 830840}, { 22528, 1132380}, { 33059, 1661715}, { 65535, 3294065},
{ 65536, 3294116}, { 66118, 3323368}, { 132237, 6646276}, { 149130, 7495120},
{ 262143, 13171452}, { 262144, 13171504}, { 264474, 13288480}, { 298260, 14984388},
{ 528948, 26544376}, { 596520, 29922048}, { 1048575, 52369104}, { 1048576, 52369152},
{ 1057896, 52828544}, { 1168434, 58264416}, { 1193040, 59471152}, { 1310720, 65224496},
{ 1435360, 71283488}, { 1572864, 77922688}, { 1714176, 84691872}, { 1917732, 94337248},
```

Note, that for $|x| < 10^{-3}$ we have $|\sin(\frac{\pi}{2} x)-(\frac{\pi}{2} x)| < 2^{-28}$.
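The small-angle bound can be checked against the sparse entries above: for tiny `a`, the table is just the truncated linear term, $b = \lfloor \frac{\pi}{2} \cdot a \cdot 2^5 \rfloor$, since $2^{28}\sin(\frac{\pi}{2} \cdot a/2^{23}) \approx \frac{\pi}{2} \cdot a \cdot 2^5$. A quick sketch (helper name is mine):

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

// For small table indices a, predict b = floor((pi/2) * a * 2^5),
// the fixed-point small-angle approximation of
// 2^28 * sin((pi/2) * a / 2^23).
static uint32_t small_angle_b(uint32_t a)
{
    return (uint32_t)floor((3.14159265358979323846/2.0) * (double)a * 32.0);
}
```

This reproduces the leading entries of the table exactly (e.g. `a=1 -> 50`, `a=16 -> 804`, `a=63 -> 3166`), consistent with the bound stated above.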
Assuming we can satisfactorily patch the blank spots in the table, this may turn into a compression problem, i.e. even without knowing what HW actually does we just need a way to recreate the exact table. It is, in principle, possible to bite the bullet and just (optionally) ship the table with PPSSPP (not that I actually advocate that) - it is 4MB zipped, and 32MB in mem, which while not nice, might be tolerable.
Please do take this with a grain of salt, it is possible I missed (or messed up) something.
Who would this benefit
If this does indeed improve emulation accuracy, it may make more games correctly playable, to benefit of everyone concerned.
Platform (if relevant)
None
Games this would be useful in
Games that rely on specific values produced by vfpu, not sure which
Other emulators or software with a similar feature
No response