Open fp64 opened 1 year ago
Aside, I don't have statistics on how common tiny triangles are, or what percentage of 2x2 quads are full.
Collected some data on that. For "Soulcalibur: Broken Destiny" in a 3D scene, the mask distribution is:
140271560 [0 0000] <-- full
2274034 [1 0001] +
1965309 [2 0010] | 3 pixels
1971403 [4 0100] |
2274954 [8 1000] +
13912078 [3 0011] +
8473968 [5 0101] |
195576 [6 0110] | 2 pixels
209139 [9 1001] |
7537161 [A 1010] |
13587213 [C 1100] +
4959001 [7 0111] +
4246164 [B 1011] | 1 pixel
4298595 [D 1101] |
4361177 [E 1110] +
57898124 [F 1111] <-- rejected
4 pixels 140271560
3 pixels 8485700
2 pixels 43915135
1 pixel 17864937
0 pixels 57898124
Total: 268435456 quads
Where would the 0 pixel quads come from? Do you count alpha and depth testing there?
No, I just count from the mask here:
So, 0 pixel quads are rejected right away and are pretty cheap.
As for where they come from - if you just walk the entire AABB (as some simple half-space rasterizers do) you'll get a lot of 0 pixel quads (a triangle covers 11/36 of its AABB on average, and no more than half). Here NarrowMinMaxX does trim it down, but it looks like it is somewhat conservative.
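To make the "walk the entire AABB" case concrete, here is a minimal sketch (not PPSSPP's actual loop) of a half-space rasterizer visiting 2x2 quads and building the 4-bit mask counted above (bit set = pixel rejected):

struct Edge { int a, b, c; };  // edge function a*x + b*y + c

static inline bool Inside(const Edge &e, int x, int y) {
    return e.a * x + e.b * y + e.c >= 0;
}

void WalkAABB(const Edge edges[3], int minX, int minY, int maxX, int maxY) {
    for (int y = minY; y <= maxY; y += 2) {
        for (int x = minX; x <= maxX; x += 2) {
            int mask = 0;
            for (int i = 0; i < 4; i++) {
                int px = x + (i & 1), py = y + (i >> 1);
                bool in = Inside(edges[0], px, py) && Inside(edges[1], px, py) && Inside(edges[2], px, py);
                if (!in)
                    mask |= 1 << i;
            }
            if (mask == 0xF)
                continue;  // 0-pixel quad: cheap to reject, but still visited.
            // ProcessQuad(x, y, mask);  // hypothetical per-quad work
        }
    }
}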
Ah right, that makes sense.
BTW, Unknown has done quite a bit of exploration of the raster-in-quads idea in the past.
A more obvious example is tiny triangles - a triangle so small that it does not actually cover any sampling points still necessitates testing at least 1 quad. E.g. a character's fingers probably average less than 1 pixel per triangle.
I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.
Actually, I did write out quads, but it introduced a lot of complexity and was often neutral, sometimes better, and sometimes worse for overall performance. The changes can be seen here and are mostly dedicated to jit: https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:ppsspp:softjit-vec (there's probably some things that need cleanup on this branch, but I was hoping for more from this initial go...)
As you mention a lot of time is spent in the barycentric stuff, and so I moved instead to improving parallelism and trying to identify safe cases of rectangles or other things that could calculate simpler and more efficiently (previously a lot of time was wasted on things like full screen triangle strips.)
And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX
Even with AVX2 it's definitely complicated when you think about most textures being CLUT, so you end up in a situation where you're doing a gather from a gather. I have not really attacked that.
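For reference, the "gather from a gather" shape with AVX2 looks roughly like this (a sketch, not PPSSPP code; assumes a CLUT8 texture with a 32-bit palette and per-pixel byte offsets already computed):

#include <immintrin.h>
#include <cstdint>

static inline __m256i SampleCLUT8_AVX2(const uint8_t *texels, const uint32_t *clut, __m256i byteOffsets) {
    // First gather: fetch the bytes holding the palette indices (each lane over-reads
    // 3 bytes past the index, so the texture needs a small tail of padding).
    __m256i raw = _mm256_i32gather_epi32(reinterpret_cast<const int *>(texels), byteOffsets, 1);
    __m256i idx = _mm256_and_si256(raw, _mm256_set1_epi32(0xFF));
    // Second gather: look the indices up in the palette (scale 4 for 32-bit entries).
    return _mm256_i32gather_epi32(reinterpret_cast<const int *>(clut), idx, 4);
}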
A simpler (maybe) thing I've thought about is pre-caching certain textures based on heuristics. We end up with some annoying wasteful cases where we burn a lot of time in the sampler:
The PSP itself had a texture cache (which would be difficult to accurately model for sure), so there might be something to pre-caching and flushing that cache from texflush or texsync GE commands (I forget which, one seems to be related to block transfer actually iirc.) That said, it might make more sense to keep it thread-local to maintain high parallelism. And I'm not certain pre-caching would even be worth it... maybe CPU caches make these cases as fast as I could hope for anyway. I'm pretty sure doing it every draw would not be worth it.
only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize
Well, I've got an 8700 CPU (which I consider somewhat high end) and my goal so far has just been to move from measuring "seconds per frame" to making most games largely playable on my CPU. But I had someone try on an M1 and it ran quite well, I think actually as well or better than on my 8700. It looked like clang was vectorizing decently. As with Intel though, I've put nearly zero thought into 32-bit and only even looked at arm64.
Vec4<float> wsum_recip = EdgeRecip(w0, w1, w2);
This might just be cargo culting - I think you're right that the sum should be constant. That said, I recommend measuring any such changes both in-game and with frame dumps, because sometimes changes have subtle and unexpected impacts...
Here NarrowMinMaxX does trim it down, but it looks like it is somewhat conservative. E.g. a character's fingers probably average less than 1 pixel per triangle.
There's also things like scissor mask and bias that impact that a bit, but you get into an economics problem trying to account for it all, which can negatively impact a game that is in the playable range for a very small benefit to a game that is either already comfortably playable or not really playable anyway. It's probably worth more experimentation, but there's a bunch of tradeoffs for sure.
One reason for the two pixels is that things are intentionally aligned to a 2x2 grid for parallelism reasons. So you can end up with an unfortunate "upper half" and "lower half" but it's worth it overall to make sure you don't risk overlap between threads running different areas of the screen.
That's another area where there could be a lot of gains, but I've struggled to find better heuristics than the current as it's a tricky case of tradeoffs as well. Right now we often end up with say 60% of threads taking on 85% of the work, and if we could even it out more it'd lead to better performance (on multi-core processors.)
-[Unknown]
Also worth noting: the barycentric triangle slice thing is currently the least accurate part of the pipeline. Blending, sampling/filtering, tests, etc. are all pretty darn accurate now. Some of the coordinate rounding seems a bit off before the slicing, but the weighting/etc. is not accurate. There's also the matrix multiplications etc. which are probably inaccurate. But even in through-mode (which skips a lot of that), the weighting is slightly incorrect across a triangle.
Maybe it'll never be accurate and I'm not necessarily trying to make softgpu accurate at all costs, but have definitely aimed to make it accurate where reasonable, and sometimes that has led to perf improvements (some places are accurate to divide by 256 even where you might think it should be 255, etc.)
What I will say is the "region" registers are very interesting and made me think the PSP might do something similar to what we currently do. You have two registers split into:
For example, suppose I had a scissor set on drawing to constrain to 10,10 - 20,20. Considering only the "distance", my memory is that it goes from the top left of the primitive. So if distance was 5,5... you'd only possibly get 10,10-15,15 filled. Although I'm not sure if it's TL of the primitive before or after scissor.
Then there's rate. This was weird, and no game ever sets it to anything other than 0,0. From memory, this boosts the "consumption rate" of distance at a fixed-point of 2.8. The maximum value, 0x3FF would consume almost 5 "distance" for each pixel, but also "zoom" the primitive out by almost 5 (i.e. texturing, everything else scales too.) The standard value consumes 1 distance per pixel and so is at 1x zoom for the primitive.
But almost all documentation names these registers region "x1,y1" and "x2,y2".
Anyway, just a curiosity about how this might work on the PSP I thought worth mentioning.
-[Unknown]
Actually, I did write out quads, but it introduced a lot of complexity and was often neutral, sometimes better, and sometimes worse for overall performance.
Hm, this is interesting (and counterintuitive).
maybe CPU caches make these cases as fast as I could hope for anyway.
Some of it does sound like trying to beat CPU cache at its own game, which I'm not optimistic about. I also remember how I tried to add a special case for "all texel addresses are equal" to my SSE2 toy softrender, and it didn't help much even in best case (it didn't hurt much either). Though here we can bake conversion from underlying format into caching, which may be a win.
It looked like clang was vectorizing decently.
Oh, yes, clang is happy to auto-vectorize on -O2. And GCC 12+ too. Missed that.
Right now we often end up with say 60% of threads taking on 85% of the work, and if we could even it out more it'd lead to better performance (on multi-core processors.)
Are tasks fine-grained enough to make work-stealing worthwhile?
NarrowMinMaxX
Now that I think about it, it might be faster to skip NarrowMinMaxX for tiny ranges (say <=8 pixels), where the extra division probably hurts more than it saves (I assume it is purely an optimization, and does not affect correctness).
One reason for the two pixels is that things are intentionally aligned to a 2x2 grid for parallelism reasons.
In DrawTriangleSlice, for both minX and minY I'm seeing values that are odd multiples of SCREEN_SCALE_FACTOR. Or do you mean something else?
but some operations seem missing (notably bitwise stuff).
Well, Cast is also not SIMD-ified. It features prominently in Interpolate. The auto-vectorization results are interesting. If I'm reading this (it's just a simple synthetic example; not what PPSSPP does) right:
- why the unpcklps in the float->int conversion?
- standalone Cast vectorizes, but goes scalar when inlined. What?

Hm, this is interesting (and counterintuitive).
I have thought about trying a few different things. I was initially a bit overzealous in trying to avoid early-exits maybe (could branch if all pixels fail alpha test), and maybe going from 2x2 to 4x1 would be better (possible con: maybe worse for tall and thin, which are not that uncommon.)
I'm seeing values that are odd multiples of SCREEN_SCALE_FACTOR. Or do you mean something else?
Hm, at least if it's split into bins, I remembered it being aligned. But maybe it was something I was planning to do or once did and decided wasn't worth it? It has been through various changes.
Well, Cast is also not SIMD-ified. It features prominently in Interpolate.
That's interesting and does look pretty terrible. I do see some improvement from changing that, even with some of our specializations. Will just do a quick pull.
-[Unknown]
Doing as many early-exits as possible is probably quite beneficial; it's running on a CPU, after all, not a GPU.
Doing as many early-exits as possible is probably quite beneficial; it's running on a CPU, after all, not a GPU.
My intuition is the same, but needs testing, obviously.
Tangentially related, but I'm considering a PR to centralize the 32-bit x86 workaround along the lines of:
#if PPSSPP_ARCH(X86)
// [Detailed rationale here].
#define SAFE_M128(v) _mm_loadu_ps (reinterpret_cast<const float*> (&(v)))
#define SAFE_M128I(v) _mm_loadu_si128(reinterpret_cast<const __m128i*>(&(v)))
#else // x64, FWIW, also works on non-x86.
#define SAFE_M128(v) (v)
#define SAFE_M128I(v) (v)
#endif
to eliminate #ifdefs at usage sites.
Open to suggestions for better names too.
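For what it's worth, a hypothetical call site could look like this (WrapVec4 is an illustrative stand-in, not a PPSSPP type; the point is that the same expression compiles on both 32-bit and 64-bit x86 with no #ifdef):

#include <emmintrin.h>

struct WrapVec4 {
    union {
        int v[4];
        __m128i ivec;
    };
};

static __m128i AddVec(const WrapVec4 &a, const WrapVec4 &b) {
    // SAFE_M128I expands to an unaligned load on 32-bit x86 and to the member itself on x64.
    return _mm_add_epi32(SAFE_M128I(a.ivec), SAFE_M128I(b.ivec));
}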
In several places division is used instead of a shift. When signed, the compiler has to insert extra logic to match truncating division, which can hurt perf (unless truncating division is both possible and the desired behaviour).
To give just one example, here's GetPixelDataOffset (swizzled):
https://godbolt.org/z/37KbTdxnE
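A minimal illustration of the codegen difference (hypothetical functions, not the PPSSPP ones):

#include <cstdint>

// With a signed operand the compiler has to emit a sign fix-up so x / 8 truncates
// toward zero; with an unsigned operand it is a single shift.
int32_t  DivSigned(int32_t x)    { return x / 8; }
uint32_t DivUnsigned(uint32_t x) { return x / 8; }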
That looks like it might be a worthwhile improvement, where applicable. Feel free to PR.
Note that this function, too, is not normally called very often. But it probably should be uint32_t.
-[Unknown]
x86 with JIT aside, wouldn't it get called quite often on ARM (4 times per SampleLinearLevel)?
I think this can be simplified:
static inline void GetTextureCoordinatesProj(const VertexData& v0, const VertexData& v1, const float p, float &s, float &t) {
// This is for texture matrix projection.
float q0 = 1.f / v0.clipw;
float q1 = 1.f / v1.clipw;
float wq0 = p * q0;
float wq1 = (1.0f - p) * q1;
float q_recip = 1.0f / (wq0 + wq1);
float q = (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1) * q_recip;
q_recip *= 1.0f / q;
s = (v0.texturecoords.s() * wq0 + v1.texturecoords.s() * wq1) * q_recip;
t = (v0.texturecoords.t() * wq0 + v1.texturecoords.t() * wq1) * q_recip;
}
Namely, q_recip can be computed directly as float q_recip = 1.0f / (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1);. This is not so much about speed; the current code just looks a bit weird (unless it is doing this on purpose, to get NaNs right or something).
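Spelled out, the simplified version would look something like this (a sketch of the suggestion above; the intermediate q and the extra division cancel out algebraically):

static inline void GetTextureCoordinatesProj(const VertexData& v0, const VertexData& v1, const float p, float &s, float &t) {
    // This is for texture matrix projection.
    float q0 = 1.f / v0.clipw;
    float q1 = 1.f / v1.clipw;
    float wq0 = p * q0;
    float wq1 = (1.0f - p) * q1;

    // 1/(wq0+wq1) * (wq0+wq1)/(q0()*wq0 + q1()*wq1) collapses to a single reciprocal.
    float q_recip = 1.0f / (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1);

    s = (v0.texturecoords.s() * wq0 + v1.texturecoords.s() * wq1) * q_recip;
    t = (v0.texturecoords.t() * wq0 + v1.texturecoords.t() * wq1) * q_recip;
}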
There's a trick where you calculate 2 reciprocals for the price of 1:
// Naive.
float inv_a = 1.0f / a;
float inv_b = 1.0f / b;
// Trick.
float inv_ab = 1.0f / (a * b);
float inv_a = inv_ab * b;
float inv_b = inv_ab * a;
This can also be extended to 3 or more reciprocals, with more overhead. It does come with a lot of fine print, though (zeroes botch everything, and it's non-exact where it could be), and might not be worth it anyway (CPUs can run several divisions in parallel).
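The three-value extension, for illustration (same caveats about zeroes and exactness apply):

// One division instead of three.
float ab = a * b;
float inv_abc = 1.0f / (ab * c);
float inv_a = inv_abc * (b * c);
float inv_b = inv_abc * (a * c);
float inv_c = inv_abc * ab;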
Looks like constants only need to be computed once (I doubt the compiler hoists them, since it doesn't see const RasterizerState as a guarantee that the contents won't change):
https://github.com/hrydgard/ppsspp/blob/42d4b5d41d7a742622262da223c96e9e70dbb8cf/GPU/Software/Rasterizer.cpp#L1126-L1127
x86 with JIT aside, wouldn't it get called quite often on ARM (4 times per SampleLinearLevel)?
Fair enough.
GetTextureCoordinatesProj
This is probably a case of tweaking something that was slightly wrong and then not fixing it, oops. Yes, should make sense to simplify.
There's a trick where you calculate 2 reciprocals for price of 1:
If w is zero or infnan, generally graphics won't draw anyway on a PSP. That said, unless it realizes a measurable FPS gain, I'd be wary of doing things like that.
Looks like constants only need to be computed once
Yeah, I've tried this and it didn't help much and may have hurt. Maybe it's worth doing, but in this case the code only runs in throughmode (no matrix transforms, used for 2D drawing typically, most often low triangle counts.) And another if per triangle hurts all those tiny hand/finger parts of 3D models, although only very slightly. But at least before, it was such a small difference I couldn't tell it from error with or without.
-[Unknown]
Hm, IsRightSideOrFlatBottomLine looks a bit strange.
Normally (assuming you know the winding order), you only need 2 points (the edge itself) to determine whether the edge function needs a bias to satisfy the fill rules, and no division.
For the traditional top-left rule, it's something like this:
// Get bias for oriented edge A->B, assuming (visible) triangles are CCW
// in Cartesian (y-up) coordinates.
static inline int edge_function_bias(const Vec2<int> &A, const Vec2<int> &B)
{
    bool is_top_left = (B.y == A.y && B.x < A.x) || (B.y < A.y);
    return is_top_left ? 0 : -1;
}
Not sure what the fill rules are for the PSP, but this feels off (at least the implementation; the semantics might be correct).
Looks like IsRightSideOrFlatBottomLine does not return quite what it claims.
Here is the function in its current form:
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
if (line1.y == line2.y) {
// just check if vertex is above us => bottom line parallel to x-axis
return vertex.y < line1.y;
} else {
// check if vertex is on our left => right side
return vertex.x < line1.x + (line2.x - line1.x) * (vertex.y - line1.y) / (line2.y - line1.y);
}
}
Consider:
Vec2<int> A( 63, -59);
Vec2<int> B(-49, 73);
Vec2<int> C( 67, -64);
bool result = IsRightSideOrFlatBottomLine(C, A, B); // <-- false.
However, in a (rather thin) triangle ABC we have line1.x + (line2.x - line1.x) * double(vertex.y - line1.y) / (line2.y - line1.y) == 67.24..., so vertex.x is less than that.
Edge AB also happens to pass through (7,7), which TriangleEdge::Start considers a sampling point (a pixel center, as SCREEN_SCALE_FACTOR = 16), and so the bias here does affect which pixels get covered (I think).
The example for the other winding order is Vec2<int> A( 2, 58); Vec2<int> B( 12, -44); Vec2<int> C( 8, -6); (it also passes through (7,7)).
I think what it tries to compute can be written as:
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
if (line1.y == line2.y) {
// just check if vertex is above us => bottom line parallel to x-axis
return vertex.y < line1.y;
} else {
// check if vertex is on our left => right side
int64_t delta = int64_t(vertex.x - line1.x) * int64_t(line2.y - line1.y) - int64_t(line2.x - line1.x) * int64_t(vertex.y - line1.y);
return line2.y > line1.y ? (delta < 0) : (delta > 0);
}
}
The delta is, unsurprisingly, just a 2x2 determinant, i.e. double the triangle's oriented area.
Experiments indicate that the discrepancy occurs with about 0.0038% chance for vertices uniformly distributed in (480*16 x 272*16), but it is much more likely for tiny triangles (>0.5%). IsRightSideOrFlatBottomLine seems conservative, i.e. it can wrongly return false, but not true. Meaning, there should be no cracks, but potential overdraw, which probably only matters for semi-transparent stuff (and maybe some stencil things).
The compilers don't care to optimize return mask.x >= 0 || mask.y >= 0 || mask.z >= 0 || mask.w >= 0; (the fallback in AnyMask) to the branchless return (mask.x & mask.y & mask.z & mask.w) >= 0. Except, funnily enough, in the NEON version something vaguely similar seems to happen, contrary to what the actual intrinsics state: https://godbolt.org/z/dos4rfrcY. Likely not worth bothering with, just thought it amusing.
Hm, IsRightSideOrFlatBottomLine looks a bit strange.
Yeah, that function is 10 years old and I think came from Dolphin: https://github.com/hrydgard/ppsspp/commit/b0d3848dc74ac44ef705f782f7f54d611354fb95
We used to have much worse coverage issues, but they were caused by other worse problems. I think we probably don't really have a test that directly covers this. The closest we have are tests.py -m gpu/primitives/ or gpu/triangle/triangle. You'll see that gpu/triangle/triangle fails from linear interpolation mainly (it generates __testcompare.png which helps comparing the pixels.)
Should probably create a specific coverage test for each primitive, I guess, and check right to left order, etc.
-[Unknown]
https://github.com/hrydgard/ppsspp/blob/f3d95a2cef1158b6fd02d8d99d6da1cc84ea3d8e/GPU/Software/SamplerX86.cpp#L1080-L1088 Loading 32 bits at a time does seem consistently faster in SSE2: https://godbolt.org/z/1j75fz9aM
Instruction set | Path       | Speed
----------------|------------|----------
SSE2            | movd       | 10.4 GB/s
SSE2            | pinsrw     | 8.4 GB/s
SSE4.1          | pinsrd     | 12.6 GB/s
AVX2            | vpgatherdd | 14.5 GB/s
I sometimes get a case where vpgatherdd is slower than the pinsrd-based version.
Might not be worth it, especially for a rare path.
Just to mention the obvious. Since w0+w1+w2=const, interpolation can be rewritten as w0*A0+w1*A1+w2*A2 = w0*A0+w1*A1+(W-w0-w1)*A2 = w0*(A0-A2)+w1*(A1-A2)+W*A2, where W=w0+w1+w2. This can be a win for constant A0, A1, A2 (since the coefficients can be precomputed; you can also bake in division by W if need be). You can even parametrize by screen X,Y, rather than w0,w1.
There may be some accuracy issues though (e.g. w0=1, w1=0, w2=0 does not guarantee A0 output).
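For illustration, the per-triangle setup could look roughly like this (names are illustrative, not PPSSPP's; wsum_recip is the usual 1/(w0+w1+w2)):

struct AttrSetup {
    float dA0, dA1, base;  // A0 - A2, A1 - A2, and W * A2, precomputed per triangle
};

static AttrSetup SetupAttr(float A0, float A1, float A2, float W) {
    return { A0 - A2, A1 - A2, W * A2 };
}

static float InterpAttr(const AttrSetup &s, float w0, float w1, float wsum_recip) {
    // Same value as (w0*A0 + w1*A1 + w2*A2) / (w0 + w1 + w2), with w2 = W - w0 - w1.
    return (w0 * s.dA0 + w1 * s.dA1 + s.base) * wsum_recip;
}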
Loading 32 bits at a time does seem consistently faster in SSE2:
Well, wait, we can't use PINSRD in this path - we use that for SSE4. The question here is, would it be better to MOVD from RAM and then blend? Probably only relevant to try that on a pre-SSE4 CPU. So I just left it as two PINSRW.
I sometimes get a case where vpgatherdd is slower than the pinsrd-based version.
At least on a Coffee Lake, I had pretty significant gains from using VPGATHERDD in PPSSPP (#15275.) Sure have come a long way since those speeds, heh.
This can be a win for constant A0, A1, A2 (since the coefficients can be precomputed; you can also bake in division by W if need be).
Hm, this sounds like it could be a good idea. Might have to be careful about any accuracy impacts as you mention, although since we go at half pixels I don't know how often any weight lands directly on zero. But of course, it can happen.
-[Unknown]
The path named movd in the table (load32x4_movd in the godbolt example) does not use PINSRD, and looks about 20% faster than the PINSRW-based version there.
Probably only relevant to try that on a pre-SSE4 CPU.
On my (pre-SSE4) machine (adapted the same code from godbolt):
movd : 1.509 GB/s
pinsrw : 1.025 GB/s
That said, loading to a temporary in memory may be more hassle in JIT, so it's understandable not to bother with it for a (probably) rarely used path.
On a different note, out of curiosity, I tried
LIBGL_ALWAYS_SOFTWARE=true ./PPSSPPSDL
and it was about twice as fast as PPSSPP's SW rendering (14 FPS vs 7; though still a lot slower than OpenGL, which runs at a full 60 FPS) in the 3D scene in Soulcalibur. Of course, it is not a fair comparison, since the SW renderer does some things that OpenGL simply doesn't (e.g. dithering, though that one is probably relatively cheap). Yet, this suggests there is still room for growth.
On Windows, to test this, you can simply place Mesa3D's opengl32.dll next to PPSSPP's executable (this worked fine when I tried it). I think you can get the dll from e.g. https://github.com/pal1000/mesa-dist-win or https://download.qt.io/development_releases/prebuilt/llvmpipe/windows/ (I haven't tested these specific ones).
Running our hardware renderers in an outside software rasterizer is "cheating" in a number of ways - they will benefit from pre-decoded cached textures, for example, while our software renderer decodes on every access, and will inherit all the other inaccuracies of the hardware renderers, of course.
What should happen
As mentioned, some (possibly dumb; I'm trying to wrap my head around it) observations on SoftGPU.
First off, here's what performance looks like (Linux, 32-bit x86 with SSE2 but without SSE4):
This is me playing a couple of minutes of "Soulcalibur: Broken Destiny", mostly a 3D scene.
The profile is flat, without --callgraph - it just shows where (and how often) samples land.

Even without JIT, Rasterizer::DrawSinglePixel accounts for surprisingly little - much less than Sampler::SampleLinearLevel (there might be some games that do not use linear much). A lot of time is spent in Rasterizer::DrawTriangleSlice itself (and whatever is inlined into it).

Now, looking at the code, the biggest thing is that most of the processing is done per pixel. DrawTriangleSlice actually works on 2x2 pixel quads, but then most of the actual processing is per-pixel inside the quad. All of state.drawPixel, state.nearest, and state.linear are per-pixel, even when JIT-ed. I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.

Converting it to entirely quad-based seems like a lot of work (especially since it involves both JIT and non-JIT parts, in sync). It also seems lucrative for performance. Even purely scalar quad-based versions are likely to be faster than their counterparts, since the various if(state.whatever) checks would be amortized. And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX (and emulating gather in plain scalar+SSE is not even that bad). Going by names like DrawSinglePixel, the idea seems to have been there all along. Aside, I don't have statistics on how common tiny triangles are, or what percentage of 2x2 quads are full.
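On the "emulating gather in plain scalar+SSE" point, the shape is roughly this (illustrative, not PPSSPP code): four scalar loads packed into one register, one texel per pixel of the quad.

#include <emmintrin.h>
#include <cstdint>

static inline __m128i GatherTexels32(const uint32_t *texture, const int offsets[4]) {
    return _mm_set_epi32(texture[offsets[3]], texture[offsets[2]],
                         texture[offsets[1]], texture[offsets[0]]);
}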
The Vec3<...> and Vec4<...> are used extensively, but some operations seem missing (notably bitwise stuff). Also, only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize, since auto-vectorization is under -O3, but PPSSPP uses -O2. Not sure if performance on ARM is a concern.

Do not see why https://github.com/hrydgard/ppsspp/blob/a56f74c8c0781b36ef0248d71482244a75ecccba/GPU/Software/Rasterizer.cpp#L963 is per-pixel (per-quad). Normally, the w0+w1+w2=const invariant holds for the entire triangle (the entire screen, actually), unless some weird per-edge scaling is done. When I tried computing it once per DrawTriangleSlice, there appeared to be no visible problems.

Who would this benefit
-
Platform (if relevant)
None
Games this would be useful in
-
Other emulators or software with a similar feature
No response