hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org

SoftGPU perf opportunities #17613

Open fp64 opened 1 year ago

fp64 commented 1 year ago

What should happen

As mentioned, some (possibly dumb; I'm trying to wrap my head around it) observations on SoftGPU.

First off, here's what performance looks like (Linux, 32-bit x86 with SSE2 but without SSE4):

```
$ perf record ./PPSSPPSDL
$ perf report --stdio | head -n 100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:u'
# Event count (approx.): 2130104713894
#
# Overhead  Command       Shared Object  Symbol
# ........  ............  .............  ......
#
    10.82%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::DrawTriangleSlice
    10.67%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::DrawTriangleSlice
     3.02%  PPSSPPSDL     r600_dri.so    [.] 0x002bcd15
     3.01%  PPSSPPSDL     r600_dri.so    [.] 0x002bcd23
     2.65%  PoolWorker 0  PPSSPPSDL      [.] Sampler::SampleLinearLevel
     2.54%  PoolWorker 1  PPSSPPSDL      [.] Sampler::SampleLinearLevel
     2.23%  Emu           PPSSPPSDL      [.] Rasterizer::DrawTriangleSlice
     2.07%  PoolWorker 0  PPSSPPSDL      [.] .L426
     2.06%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::GetPixelColor
     2.04%  PoolWorker 1  PPSSPPSDL      [.] .L426
     1.96%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::GetPixelColor
     1.75%  PoolWorker 0  PPSSPPSDL      [.] Sampler::LookupColor
     1.67%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::SetPixelColor
     1.63%  PoolWorker 1  PPSSPPSDL      [.] Sampler::LookupColor
     1.60%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::SetPixelColor
     1.41%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::ApplyTexturing
     1.39%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::ApplyTexturing
     1.36%  PoolWorker 0  PPSSPPSDL      [.] Sampler::TransformClutIndex
     1.27%  PoolWorker 0  PPSSPPSDL      [.] .L427
     1.27%  PoolWorker 1  PPSSPPSDL      [.] Sampler::TransformClutIndex
     1.17%  PoolWorker 1  PPSSPPSDL      [.] .L427
     1.15%  PoolWorker 0  PPSSPPSDL      [.] Sampler::SampleLinear
     1.14%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::DrawSinglePixel
     1.02%  PoolWorker 1  PPSSPPSDL      [.] Sampler::SampleLinear
     1.01%  PoolWorker 0  PPSSPPSDL      [.] Sampler::SampleNearest
     1.01%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::DrawSinglePixel
     0.99%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::DrawSinglePixel
     0.99%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::DrawSinglePixel
     0.92%  PoolWorker 0  PPSSPPSDL      [.] .L422
     0.91%  PoolWorker 0  PPSSPPSDL      [.] .L774
     0.88%  PoolWorker 1  PPSSPPSDL      [.] .L422
     0.87%  PoolWorker 1  PPSSPPSDL      [.] .L774
     0.85%  Emu           libm-2.31.so   [.] ceilf32
     0.74%  Emu           PPSSPPSDL      [.] Lighting::ProcessSIMD
     0.74%  PoolWorker 1  PPSSPPSDL      [.] Sampler::SampleNearest
     0.73%  PoolWorker 0  PPSSPPSDL      [.] .L490
     0.67%  PoolWorker 1  PPSSPPSDL      [.] .L490
     0.59%  PoolWorker 0  PPSSPPSDL      [.] Sampler::GetTextureFunctionOutput
     0.54%  PoolWorker 0  PPSSPPSDL      [.] .L724
     0.53%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::DrawRectangle
     0.53%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::DrawRectangle
     0.50%  PoolWorker 1  PPSSPPSDL      [.] .L724
     0.49%  PoolWorker 0  PPSSPPSDL      [.] .L721
     0.48%  PoolWorker 1  PPSSPPSDL      [.] Sampler::GetTextureFunctionOutput
     0.48%  Emu           PPSSPPSDL      [.] TransformUnit::ReadVertex
     0.46%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::CheckDepthTestPassed
     0.44%  PoolWorker 1  PPSSPPSDL      [.] .L721
     0.43%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::CheckDepthTestPassed
     0.43%  Emu           PPSSPPSDL      [.] BinManager::AddTriangle
     0.42%  PPSSPPSDL     r600_dri.so    [.] 0x000ef115
     0.41%  PoolWorker 0  PPSSPPSDL      [.] .L775
     0.41%  PoolWorker 0  PPSSPPSDL      [.] __x86.get_pc_thunk.bx
     0.41%  Emu           PPSSPPSDL      [.] .L338
     0.40%  PoolWorker 1  PPSSPPSDL      [.] .L775
     0.39%  Emu           PPSSPPSDL      [.] SoftGPU::FastRunLoop
     0.37%  PoolWorker 0  PPSSPPSDL      [.] Rasterizer::DrawSprite
     0.36%  PoolWorker 1  PPSSPPSDL      [.] __x86.get_pc_thunk.bx
     0.34%  Emu           PPSSPPSDL      [.] Sampler::SampleLinearLevel
     0.32%  Emu           PPSSPPSDL      [.] ClipToScreenInternal
     0.29%  PoolWorker 0  PPSSPPSDL      [.] Sampler::SampleNearest<1>
     0.28%  PoolWorker 1  PPSSPPSDL      [.] Rasterizer::DrawSprite
     0.28%  PoolWorker 0  PPSSPPSDL      [.] .L44
     0.27%  Emu           PPSSPPSDL      [.] .L427
     0.27%  PoolWorker 1  PPSSPPSDL      [.] __x86.get_pc_thunk.ax
     0.27%  PoolWorker 0  PPSSPPSDL      [.] __x86.get_pc_thunk.ax
     0.25%  PoolWorker 0  PPSSPPSDL      [.] .L485
     0.25%  PoolWorker 1  PPSSPPSDL      [.] .L44
     0.23%  PoolWorker 1  PPSSPPSDL      [.] .L485
     0.23%  Emu           PPSSPPSDL      [.] Clipper::ProcessTriangle
     0.23%  PPSSPPSDL     r600_dri.so    [.] 0x002c0488
     0.23%  Emu           PPSSPPSDL      [.] .L285
     0.23%  Emu           PPSSPPSDL      [.] Rasterizer::ApplyTexturing
     0.22%  PoolWorker 0  PPSSPPSDL      [.] __x86.get_pc_thunk.si
     0.22%  Emu           PPSSPPSDL      [.] Sampler::LookupColor
     0.21%  PoolWorker 1  PPSSPPSDL      [.] __x86.get_pc_thunk.si
     0.21%  PoolWorker 0  PPSSPPSDL      [.] .L377
     0.21%  PoolWorker 1  PPSSPPSDL      [.] Sampler::SampleNearest<1>
     0.21%  PoolWorker 0  PPSSPPSDL      [.] .L748
     0.21%  Emu           PPSSPPSDL      [.] Rasterizer::GetPixelColor
     0.19%  Emu           PPSSPPSDL      [.] .L426
     0.19%  PoolWorker 1  PPSSPPSDL      [.] .L748
     0.19%  Emu           PPSSPPSDL      [.] Sampler::TransformClutIndex
     0.18%  PoolWorker 1  PPSSPPSDL      [.] .L374
     0.18%  PoolWorker 0  PPSSPPSDL      [.] .L374
     0.18%  Emu           PPSSPPSDL      [.] ConvertBGRA5551ToABGR1555
     0.17%  Emu           PPSSPPSDL      [.] Rasterizer::DrawSinglePixel
     0.17%  Emu           PPSSPPSDL      [.] Rasterizer::SetPixelColor
     0.17%  Emu           PPSSPPSDL      [.] Rasterizer::CalculateRasterStateFlags
     0.17%  Emu           PPSSPPSDL      [.] Sampler::SampleNearest
```

This is me playing a couple of minutes of "Soulcalibur: Broken Destiny", mostly a 3D scene.

The profile is flat (no --call-graph) - it just shows where (and how often) samples land.

Even without JIT, Rasterizer::DrawSinglePixel accounts for surprisingly little - much less than Sampler::SampleLinearLevel (there might be some games that do not use linear much). A lot of time is spent in Rasterizer::DrawTriangleSlice itself (and whatever is inlined into it).

Now, looking at the code, the biggest thing is that most of the processing is done per pixel. DrawTriangleSlice actually works on 2x2 pixel quads, but most of the actual processing is then per-pixel inside the quad. All of state.drawPixel, state.nearest, and state.linear are per-pixel, even when JIT-ed. I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.

Converting it to be entirely quad-based seems like a lot of work (especially since it involves keeping both the JIT and non-JIT parts in sync). It also looks lucrative for performance. Even purely scalar quad-based versions are likely to be faster than their per-pixel counterparts, since the various if(state.whatever) checks would be amortized over four pixels. And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX (and emulating gather in plain scalar+SSE is not even that bad - a sketch follows below). Going by names like DrawSinglePixel, the idea seems to have been there all along. As an aside, I don't have statistics for how common tiny triangles are, or what percentage of 2x2 quads are full.
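For illustration, a scalar+SSE2 gather emulation could look something like this (a generic sketch, not PPSSPP code; names are made up):

```cpp
#include <emmintrin.h>
#include <cstdint>

// Emulated 4-lane gather of 32-bit texels: four scalar loads, then pack into
// one __m128i. On pre-AVX2 hardware a "gather" has to be assembled like this anyway.
static inline __m128i Gather32(const uint32_t *texels, const int idx[4]) {
    __m128i a = _mm_cvtsi32_si128((int)texels[idx[0]]);
    __m128i b = _mm_cvtsi32_si128((int)texels[idx[1]]);
    __m128i c = _mm_cvtsi32_si128((int)texels[idx[2]]);
    __m128i d = _mm_cvtsi32_si128((int)texels[idx[3]]);
    // Interleave to [texels[idx[0]], texels[idx[1]], texels[idx[2]], texels[idx[3]]].
    return _mm_unpacklo_epi64(_mm_unpacklo_epi32(a, b), _mm_unpacklo_epi32(c, d));
}
```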

The Vec3<...> and Vec4<...> types are used extensively, but some operations seem missing (notably bitwise stuff). Also, only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize, since auto-vectorization is enabled at -O3 but PPSSPP builds with -O2. Not sure if performance on ARM is a concern.

I don't see why https://github.com/hrydgard/ppsspp/blob/a56f74c8c0781b36ef0248d71482244a75ecccba/GPU/Software/Rasterizer.cpp#L963 is per-quad. Normally the invariant w0+w1+w2 = const holds for the entire triangle (the entire screen, actually), since each weight is affine in screen x,y and their sum is the triangle's doubled area - unless some weird per-edge scaling is done. When I tried computing it once per DrawTriangleSlice, no visible problems appeared.

Who would this benefit

-

Platform (if relevant)

None

Games this would be useful in

-

Other emulators or software with a similar feature

No response


fp64 commented 1 year ago

As an aside, I don't have statistics for how common tiny triangles are, or what percentage of 2x2 quads are full.

Collected some data on that. For "Soulcalibur: Broken Destiny" in 3D scene, the distribution of mask is:

 140271560 [0 0000] <-- full

   2274034 [1 0001] +
   1965309 [2 0010] | 3 pixels
   1971403 [4 0100] |
   2274954 [8 1000] +

  13912078 [3 0011] +
   8473968 [5 0101] | 
    195576 [6 0110] | 2 pixels
    209139 [9 1001] |
   7537161 [A 1010] |
  13587213 [C 1100] +

   4959001 [7 0111] +
   4246164 [B 1011] | 1 pixel
   4298595 [D 1101] |
   4361177 [E 1110] +

  57898124 [F 1111] <-- rejected

4 pixels 140271560 
3 pixels   8485700
2 pixels  43915135
1 pixel   17864937
0 pixels  57898124 
Total: 268435456 quads
hrydgard commented 1 year ago

Where would the 0 pixel quads come from? Do you count alpha and depth testing there?

fp64 commented 1 year ago

No, I just count from mask here:

https://github.com/hrydgard/ppsspp/blob/a56f74c8c0781b36ef0248d71482244a75ecccba/GPU/Software/Rasterizer.cpp#L960C1-L962C33

So, 0 pixel quads are rejected right away and are pretty cheap.

As for where they come from - if you just walk the entire AABB (as some simple half-space rasterizers do) you'll get a lot of 0-pixel quads (a triangle covers 11/36 of its AABB on average, and never more than half). Here NarrowMinMaxX does trim it down, but it looks like it is somewhat conservative.
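For context, this is roughly what such a naive half-space rasterizer looks like (a generic sketch, not PPSSPP's code) - every point of the bounding box gets tested, so most candidate pixels/quads end up empty:

```cpp
#include <algorithm>

struct P { int x, y; };

// Doubled signed area of (a, b, c); the usual edge function.
static inline int Edge(const P &a, const P &b, const P &c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

void PlotPixel(int x, int y);  // hypothetical output function

void RasterizeAABB(const P &v0, const P &v1, const P &v2) {
    int minX = std::min({v0.x, v1.x, v2.x}), maxX = std::max({v0.x, v1.x, v2.x});
    int minY = std::min({v0.y, v1.y, v2.y}), maxY = std::max({v0.y, v1.y, v2.y});
    for (int y = minY; y <= maxY; ++y) {
        for (int x = minX; x <= maxX; ++x) {
            P p{x, y};
            // Inside (for CCW winding) only if all three edge functions agree.
            if (Edge(v0, v1, p) >= 0 && Edge(v1, v2, p) >= 0 && Edge(v2, v0, p) >= 0)
                PlotPixel(x, y);
        }
    }
}
```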

hrydgard commented 1 year ago

Ah right, that makes sense.

BTW, Unknown has done quite a bit of exploration of the raster-in-quads idea in the past.

fp64 commented 1 year ago

A more obvious example is tiny triangles - a triangle so small that it does not actually cover any sampling points still necessitates testing at least one quad. E.g. a character's fingers probably average less than 1 pixel per triangle.

unknownbrackets commented 1 year ago

I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.

Actually, I did write out quads, but it introduced a lot of complexity and was often neutral, sometimes better, and sometimes worse for overall performance. The changes can be seen here and are mostly dedicated to the jit: https://github.com/hrydgard/ppsspp/compare/master...unknownbrackets:ppsspp:softjit-vec (there are probably some things that need cleanup on this branch, but I was hoping for more from this initial go...)

As you mention, a lot of time is spent in the barycentric stuff, so I instead moved to improving parallelism and to identifying safe cases of rectangles or other things that could be calculated more simply and efficiently (previously a lot of time was wasted on things like full-screen triangle strips.)

And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX

Even with AVX2 it's definitely complicated when you think about most textures being CLUT, so you end up in a situation where you're doing a gather from a gather. I have not really attacked that.

A simpler (maybe) thing I've thought about is pre-caching certain textures based on heuristics. We end up with some annoying, wasteful cases where we burn a lot of time in the sampler.

The PSP itself had a texture cache (which would be difficult to accurately model for sure), so there might be something to pre-caching and flushing that cache from texflush or texsync GE commands (I forget which, one seems to be related to block transfer actually iirc.) That said, it might make more sense to keep it thread-local to maintain high parallelism. And I'm not certain pre-caching would even be worth it... maybe CPU caches make these cases as fast as I could hope for anyway. I'm pretty sure doing it every draw would not be worth it.

only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize

Well, I've got an 8700 CPU (which I consider somewhat high end) and my goal so far has just been to move from measuring "seconds per frame" to making most games largely playable on my CPU. But I had someone try on an M1 and it ran quite well, I think actually as well or better than on my 8700. It looked like clang was vectorizing decently. As with Intel though, I've put nearly zero thought into 32-bit and only even looked at arm64.

Vec4<float> wsum_recip = EdgeRecip(w0, w1, w2);

This might just be cargo culting - I think you're right that the sum should be constant. That said, I recommend measuring any such change both in-game and with frame dumps, because sometimes changes have subtle and unexpected impacts...

Here NarrowMinMaxX does trim it down, but it looks like it is somewhat conservative. E.g. a character's fingers probably average less than 1 pixel per triangle.

There are also things like the scissor mask and bias that impact that a bit, but you get into an economics problem trying to account for it all, which can negatively impact a game that is in the playable range for a very small benefit to a game that is either already well playable or not very playable anyway. It's probably worth more experimentation, but there are a bunch of tradeoffs for sure.

One reason for the two pixels is that things are intentionally aligned to a 2x2 grid for parallelism reasons. So you can end up with an unfortunate "upper half" and "lower half" but it's worth it overall to make sure you don't risk overlap between threads running different areas of the screen.

That's another area where there could be a lot of gains, but I've struggled to find better heuristics than the current as it's a tricky case of tradeoffs as well. Right now we often end up with say 60% of threads taking on 85% of the work, and if we could even it out more it'd lead to better performance (on multi-core processors.)

-[Unknown]

unknownbrackets commented 1 year ago

Also worth noting: the barycentric triangle-slice code is currently the least accurate part of the pipeline. Blending, sampling/filtering, tests, etc. are all pretty darn accurate now. Some of the coordinate rounding seems a bit off before the slicing, but the weighting etc. is not accurate. There are also the matrix multiplications etc., which are probably inaccurate. But even in through-mode (which skips a lot of that), the weighting is slightly incorrect across a triangle.

Maybe it'll never be accurate, and I'm not necessarily trying to make softgpu accurate at all costs, but I have definitely aimed to make it accurate where reasonable, and sometimes that has led to perf improvements (some places are accurate in dividing by 256 even where you might think it should be 255, etc.)

What I will say is the "region" registers are very interesting and made me think the PSP might do something similar to what we currently do. You have two registers split into:

For example, suppose I had a scissor set to constrain drawing to 10,10 - 20,20. Considering only the "distance", my memory is that it goes from the top left of the primitive. So if the distance was 5,5... you'd only possibly get 10,10-15,15 filled. Although I'm not sure if it's the TL of the primitive before or after scissor.

Then there's rate. This was weird, and no game ever sets it to anything other than 0,0. From memory, this boosts the "consumption rate" of distance, as a 2.8 fixed-point value. The maximum value, 0x3FF, would consume almost 5 "distance" for each pixel, but also "zoom" the primitive out by almost 5x (i.e. texturing and everything else scales too.) The standard value consumes 1 distance per pixel and so is at 1x zoom for the primitive.

But almost all documentation names these registers region "x1,y1" and "x2,y2".

Anyway, just a curiosity about how this might work on the PSP I thought worth mentioning.

-[Unknown]

fp64 commented 1 year ago

Actually, I did write out quads, but it introduced a lot of complexity and was often neutral, sometimes better, and sometimes worse for overall performance.

Hm, this is interesting (and counterintuitive).

maybe CPU caches make these cases as fast as I could hope for anyway.

Some of it does sound like trying to beat the CPU cache at its own game, which I'm not optimistic about. I also remember trying to add a special case for "all texel addresses are equal" to my SSE2 toy software renderer, and it didn't help much even in the best case (it didn't hurt much either). Though here we could bake conversion from the underlying format into the caching, which may be a win.

It looked like clang was vectorizing decently.

Oh, yes, clang is happy to auto-vectorize on -O2. And GCC 12+ too. Missed that.

Right now we often end up with say 60% of threads taking on 85% of the work, and if we could even it out more it'd lead to better performance (on multi-core processors.)

Are the tasks fine-grained enough to make work-stealing worthwhile?

NarrowMinMaxX

Now that I think about it, it might be faster to skip NarrowMinMaxX for tiny ranges (say <=8 pixels), where the extra division probably costs more than it saves (I assume it is purely an optimization, and does not affect correctness).

fp64 commented 1 year ago

One reason for the two pixels is that things are intentionally aligned to a 2x2 grid for parallelism reasons.

In DrawTriangleSlice, for both minX and minY I'm seeing values that are odd multiples of SCREEN_SCALE_FACTOR. Or do you mean something else?

fp64 commented 1 year ago

but some operations seem missing (notably bitwise stuff).

Well, Cast is also not SIMD-ified. It features prominently in Interpolate. The auto-vectorization results are interesting, if I'm reading this right (it's just a simple synthetic example, not what PPSSPP does):

unknownbrackets commented 1 year ago

Hm, this is interesting (and counterintuitive).

I have thought about trying a few different things. I was maybe initially a bit overzealous in trying to avoid early-exits (we could branch if all pixels fail the alpha test), and maybe going from 2x2 to 4x1 would be better (possible con: maybe worse for tall and thin triangles, which are not that uncommon.)

I'm seeing values that are odd multiples of SCREEN_SCALE_FACTOR. Or do you mean something else?

Hm, at least if it's split into bins, I remembered it being aligned. But maybe it was something I was planning to do or once did and decided wasn't worth it? It has been through various changes.

Well, Cast is also not SIMD-ified. It features prominently in Interpolate.

That's interesting and does look pretty terrible. I do see some improvement from changing that, even with some of our specializations. Will just do a quick pull.

-[Unknown]

hrydgard commented 1 year ago

Doing as many early-exits as possible is probably quite beneficial, it's running on a CPU after all, not a GPU.

fp64 commented 1 year ago

Doing as many early-exits as possible is probably quite beneficial, it's running on a CPU after all, not a GPU.

My intuition is the same, but needs testing, obviously.

Tangentially related, but I'm considering a PR to centralize the 32-bit x86 workaround along the lines of:

```cpp
#if PPSSPP_ARCH(X86)
// [Detailed rationale here].
#define SAFE_M128(v)  _mm_loadu_ps   (reinterpret_cast<const float*>  (&(v)))
#define SAFE_M128I(v) _mm_loadu_si128(reinterpret_cast<const __m128i*>(&(v)))
#else // x64, FWIW, also works on non-x86.
#define SAFE_M128(v)  (v)
#define SAFE_M128I(v) (v)
#endif
```

to eliminate #ifdefs at each usage site.

Open to suggestions for better names too.
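For illustration, a call site might then read like this (the variable and member names here are made up, standing in for any Vec4-backed value that may be under-aligned on 32-bit x86):

```cpp
// Hypothetical call site (field name 'ivec' assumed): one line instead of an
// #if PPSSPP_ARCH(X86) / #else / #endif block around the load.
__m128i pc = SAFE_M128I(prim_color.ivec);
```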

fp64 commented 1 year ago

In several places division is used instead of a shift. When the operands are signed, the compiler has to insert extra logic to match C++'s truncating division, which can hurt perf (unless truncating division is both possible and the desired behaviour). To give just one example, here's GetPixelDataOffset (swizzled): https://godbolt.org/z/37KbTdxnE
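A generic illustration of the point (not the PPSSPP function itself): dividing a signed value by a power of two makes the compiler emit a rounding fixup, because C++ division truncates toward zero while an arithmetic shift rounds toward negative infinity.

```cpp
#include <cstdint>

int32_t  div_signed(int32_t x)    { return x / 8; }   // extra adjust instructions before the shift
uint32_t div_unsigned(uint32_t x) { return x / 8; }   // compiles to a single shift
int32_t  shift_signed(int32_t x)  { return x >> 3; }  // single shift, but floors rather than truncates
```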

hrydgard commented 1 year ago

That looks like it might be a worthwhile improvement, where applicable. Feel free to PR.

unknownbrackets commented 1 year ago

Note that this function, too, is not normally called very often. But it probably should be uint32_t.

-[Unknown]

fp64 commented 1 year ago

x86 with JIT aside, wouldn't it get called quite often on ARM (4 times per SampleLinearLevel)?

fp64 commented 1 year ago

I think this can be simplified:

```cpp
static inline void GetTextureCoordinatesProj(const VertexData& v0, const VertexData& v1, const float p, float &s, float &t) {
    // This is for texture matrix projection.
    float q0 = 1.f / v0.clipw;
    float q1 = 1.f / v1.clipw;
    float wq0 = p * q0;
    float wq1 = (1.0f - p) * q1;

    float q_recip = 1.0f / (wq0 + wq1);
    float q = (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1) * q_recip;
    q_recip *= 1.0f / q;

    s = (v0.texturecoords.s() * wq0 + v1.texturecoords.s() * wq1) * q_recip;
    t = (v0.texturecoords.t() * wq0 + v1.texturecoords.t() * wq1) * q_recip;
}
```

Namely, float q_recip = 1.0f / (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1);. This is not so much about speed - the current code just looks a bit weird (unless it is doing this on purpose, to get NaNs right or something).
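Spelled out, the simplified version might read as follows (a sketch that just folds the two divides into one; untested, so behaviour for degenerate q/w would need checking):

```cpp
static inline void GetTextureCoordinatesProj(const VertexData& v0, const VertexData& v1, const float p, float &s, float &t) {
    // This is for texture matrix projection.
    float wq0 = p * (1.f / v0.clipw);
    float wq1 = (1.0f - p) * (1.f / v1.clipw);

    // Divide directly by the interpolated q instead of computing q and then dividing by it.
    float q_recip = 1.0f / (v0.texturecoords.q() * wq0 + v1.texturecoords.q() * wq1);
    s = (v0.texturecoords.s() * wq0 + v1.texturecoords.s() * wq1) * q_recip;
    t = (v0.texturecoords.t() * wq0 + v1.texturecoords.t() * wq1) * q_recip;
}
```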

There's a trick where you calculate 2 reciprocals for the price of 1:

```cpp
// Naive.
float inv_a = 1.0f / a;
float inv_b = 1.0f / b;

// Trick.
float inv_ab = 1.0f / (a * b);
float inv_a  = inv_ab * b;
float inv_b  = inv_ab * a;
```

This can also be extended to 3 or more reciprocals, with more overhead. It does come with a lot of fine print, though (zeroes break everything, and results are non-exact where they otherwise could be), and might not be worth it anyway (CPUs can run several divisions in parallel).
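For completeness, the three-reciprocal variant of the same trick (illustrative names; the same caveats about zeroes and rounding apply):

```cpp
// One division plus a handful of multiplications instead of three divisions.
float ab      = a * b;
float inv_abc = 1.0f / (ab * c);
float inv_a   = inv_abc * (b * c);
float inv_b   = inv_abc * (a * c);
float inv_c   = inv_abc * ab;
```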

fp64 commented 1 year ago

Looks like constants only need to be computed once (I doubt the compiler hoists them, since it can't treat const RasterizerState as a guarantee that the contents won't change): https://github.com/hrydgard/ppsspp/blob/42d4b5d41d7a742622262da223c96e9e70dbb8cf/GPU/Software/Rasterizer.cpp#L1126-L1127

unknownbrackets commented 1 year ago

x86 with JIT aside, wouldn't it get called quite often on ARM (4 times per SampleLinearLevel)?

Fair enough.

GetTextureCoordinatesProj

This is probably a case of tweaking something that was slightly wrong and then not fixing it, oops. Yes, should make sense to simplify.

There's a trick where you calculate 2 reciprocals for the price of 1:

If w is zero or inf/NaN, graphics generally won't draw anyway on a PSP. That said, unless it realizes a measurable FPS gain, I'd be wary of doing things like that.

Looks like constants only need to be computed once

Yeah, I've tried this and it didn't help much and may have hurt. Maybe it's worth doing, but in this case the code only runs in throughmode (no matrix transforms, typically used for 2D drawing, most often with low triangle counts.) And another if per triangle hurts all those tiny hand/finger parts of 3D models, although only very slightly. But at least before, the difference was so small I couldn't tell it apart from measurement noise either way.

-[Unknown]

fp64 commented 1 year ago

Hm, IsRightSideOrFlatBottomLine looks a bit strange. Normally (assuming you know the winding order), you only need the 2 points of the edge itself to determine whether the edge function needs a bias to satisfy the fill rules, and no division is needed.

For traditional top-left rule, it's something like this:

// Get bias for oriented edge A->B, assuming (visible) triangles are CCW
// in Cartesian (y - up) coordinates,
static inline int edge_function_bias(const Vec2<int> &A, const Vec2<int> &B)
{
    bool is_top_left=(B.y==A.y&&B.x<A.x)||(B.y<A.y);
    return is_top_left?0:-1;
}

Not sure what fill rules are for PSP, but this feels off (at least the implementation; semantics might be correct).

fp64 commented 1 year ago

Looks like IsRightSideOrFlatBottomLine does not return quite what it claims. Here is the function in its current form:

```cpp
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
    if (line1.y == line2.y) {
        // just check if vertex is above us => bottom line parallel to x-axis
        return vertex.y < line1.y;
    } else {
        // check if vertex is on our left => right side
        return vertex.x < line1.x + (line2.x - line1.x) * (vertex.y - line1.y) / (line2.y - line1.y);
    }
}
```

Consider:

```cpp
Vec2<int> A( 63, -59);
Vec2<int> B(-49,  73);
Vec2<int> C( 67, -64);
bool result = IsRightSideOrFlatBottomLine(C, A, B); // <-- false.
```

However, for the (rather thin) triangle ABC, exact arithmetic gives line1.x + (line2.x - line1.x) * double(vertex.y - line1.y) / (line2.y - line1.y) == 67.24..., so vertex.x is less than that and the comparison should have been true - the truncating integer division loses that.

Edge AB also happens to pass through (7,7) which TriangleEdge::Start considers a sampling point (pixel center, as SCREEN_SCALE_FACTOR = 16), and so bias here does affect which pixels get covered (I think).

The example for the other winding order is Vec2<int> A( 2, 58); Vec2<int> B( 12, -44); Vec2<int> C( 8, -6); (also passes through (7,7)).

I think what it tries to compute can be written as:

```cpp
static inline bool IsRightSideOrFlatBottomLine(const Vec2<int>& vertex, const Vec2<int>& line1, const Vec2<int>& line2)
{
    if (line1.y == line2.y) {
        // just check if vertex is above us => bottom line parallel to x-axis
        return vertex.y < line1.y;
    } else {
        // check if vertex is on our left => right side
        int64_t delta = int64_t(vertex.x - line1.x) * int64_t(line2.y - line1.y) - int64_t(line2.x - line1.x) * int64_t(vertex.y - line1.y);
        return line2.y > line1.y ? (delta < 0) : (delta > 0);
    }
}
```

The delta is, unsurprisingly, just a 2x2 determinant, i.e. double the triangle's oriented area.

fp64 commented 1 year ago

Experiments indicate that the discrepancy occurs with about 0.0038% probability for vertices uniformly distributed in (480*16 x 272*16), but it is much more likely for tiny triangles (>0.5%). IsRightSideOrFlatBottomLine seems conservative, i.e. it can wrongly return false, but not true. Meaning there should be no cracks, but potential overdraw, which probably only matters for semi-transparent stuff (and maybe some stencil things).

fp64 commented 1 year ago

The compilers don't care to optimize return mask.x >= 0 || mask.y >= 0 || mask.z >= 0 || mask.w >= 0; (the fallback in AnyMask) into the branchless return (mask.x & mask.y & mask.z & mask.w) >= 0;. Except, funnily enough, in the NEON version something vaguely similar seems to happen, contrary to what the actual intrinsics state: https://godbolt.org/z/dos4rfrcY . Likely not worth bothering with, just thought it amusing.

unknownbrackets commented 1 year ago

Hm, IsRightSideOrFlatBottomLine looks a bit strange.

Yeah, that function is 10 years old and I think came from Dolphin: https://github.com/hrydgard/ppsspp/commit/b0d3848dc74ac44ef705f782f7f54d611354fb95

We used to have much worse coverage issues, but they were caused by other, worse problems. I think we probably don't have a test that directly covers this. The closest we have are tests.py -m gpu/primitives/ or gpu/triangle/triangle. You'll see that gpu/triangle/triangle fails mainly from linear interpolation (it generates __testcompare.png, which helps compare the pixels.)

Should probably create a specific coverage test for each primitive, I guess, and check right to left order, etc.

-[Unknown]

fp64 commented 1 year ago

https://github.com/hrydgard/ppsspp/blob/f3d95a2cef1158b6fd02d8d99d6da1cc84ea3d8e/GPU/Software/SamplerX86.cpp#L1080-L1088

Loading 32 bits at a time does seem consistently faster in SSE2: https://godbolt.org/z/1j75fz9aM

| Instruction set | Path | Speed |
| --- | --- | --- |
| SSE2 | movd | 10.4 GB/s |
| SSE2 | pinsrw | 8.4 GB/s |
| SSE4.1 | pinsrd | 12.6 GB/s |
| AVX2 | vpgatherdd | 14.5 GB/s |

I sometimes get a case where vpgatherdd is slower than the pinsrd-based version. It might not be worth it, especially for a rare path.

fp64 commented 1 year ago

Just to mention the obvious: since w0+w1+w2 = const, interpolation can be rewritten as w0*A0 + w1*A1 + w2*A2 = w0*A0 + w1*A1 + (W-w0-w1)*A2 = w0*(A0-A2) + w1*(A1-A2) + W*A2, where W = w0+w1+w2. This can be a win for constant A0, A1, A2 (since the coefficients can be precomputed; you can also bake in division by W if need be). You can even parametrize by screen X,Y rather than by w0,w1. There may be some accuracy issues though (e.g. w0=1, w1=0, w2=0 does not guarantee exactly A0 as output).
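A sketch of what that could look like with coefficients precomputed once per triangle (names are illustrative, not PPSSPP's; assumes W = w0 + w1 + w2 really is constant across the triangle):

```cpp
struct InterpCoeffs {
    float c0, c1, base;  // value = w0 * c0 + w1 * c1 + base
};

// Precompute once per triangle, with the divide by W baked in.
static inline InterpCoeffs PrecomputeInterp(float A0, float A1, float A2, float invW) {
    // w0*A0 + w1*A1 + w2*A2 == w0*(A0 - A2) + w1*(A1 - A2) + W*A2
    return { (A0 - A2) * invW, (A1 - A2) * invW, A2 };
}

// Evaluate per pixel (or per quad) with two multiply-adds.
static inline float InterpolateFast(const InterpCoeffs &c, float w0, float w1) {
    return w0 * c.c0 + w1 * c.c1 + c.base;
}
```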

unknownbrackets commented 1 year ago

Loading 32 bits at a time does seem consistently faster in SSE2:

Well, wait, we can't use PINSRD in this path - we use that for SSE4. The question here is, would it be better to MOVD from RAM and then blend? Probably only relevant to try that on a pre-SSE4 CPU. So I just left it as two PINSRW.

I sometimes get a case where vpgatherdd is slower than the pinsrd-based version.

At least on a Coffee Lake, I had pretty significant gains from using VPGATHERDD in PPSSPP (#15275.) Sure have come a long way since those speeds, heh.

This can be a win for constant A0, A1, A2 (since the coefficients can be precomputed; you can also bake in division by W if need be).

Hm, this sounds like it could be a good idea. Might have to be careful about any accuracy impacts as you mention, although since we go at half pixels I don't know how often any weight lands directly on zero. But of course, it can happen.

-[Unknown]

fp64 commented 1 year ago

The path named movd in the table (load32x4_movd in the godbolt example) does not use PINSRD, and looks about 20% faster than the PINSRW-based version there.

Probably only relevant to try that on a pre-SSE4 CPU.

On my (pre-SSE4) machine (adapted the same code from godbolt):

movd      : 1.509 GB/s
pinsrw    : 1.025 GB/s

That said, loading to a temporary in memory may be more hassle in the JIT, so it's understandable not to bother with it for a (probably) rarely used path.

On a different note, out of curiosity, I tried

LIBGL_ALWAYS_SOFTWARE=true ./PPSSPPSDL

and it was about twice as fast as PPSSPP's SW rendering (14 FPS vs 7, though still a lot slower than OpenGL, which runs at the full 60 FPS) in the 3D scene in Soulcalibur. Of course, it is not a fair comparison, since the SW renderer does some things that OpenGL simply doesn't (e.g. dithering, though that one is probably relatively cheap). Still, this suggests there is room for growth.

On Windows, to test this, you can simply place Mesa3D's opengl32.dll next to PPSSPP's executable (this worked fine when I tried it). I think you can get the DLL from e.g. https://github.com/pal1000/mesa-dist-win or https://download.qt.io/development_releases/prebuilt/llvmpipe/windows/ (I haven't tested these specific ones).

hrydgard commented 1 year ago

Running our hardware renderers in an outside software rasterizer is "cheating" in a number of ways - they will benefit from pre-decoded cached textures, for example, while our software renderer decodes on every access, and will inherit all the other inaccuracies of the hardware renderers, of course.