Old 3DS Performance - Githubissues

CarlosEFML commented 4 years ago

Is it possible to port gd_math.c to use VFP on the 3DS? This is example code that I used in my homebrew projects:

.global multMatrix44FPU
multMatrix44FPU:
    VPUSH {d8-d11}
    VLDMIA r1!, {s16-s19} // Load 1st line of m2 -> [0 1 2 3]
    VLDR.F32 s20, [r0, # 16 * 0 + 0 * 4] // Load 1st col of m1 -> g
    VLDR.F32 s21, [r0, # 16 * 1 + 0 * 4] // -> m
    VLDR.F32 s22, [r0, # 16 * 2 + 0 * 4] // -> s
    VLDR.F32 s23, [r0, # 16 * 3 + 0 * 4] // -> w
    VMUL.F32 s0, s20, s16 // = {g * 0}
    VMUL.F32 s1, s20, s17 // = {g * 1}
    VMUL.F32 s2, s20, s18 // = {g * 2}
    VMUL.F32 s3, s20, s19 // = {g * 3}
    VMUL.F32 s4, s21, s16 // = {m * 0}
    VMUL.F32 s5, s21, s17 // = {m * 1}
    VMUL.F32 s6, s21, s18 // = {m * 2}
    VMUL.F32 s7, s21, s19 // = {m * 3}
    VMUL.F32 s8, s22, s16 // = {s * 0}
    VMUL.F32 s9, s22, s17 // = {s * 1}
    VMUL.F32 s10, s22, s18 // = {s * 2}
    VMUL.F32 s11, s22, s19 // = {s * 3}
    VMUL.F32 s12, s23, s16 // = {w * 0}
    VMUL.F32 s13, s23, s17 // = {w * 1}
    VMUL.F32 s14, s23, s18 // = {w * 2}
    VMUL.F32 s15, s23, s19 // = {w * 3}
    VLDMIA r1!, {s16-s19} // Load 2nd line of m2 -> [4 5 6 7]
    VLDR.F32 s20, [r0, # 16 * 0 + 1 * 4] // Load 2nd col of m1 -> h
    VLDR.F32 s21, [r0, # 16 * 1 + 1 * 4] // -> n
    VLDR.F32 s22, [r0, # 16 * 2 + 1 * 4] // -> t
    VLDR.F32 s23, [r0, # 16 * 3 + 1 * 4] // -> x
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7}
    VLDMIA r1!, {s16-s19} // Load 3rd line of m2 -> [8 9 A B]
    VLDR.F32 s20, [r0, # 16 * 0 + 2 * 4] // Load 3rd col of m1 -> i
    VLDR.F32 s21, [r0, # 16 * 1 + 2 * 4] // -> o
    VLDR.F32 s22, [r0, # 16 * 2 + 2 * 4] // -> u
    VLDR.F32 s23, [r0, # 16 * 3 + 2 * 4] // -> y
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4} + {i * 8}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5} + {i * 9}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6} + {i * A}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7} + {i * B}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4} + {o * 8}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5} + {o * 9}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6} + {o * A}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7} + {o * B}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4} + {u * 8}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5} + {u * 9}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6} + {u * A}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7} + {u * B}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4} + {y * 8}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5} + {y * 9}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6} + {y * A}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7} + {y * B}
    VLDMIA r1, {s16-s19} // Load 4th line of m2 -> [C D E F]
    VLDR.F32 s20, [r0, # 16 * 0 + 3 * 4] // Load 4th col of m1 -> j
    VLDR.F32 s21, [r0, # 16 * 1 + 3 * 4] // -> p
    VLDR.F32 s22, [r0, # 16 * 2 + 3 * 4] // -> v
    VLDR.F32 s23, [r0, # 16 * 3 + 3 * 4] // -> z
    VMLA.F32 s0, s20, s16 // = {g * 0} + {h * 4} + {i * 8} + {j * C}
    VMLA.F32 s1, s20, s17 // = {g * 1} + {h * 5} + {i * 9} + {j * D}
    VMLA.F32 s2, s20, s18 // = {g * 2} + {h * 6} + {i * A} + {j * E}
    VMLA.F32 s3, s20, s19 // = {g * 3} + {h * 7} + {i * B} + {j * F}
    VMLA.F32 s4, s21, s16 // = {m * 0} + {n * 4} + {o * 8} + {p * C}
    VMLA.F32 s5, s21, s17 // = {m * 1} + {n * 5} + {o * 9} + {p * D}
    VMLA.F32 s6, s21, s18 // = {m * 2} + {n * 6} + {o * A} + {p * E}
    VMLA.F32 s7, s21, s19 // = {m * 3} + {n * 7} + {o * B} + {p * F}
    VMLA.F32 s8, s22, s16 // = {s * 0} + {t * 4} + {u * 8} + {v * C}
    VMLA.F32 s9, s22, s17 // = {s * 1} + {t * 5} + {u * 9} + {v * D}
    VMLA.F32 s10, s22, s18 // = {s * 2} + {t * 6} + {u * A} + {v * E}
    VMLA.F32 s11, s22, s19 // = {s * 3} + {t * 7} + {u * B} + {v * F}
    VMLA.F32 s12, s23, s16 // = {w * 0} + {x * 4} + {y * 8} + {z * C}
    VMLA.F32 s13, s23, s17 // = {w * 1} + {x * 5} + {y * 9} + {z * D}
    VMLA.F32 s14, s23, s18 // = {w * 2} + {x * 6} + {y * A} + {z * E}
    VMLA.F32 s15, s23, s19 // = {w * 3} + {x * 7} + {y * B} + {z * F}
    VPOP {d8-d11}
    VSTMIA r2, {s0-s15}
    BX lr
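
For checking the assembly above, a plain C reference can be handy. This is a minimal sketch, assuming both inputs are row-major float[4][4] matrices and that the routine computes out = m1 * m2 (which is what the register comments suggest). The name mult_matrix44_ref is hypothetical and does not come from the port.

```c
#include <stddef.h>

/* Reference 4x4 multiply, out = m1 * m2, row-major storage assumed.
 * mult_matrix44_ref is a hypothetical name used only for this sketch. */
void mult_matrix44_ref(const float m1[4][4], const float m2[4][4], float out[4][4]) {
    float tmp[4][4]; /* temporary, so out may alias m1 or m2 */

    for (size_t i = 0; i < 4; i++) {
        for (size_t j = 0; j < 4; j++) {
            /* Same order as the asm: one multiply, then three accumulates. */
            float acc = m1[i][0] * m2[0][j];
            for (size_t k = 1; k < 4; k++) {
                acc += m1[i][k] * m2[k][j];
            }
            tmp[i][j] = acc;
        }
    }

    for (size_t i = 0; i < 4; i++) {
        for (size_t j = 0; j < 4; j++) {
            out[i][j] = tmp[i][j];
        }
    }
}
```

Running both versions on the same inputs and comparing the 16 outputs is a quick way to catch a wrong offset or register in the hand-written code.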

mkst commented 4 years ago

I think a big benefit would come from moving the functions gfx_sp_vertex and/or gfx_sp_tri1 from https://github.com/sm64-port/sm64_3ds/blob/master/src/pc/gfx/gfx_pc.c to run on the GPU via vertex shader(s).

I tried simplifying the draw_triangles calls to use ~10 (slightly) different shaders and simply memcpy-ing the vbo_buf, but it didn't do anything for performance. Perhaps slightly worse if anything.

Do you have the source for the vpu stuff?

Edit: my tired brain was conflating VPU/GPU. I've swapped out the gfx_matrix_mul function for that assembly on my branch, and whilst it runs, it did not give an obvious performance improvement.

That said, do you have additional functionality implemented in assembly? Basically:

    float x = v->ob[0] * rsp.MP_matrix[0][0] + v->ob[1] * rsp.MP_matrix[1][0] + v->ob[2] * rsp.MP_matrix[2][0] + rsp.MP_matrix[3][0];
    float y = v->ob[0] * rsp.MP_matrix[0][1] + v->ob[1] * rsp.MP_matrix[1][1] + v->ob[2] * rsp.MP_matrix[2][1] + rsp.MP_matrix[3][1];
    float z = v->ob[0] * rsp.MP_matrix[0][2] + v->ob[1] * rsp.MP_matrix[1][2] + v->ob[2] * rsp.MP_matrix[2][2] + rsp.MP_matrix[3][2];
    float w = v->ob[0] * rsp.MP_matrix[0][3] + v->ob[1] * rsp.MP_matrix[1][3] + v->ob[2] * rsp.MP_matrix[2][3] + rsp.MP_matrix[3][3];

which is DP4, I think?

These two would also be good candidates:
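
To make the DP4 reading concrete: each of x, y, z, w is a 4-component dot product of (ob[0], ob[1], ob[2], 1) with one column of the matrix. Below is a minimal sketch of that as a standalone function. It assumes float inputs and a row-major float[4][4] matrix; transform_vertex is a hypothetical name, not a function from gfx_pc.c.

```c
/* Transform an object-space position by a 4x4 matrix using the same
 * row-vector convention as the snippet above: out = [x y z 1] * m.
 * transform_vertex is a hypothetical name used only for this sketch. */
void transform_vertex(const float ob[3], const float m[4][4], float out[4]) {
    for (int c = 0; c < 4; c++) {
        /* One dot product per output component; the implicit w = 1
         * turns the last term into a plain add of m[3][c]. */
        out[c] = ob[0] * m[0][c] + ob[1] * m[1][c] + ob[2] * m[2][c] + m[3][c];
    }
}
```

With a pure translation matrix (identity plus a last row of 10, 20, 30, 1), the position (1, 2, 3) comes out as (11, 22, 33, 1).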

static inline void gfx_normalize_vector(float v[3]) {
    float s = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] /= s;
    v[1] /= s;
    v[2] /= s;
}

static inline void gfx_transposed_matrix_mul(float res[3], const float a[3], const float b[4][4]) {
    res[0] = a[0] * b[0][0] + a[1] * b[0][1] + a[2] * b[0][2];
    res[1] = a[0] * b[1][0] + a[1] * b[1][1] + a[2] * b[1][2];
    res[2] = a[0] * b[2][0] + a[1] * b[2][1] + a[2] * b[2][2];
}

mkst commented 4 years ago

@CarlosEFML I've just managed to get audio moved across to the 2nd CPU and it's drastically improved performance. It's by no means perfect, but the slowdowns during winged-cap / surfing turtle etc are significantly reduced.

CarlosEFML commented 4 years ago

@mkst Good job!!! I'll try to port gfx_normalize_vector and gfx_transposed_matrix_mul to VFP ASM.

mkst commented 4 years ago

The first piece of code (generating x, y, z, w) is called for every vertex (see gfx_sp_vertex) so I think that would be a good target.

Are you writing the ASM by hand, or writing a simple function and compiling it with -mfpu=neon, or using some other method? In any case, I look forward to the results!

CarlosEFML commented 4 years ago

I'm writing it by hand, and it's been a long time since I last wrote code for the 3DS. But you're right, I will try to optimize gfx_sp_vertex first.

abelol954 commented 4 years ago

They have a downloadable version in .cia format.

CarlosEFML commented 4 years ago

Bad news. I did not notice any improvement from using VFP. Maybe that's because of the function call overhead (I couldn't get inline asm to work). I created a PR with the code in case you want to test it.

Unfortunately, the 3DS only supports scalar operations and does not support NEON vector operations like this: VMUL.F32 q0, q1, q2
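
On the inline asm point, here is a minimal sketch of how a scalar VFP multiply-accumulate might be written with GCC extended inline asm, so the compiler can inline it and avoid the call overhead. It assumes a GCC-compatible compiler targeting ARM with hardware floating point enabled, and it falls back to plain C on any other target. The names vfp_mla and dot4 are hypothetical. Whether this actually beats what the compiler already generates would still have to be measured.

```c
/* acc += a * b. On ARM with hardware FP this emits a single VMLA.F32;
 * the "t" constraint asks GCC for a single-precision VFP register.
 * Elsewhere it falls back to plain C. vfp_mla is a hypothetical name. */
static inline float vfp_mla(float acc, float a, float b) {
#if defined(__GNUC__) && defined(__arm__) && defined(__ARM_FP)
    __asm__("vmla.f32 %0, %1, %2" : "+t"(acc) : "t"(a), "t"(b));
    return acc;
#else
    return acc + a * b;
#endif
}

/* 4-component dot product built from the helper above (hypothetical name). */
float dot4(const float a[4], const float b[4]) {
    float acc = a[0] * b[0];
    for (int i = 1; i < 4; i++) {
        acc = vfp_mla(acc, a[i], b[i]);
    }
    return acc;
}
```

Because the helper is static inline and uses register constraints instead of fixed register names, the compiler is free to choose registers and schedule the surrounding loads itself.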

nintendo-3ds-hardware-thread

mkst commented 4 years ago

How did you test? Were you noting CPU usage before/after or just "feeling it out"?
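
For a before/after comparison that doesn't rely on feel, a small timing harness is one option. The sketch below uses only standard C (clock() from time.h) and averages over many calls. The names time_per_call and count_call are hypothetical. On the 3DS itself a tick counter such as libctru's svcGetSystemTick would probably give better resolution, but that is an assumption and not something tested here.

```c
#include <time.h>

/* Average CPU time in seconds for one call of fn(arg), measured over
 * `iterations` calls. time_per_call is a hypothetical helper name. */
double time_per_call(void (*fn)(void *), void *arg, unsigned iterations) {
    if (iterations == 0) {
        return 0.0; /* nothing to measure */
    }
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++) {
        fn(arg);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}

/* Example workload for the harness: counts how often it was called. */
void count_call(void *arg) {
    (*(unsigned *)arg)++;
}
```

Timing the C version and the VFP version of the same function with identical inputs and a large iteration count would show whether the assembly helps at all.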

Also, have you tried my fork (based off Gericom's) with audio running on the 2nd CPU?

CarlosEFML commented 4 years ago

Yes, I was just going by feel. But now I have applied this code to your fork and the result is the same. It seems that audio is the real villain in this port. With audio disabled, we get a solid 30 fps on the O3DS.

mkst commented 4 years ago

@CarlosEFML do you not get a smooth(er) experience with audio on the OS core?

Can any of the audio code (mixer.c) be accelerated by the VFP? There are already SSE4.1 and NEON optimisations for the vanilla port, but the poor CPU in the 3DS doesn't have NEON.

Also, I took your transfByMatrix44FPU code and used it in gfx_pc.c so it would be more generic than gd_math.c, which is only used for the intro Mario head. That said, I've just added transfByMatrix44FPU too (locally) and it didn't have a noticeable impact either, sadly.

CarlosEFML commented 4 years ago

I tested the fork with the audio processing on CPU=1, but only with .3dsx and didn't notice any improvement.

mkst commented 4 years ago

Interesting, are you running a new(ish) version of Luma (i.e. 10.1 or above)? It might be that the call to request 80% of CPU1 fails, so the thread is started on the app core (CPU0) instead, which would give no improvement and potentially worse performance.

If you go to the first level (BOB) and get the winged cap, the game should not slow down to half-speed like it does normally.

If you really want to test, you could check out the 3ds-shaders fork; its bottom screen is a console, so it should tell you which CPU is used for audio:

printf("Created audio thread on core %i\n", cpu);

I might put in a while loop to set the OS core time as high as it can go before the request fails (it was 30% on older versions of Luma).
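
The loop idea could look roughly like the sketch below: step the requested percentage down from the preferred value until a request succeeds, and fall back to the app core if none does. Everything here is a stand-in. The callback limit_request_fn represents whatever system call the fork actually uses (in libctru that is presumably APT_SetAppCpuTimeLimit, but the fork's code is not shown in this thread), and fake_request_max30 only imitates the 30% cap mentioned for older Luma versions.

```c
/* Stand-in for the real time-limit request; returns 0 on success.
 * All names in this sketch are hypothetical. */
typedef int (*limit_request_fn)(unsigned percent);

/* Try max_percent, then lower values in steps of `step`, never going
 * below min_percent. Returns the first accepted value, or 0 if none. */
unsigned find_highest_limit(limit_request_fn request, unsigned max_percent,
                            unsigned min_percent, unsigned step) {
    if (step == 0 || min_percent == 0 || max_percent < min_percent) {
        return 0;
    }
    for (unsigned p = max_percent; p >= min_percent; p -= step) {
        if (request(p) == 0) {
            return p;
        }
        if (p - min_percent < step) {
            break; /* the next step would drop below min_percent */
        }
    }
    return 0;
}

/* Use core 1 only if some share of it was granted, else the app core 0. */
int pick_audio_core(unsigned granted_percent) {
    return granted_percent > 0 ? 1 : 0;
}

/* Fake request for demonstration: accepts anything up to 30%. */
int fake_request_max30(unsigned percent) {
    return percent <= 30 ? 0 : -1;
}
```

With the fake request, find_highest_limit(fake_request_max30, 80, 5, 5) settles on 30, and pick_audio_core(30) selects core 1. If every request fails, the result is 0 and the audio thread would stay on core 0.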

BTW, feel free to join the Discord if you'd prefer some real-time communication!

sm64-port / sm64_3ds

Old 3DS Performance #28