Open CarlosEFML opened 4 years ago
I think a big benefit would come from moving the functions gfx_sp_vertex and/or gfx_sp_tri1 from https://github.com/sm64-port/sm64_3ds/blob/master/src/pc/gfx/gfx_pc.c to run on the GPU via vertex shader(s).
I tried simplifying the draw_triangles calls to use ~10 (slightly) different shaders and simply memcpy-ing the vbo_buf, but it didn't do anything for performance. Perhaps slightly worse if anything.
Do you have the source for the VPU stuff?
Edit: my tired brain was conflating VPU and GPU. I've swapped out the gfx_matrix_mul function for that assembly on my branch, and whilst it runs, it did not give an obvious performance improvement.
That said, do you have additional functionality implemented in assembly? Basically:
float x = v->ob[0] * rsp.MP_matrix[0][0] + v->ob[1] * rsp.MP_matrix[1][0] + v->ob[2] * rsp.MP_matrix[2][0] + rsp.MP_matrix[3][0];
float y = v->ob[0] * rsp.MP_matrix[0][1] + v->ob[1] * rsp.MP_matrix[1][1] + v->ob[2] * rsp.MP_matrix[2][1] + rsp.MP_matrix[3][1];
float z = v->ob[0] * rsp.MP_matrix[0][2] + v->ob[1] * rsp.MP_matrix[1][2] + v->ob[2] * rsp.MP_matrix[2][2] + rsp.MP_matrix[3][2];
float w = v->ob[0] * rsp.MP_matrix[0][3] + v->ob[1] * rsp.MP_matrix[1][3] + v->ob[2] * rsp.MP_matrix[2][3] + rsp.MP_matrix[3][3];
which is DP4, I think?
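To make the DP4 comparison concrete, here is a minimal sketch of the same arithmetic written as four dot products of (x, y, z, 1) with the matrix columns. The names SketchVtx and sketch_transform_vertex are hypothetical and only stand in for the port's own vertex type and matrix; this is not the code from gfx_pc.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for the object-space position held in the port's vertex type. */
typedef struct {
    short ob[3];
} SketchVtx;

/* Transform (x, y, z, 1) by a 4x4 matrix in row-vector convention:
 * out[col] is the dot product of the input with column `col` of m. */
static void sketch_transform_vertex(float out[4], const SketchVtx *v, const float m[4][4]) {
    const float in[4] = { (float)v->ob[0], (float)v->ob[1], (float)v->ob[2], 1.0f };
    for (size_t col = 0; col < 4; col++) {
        out[col] = in[0] * m[0][col] + in[1] * m[1][col]
                 + in[2] * m[2][col] + in[3] * m[3][col];
    }
}
```

Each loop iteration is one four-component dot product, which is the shape of work a DP4-style instruction does in a single step.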
and these two would be good:
static inline void gfx_normalize_vector(float v[3]) {
    float s = sqrtf(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] /= s;
    v[1] /= s;
    v[2] /= s;
}
static inline void gfx_transposed_matrix_mul(float res[3], const float a[3], const float b[4][4]) {
    res[0] = a[0] * b[0][0] + a[1] * b[0][1] + a[2] * b[0][2];
    res[1] = a[0] * b[1][0] + a[1] * b[1][1] + a[2] * b[1][2];
    res[2] = a[0] * b[2][0] + a[1] * b[2][1] + a[2] * b[2][2];
}
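For anyone porting the second function, it may help to see what it computes: it multiplies the vector by the transpose of the upper-left 3x3 block, which for a pure rotation is the inverse rotation. The sketch below uses hypothetical names (sketch_matrix_mul3, sketch_transposed_matrix_mul3) and assumes the same row-vector convention as the x, y, z, w code earlier in the thread.

```c
/* Forward transform of a direction by the upper-left 3x3 block (row-vector convention). */
static void sketch_matrix_mul3(float res[3], const float a[3], const float b[4][4]) {
    for (int j = 0; j < 3; j++) {
        res[j] = a[0] * b[0][j] + a[1] * b[1][j] + a[2] * b[2][j];
    }
}

/* Same arithmetic as gfx_transposed_matrix_mul, written as a loop:
 * each result component is a dot product with one matrix row. */
static void sketch_transposed_matrix_mul3(float res[3], const float a[3], const float b[4][4]) {
    for (int i = 0; i < 3; i++) {
        res[i] = a[0] * b[i][0] + a[1] * b[i][1] + a[2] * b[i][2];
    }
}
```

With a rotation matrix, applying the forward version and then the transposed version returns the original direction, which is a cheap sanity check for any assembly replacement.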
@CarlosEFML I've just managed to get audio moved across to the 2nd CPU and it's drastically improved performance. It's by no means perfect, but the slowdowns during winged-cap / surfing turtle etc are significantly reduced.
@mkst Good job!!! I'll try to port gfx_normalize_vector and gfx_transposed_matrix_mul to VFP ASM.
The first piece of code (generating x, y, z, w) is called for every vertex (see gfx_sp_vertex), so I think that would be a good target.
Are you writing the ASM by hand, or creating a simple function and then compiling with -mfpu=neon, or using some other method? In any case, I look forward to the results!
I'm writing by hand and it's been a long time since my last line of code for 3DS. But you're right, I will try to optimize gfx_sp_vertex first.
they have a downloadable version in .cia
Bad news. I did not notice any improvement with the use of VFP. Maybe that is because of the function call overhead (I couldn't get inline asm to work). I created a PR with the code in case you want to test.
Unfortunately, the 3DS only supports scalar operations and does not support vector operations like this: VMUL.F32 q0, q1, q2
How did you test? Were you noting CPU usage before/after or just "feeling it out"?
Also, have you tried my fork (based off Gericom's) with audio running on the 2nd CPU?
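One way to get numbers instead of a feeling is to time the candidate function over many calls before and after the change. The following is only a sketch: sketch_time_per_call and sketch_count_call are hypothetical names, and it assumes the standard clock() is good enough; on hardware a finer tick counter would probably be a better choice.

```c
#include <time.h>

/* Hypothetical workload used for illustration: counts how often it was called. */
static void sketch_count_call(void *ctx) {
    (*(unsigned *)ctx)++;
}

/* Average CPU seconds per call of fn(ctx) over `iterations` runs.
 * Returns 0.0 when iterations is 0. */
static double sketch_time_per_call(void (*fn)(void *), void *ctx, unsigned iterations) {
    if (iterations == 0) {
        return 0.0;
    }
    clock_t start = clock();
    for (unsigned i = 0; i < iterations; i++) {
        fn(ctx);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / iterations;
}
```

Running this once with the C version and once with the VFP version of the same function, on identical inputs, would show whether the call overhead cancels out the gain.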
Yes, just feelings. But now I have applied this code to your fork and the result is the same. It seems that audio is the real villain in this port. When disabling the audio, we have a solid 30 fps on the O3DS.
@CarlosEFML do you not get a smooth(er) experience with audio on the OS core?
Can any of the audio code (mixer.c) be accelerated by the VPU? There are already SSE4.1 and NEON optimisations for the vanilla port, but the poor CPU in the 3DS doesn't have NEON.
Also, I took your transfByMatrix44FPU code and used it in gfx_pc.c so it would be more generic than the gd_math.c, which is only used for the intro Mario head. That said, I've just added transfByMatrix44FPU too (locally) and it didn't have a noticeable impact either, sadly.
I tested the fork with the audio processing on CPU=1, but only with .3dsx and didn't notice any improvement.
Interesting, are you running a new(ish) version of Luma (i.e. 10.1 or above)? Might be that the call to request 80% of CPU1 fails so the thread is started on the app core (CPU0) instead - which would have no improvement, potentially worse performance.
If you go to the first level (BOB) and get the winged cap, the game should not slow down to half-speed like it does normally.
If you really want to test, you could check out the 3ds-shaders fork. As the bottom screen is a console, it should tell you which CPU is used for audio:
printf("Created audio thread on core %i\n", cpu);
I might put in a while loop to set the OS time to as high as it can before it fails (was 30% on older version of luma).
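The probing loop could look roughly like this. It is a sketch under assumptions: the callback type SketchSetLimitFn and both function names are hypothetical, and the callback stands in for whatever system call actually requests the CPU1 time share (returning 0 on success). It walks down from the highest value so the first accepted value is the maximum.

```c
/* Hypothetical callback: request `percent` of CPU1, return 0 on success. */
typedef int (*SketchSetLimitFn)(unsigned percent);

/* Fake system for illustration: accepts at most 30%, like the older Luma mentioned above. */
static int sketch_fake_limit_30(unsigned percent) {
    return percent <= 30 ? 0 : 1;
}

/* Try `highest`, then step downward until a request succeeds.
 * Returns the accepted percentage, or 0 if nothing in [lowest, highest] works. */
static unsigned sketch_find_max_limit(SketchSetLimitFn try_set, unsigned highest,
                                      unsigned lowest, unsigned step) {
    if (step == 0 || lowest > highest) {
        return 0;
    }
    unsigned percent = highest;
    for (;;) {
        if (try_set(percent) == 0) {
            return percent;
        }
        if (percent < lowest + step) {
            break; /* the next step would drop below `lowest` */
        }
        percent -= step;
    }
    return 0;
}
```

If the function returns 0, the audio thread would have to stay on the app core, which matches the fallback behaviour described above.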
BTW, feel free to join the Discord if you'd prefer some real-time communication!
Is it possible to port gd_math.c to use VFP on the 3DS? This is some example code that I used to use in my homebrews: