Multithreaded Rasterization?

nickwanninger commented 2 years ago

Hi! I just want to say that this project is excellent, and I plan on getting it running in my kernel as soon as I get the time.

While it's quite easy for a hobby kernel to take advantage of multiple CPU cores, implementing (or porting) a GPU driver is quite the Sisyphean task. I was wondering if there are plans to enable parallelized rasterization in a similar way to how GPUs do it, but in software with multiple CPU threads (pthreads, for example). I'm not super familiar with the codebase, but would that be a considerable undertaking/refactor of the core renderer? I'd assume this feature would be hidden behind an opt-in flag to maintain support for systems without multiple cores, or even without the pthread interface.

rswinkle commented 2 years ago

Thanks, it's always nice to hear people want to use your project. Let me know if you run into any problems.

As far as multithreading goes, I have no plans for it. The code is very single-threaded, and it would unfortunately take considerable refactoring to change the pipeline to handle multithreading in that way. I'm also not sure the performance gains would be worth the sacrifice in code simplicity and readability. This is especially true on lower-powered devices, which sometimes still have only 1 or 2 cores (maybe 4 at most), and where you're far more likely to be limited by other factors anyway.

I have always thought that pglDrawFrame would be great for OpenMP though. That would be extremely easy and not touch any of the regular pipeline.
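A rough illustration of why this is easy, assuming (as the shadertoy demo suggests) that a pglDrawFrame-style pass runs the fragment shader once per pixel and writes only that pixel: every iteration is independent, so a single OpenMP pragma can split the rows across threads. This is a sketch, not PortableGL's code; `draw_frame_sketch`, `frag_fn`, and `gradient_shader` are hypothetical stand-ins.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-pixel "fragment shader" signature. */
typedef uint32_t (*frag_fn)(int x, int y, int w, int h);

/* Hypothetical shader: red grows left to right, green grows top to bottom. */
static uint32_t gradient_shader(int x, int y, int w, int h)
{
    uint32_t r = (uint32_t)(255 * x / (w > 1 ? w - 1 : 1));
    uint32_t g = (uint32_t)(255 * y / (h > 1 ? h - 1 : 1));
    return 0xFF000000u | (r << 16) | (g << 8);
}

/* Each pixel is computed independently and written to its own slot, so rows
 * can be split across threads with no locking. Without -fopenmp the pragma is
 * ignored and the loop runs serially with identical results. */
void draw_frame_sketch(uint32_t *pixels, int w, int h, frag_fn shader)
{
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x)
            pixels[(size_t)y * w + x] = shader(x, y, w, h);
    }
}
```

Building with `-fopenmp` (GCC/Clang) enables the parallel loop; leaving it off keeps the code single-threaded, which fits the opt-in behavior asked about in the first comment.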

Even if I were to change my mind at some point, it would be after getting to "1.0", which requires some kind of proper render-to-texture capability and anything else I'm missing to be able to port the rest of LearnOpenGL. Since that requires larger blocks of dedicated time and focus, it probably won't happen for a while, and probably not without some kind of funding. Maybe if I were wildly successful, multithreading could be a stretch goal or something.

EDIT: clarification. I should also add that there's a big difference between threading vertices and threading fragments. The GPU does both. I would probably never do the former, but the latter is possible. I'd still have to do a lot of thinking and testing, and it would probably use OpenMP as well, like this updated TinyGL (though I haven't looked at it in detail).

rswinkle commented 2 years ago

Also, I couldn't use pthreads. It's not called PortableGL for nothing. I'd probably have to use something like TinyCThread.
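TinyCThread implements the C11 `<threads.h>` API (`thrd_create`, `thrd_join`, and so on) on top of pthreads or Win32 threads, so code written against that API stays portable. A minimal sketch, assuming a C11 toolchain that ships `<threads.h>` (with TinyCThread you would include `tinycthread.h` instead). `band_job`, `fill_band`, and `fill_parallel` are hypothetical, and `fill_band` writes a placeholder value where real shading would happen.

```c
#include <threads.h>  /* C11 threads; TinyCThread provides the same API via "tinycthread.h" */
#include <stdint.h>
#include <stddef.h>

#define MAX_WORKERS 16

/* Hypothetical job description: one horizontal band [y0, y1) of the framebuffer. */
typedef struct {
    uint32_t *pixels;
    int w, y0, y1;
} band_job;

/* Thread entry point: fills its band with a placeholder "shaded" value
 * (the linear pixel index). Bands never overlap, so no locking is needed. */
static int fill_band(void *arg)
{
    band_job *job = (band_job *)arg;
    for (int y = job->y0; y < job->y1; ++y)
        for (int x = 0; x < job->w; ++x)
            job->pixels[(size_t)y * job->w + x] = (uint32_t)(y * job->w + x);
    return 0;
}

/* Split the rows into nthreads contiguous bands, run one thread per band,
 * and join them all. Returns 0 on success, 1 if a thread failed to start. */
int fill_parallel(uint32_t *pixels, int w, int h, int nthreads)
{
    thrd_t threads[MAX_WORKERS];
    band_job jobs[MAX_WORKERS];
    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_WORKERS) nthreads = MAX_WORKERS;

    int started = 0, err = 0;
    for (int i = 0; i < nthreads; ++i) {
        jobs[i] = (band_job){ pixels, w, h * i / nthreads, h * (i + 1) / nthreads };
        if (thrd_create(&threads[i], fill_band, &jobs[i]) != thrd_success) {
            err = 1;
            break;
        }
        ++started;
    }
    /* Always join whatever was started, even after a failure. */
    for (int i = 0; i < started; ++i)
        thrd_join(threads[i], NULL);
    return err;
}
```

Compared to the OpenMP route, this is more code for the same split, but it only depends on a small library rather than on compiler support for OpenMP.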

rswinkle commented 2 years ago

I just pushed 2 commits that add OpenMP support for pglDrawFrame and draw_triangle_fill. As I suspected, the former is definitely worth it: I get much faster times with the shadertoy demo. The latter, with the few things I enabled it on, not so much, maybe 30% faster. I'll have to do more research, but I'll leave it in for now.
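For comparison, a triangle fill parallelized the same way might look like the sketch below. It is a generic edge-function rasterizer, not PortableGL's draw_triangle_fill: it assumes a counter-clockwise triangle in pixel coordinates, tests pixel centers, writes a flat color, and skips depth, attribute interpolation, and the top-left fill rule. `fill_triangle_sketch` and its helpers are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>

/* Edge function: cross product of (b - a) and (p - a). It is >= 0 when p lies
 * on the inner side of edge a->b for a counter-clockwise triangle. */
static float edge(float ax, float ay, float bx, float by, float px, float py)
{
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
}

static float min3(float a, float b, float c) { float m = a < b ? a : b; return m < c ? m : c; }
static float max3(float a, float b, float c) { float m = a > b ? a : b; return m > c ? m : c; }

/* Fill one CCW triangle v[3] = {x, y} into a w*h buffer. Within a single
 * triangle each pixel is written at most once, so rows can go to different
 * threads. No top-left rule: pixels exactly on an edge count as inside. */
void fill_triangle_sketch(uint32_t *pixels, int w, int h,
                          const float v[3][2], uint32_t color)
{
    /* Bounding box, clamped to the buffer. */
    int xmin = (int)min3(v[0][0], v[1][0], v[2][0]);
    int xmax = (int)max3(v[0][0], v[1][0], v[2][0]) + 1;
    int ymin = (int)min3(v[0][1], v[1][1], v[2][1]);
    int ymax = (int)max3(v[0][1], v[1][1], v[2][1]) + 1;
    if (xmin < 0) xmin = 0;
    if (ymin < 0) ymin = 0;
    if (xmax > w) xmax = w;
    if (ymax > h) ymax = h;

    /* Rows vary in width across a triangle, so dynamic scheduling balances load. */
    #pragma omp parallel for schedule(dynamic, 4)
    for (int y = ymin; y < ymax; ++y) {
        float py = y + 0.5f;
        for (int x = xmin; x < xmax; ++x) {
            float px = x + 0.5f;
            if (edge(v[0][0], v[0][1], v[1][0], v[1][1], px, py) >= 0 &&
                edge(v[1][0], v[1][1], v[2][0], v[2][1], px, py) >= 0 &&
                edge(v[2][0], v[2][1], v[0][0], v[0][1], px, py) >= 0)
                pixels[(size_t)y * w + x] = color;
        }
    }
}
```

Note that this only parallelizes work inside one triangle; ordering between overlapping triangles (depth, blending) is untouched and still has to follow submission order.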

RicoP commented 1 year ago

I think it could be relatively simple to let users do multithreaded rendering with PGL.

I would propose a simple API change: for every glXXX(param1, param2, paramN) function there would be a pglXXX(glContext* c, param1, param2, paramN) function with an explicit glContext parameter.

Then all glXXX calls would be simple wrappers calling the corresponding pglXXX function with the global context.

So instead of

```c
static glContext *c;

void glDepthMask(GLboolean flag)
{
    c->depth_mask = flag;
}
```

We would have

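```c
// Explicit-context version: operates on whichever context is passed in.
void pglDepthMask(glContext *c, GLboolean flag)
{
    c->depth_mask = flag;
}

static glContext *g_ctx;

// Standard entry point: a thin wrapper that forwards the global context.
void glDepthMask(GLboolean flag)
{
    pglDepthMask(g_ctx, flag);
}
```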

This would let users create several independent glContexts that can be controlled independently from different threads. Several worker threads could then iterate over an object pool and render each object into their own renderbuffer.

In a final step, the different renderbuffers can be merged into one by simply comparing the individual depth values and picking the closest color fragment.
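The depth-only merge described above could look like this minimal sketch, assuming each worker leaves behind a color buffer and a depth buffer of the same size, only opaque geometry, and a GL_LESS-style test where a smaller depth is closer. `layer` and `merge_by_depth` are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>
#include <float.h>

/* Hypothetical per-thread output: color and depth buffers of equal size. */
typedef struct {
    const uint32_t *color;
    const float *depth;   /* smaller value = closer, as with GL_LESS */
} layer;

/* Merge n layers pixel by pixel, keeping the color with the closest depth.
 * Untouched pixels still hold the clear depth (e.g. 1.0), so if every layer
 * was cleared to the same color, empty pixels keep that clear color.
 * Ties go to the earlier layer because of the strict '<'. */
void merge_by_depth(uint32_t *out_color, float *out_depth,
                    const layer *layers, int n, size_t npixels)
{
    for (size_t i = 0; i < npixels; ++i) {
        float best = FLT_MAX;
        uint32_t color = 0;
        for (int k = 0; k < n; ++k) {
            if (layers[k].depth[i] < best) {
                best = layers[k].depth[i];
                color = layers[k].color[i];
            }
        }
        out_color[i] = color;
        out_depth[i] = best;
    }
}
```

The tie-breaking rule is one spot where the order in which objects were handed to workers leaks into the final image.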

rswinkle commented 1 year ago

There are some issues with what you describe.

First, I already use the pgl prefix for the non-standard functions that PGL provides. I'd have to come up with something else to differentiate them.

Second, the merge would be more complicated than just looking at the depth: the order of draw calls and access to unified output buffers matter when you're enabling or modifying blending/depth/stencil state, etc. I'm pretty sure what you're describing would be difficult, if not impossible, to implement while obeying the spec. It would only work correctly for the simplest cases, and it would not be worth the overhead in code complexity or memory. I'm not even sure it would perform better.
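A one-pixel worked example of the blending problem, assuming standard GL_SRC_ALPHA / GL_ONE_MINUS_SRC_ALPHA blending, a black clear color, and two draws split across two contexts: an opaque red fragment at depth 0.9 (thread A) and a 50% transparent blue fragment at depth 0.5 (thread B). `rgb`, `blend_over`, `sequential_result`, and `depth_merged_result` are hypothetical names.

```c
/* Simple linear RGB color. */
typedef struct { float r, g, b; } rgb;

/* GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA blend of src over dst. */
static rgb blend_over(rgb src, float alpha, rgb dst)
{
    rgb out = { src.r * alpha + dst.r * (1.0f - alpha),
                src.g * alpha + dst.g * (1.0f - alpha),
                src.b * alpha + dst.b * (1.0f - alpha) };
    return out;
}

/* One context: red is drawn first, then blue passes the depth test and
 * blends with the red already in the framebuffer. */
rgb sequential_result(void)
{
    rgb red = {1, 0, 0}, blue = {0, 0, 1};
    rgb px = red;                       /* opaque red, depth 0.9 */
    px = blend_over(blue, 0.5f, px);    /* blue, depth 0.5, blends with red */
    return px;
}

/* Two contexts: thread B blends blue against its own black clear color,
 * then the merge keeps the closer fragment and the red underneath is lost. */
rgb depth_merged_result(void)
{
    rgb clear = {0, 0, 0}, red = {1, 0, 0}, blue = {0, 0, 1};
    rgb layer_a = red;                            float depth_a = 0.9f;
    rgb layer_b = blend_over(blue, 0.5f, clear);  float depth_b = 0.5f;
    return depth_b < depth_a ? layer_b : layer_a; /* keep the closer fragment */
}
```

The single-context result is (0.5, 0, 0.5) while the merged result is (0, 0, 0.5): no per-pixel depth comparison after the fact can recover the red that blue was supposed to blend over.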

There are ways of parallelizing the pipeline, but they would require pretty significant refactoring of the internals and, again, an increase in complexity and decrease in readability. If I ever do make it properly parallel, it will be long after I've finished/perfected the single-threaded version, and only if and to the extent that the performance gain is worth the added complexity.

rswinkle / PortableGL

Multithreaded Rasterization? #6