[webgl] Removal of GPGPU example

dleeftink commented 2 weeks ago

Just wondering why this code example (or the GPGPU class in general) has been removed from the current build:

https://github.com/thi-ng/umbrella/blob/161b4f8afaef0df742a8e2c7776993b828662589/examples/webgl-gpgpu-basics/src/index.ts

This seems like a very useful abstraction (especially the jobrunner), but I can understand if it has since been superseded by a newer version. If that is the case, could you point me to a recent example that demonstrates GPGPU capabilities?

postspectacular commented 2 weeks ago

Fret not, @dleeftink! The feature is still there (and actually expanded in scope). Here are a few examples showing the new approach (fundamentally the same, except that reading the results back from FBOs/textures is currently out of scope and must be handled separately, see below):

Demo	Source
https://demo.thi.ng/umbrella/webgl-game-of-life/	https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-game-of-life/src/index.ts
https://demo.thi.ng/umbrella/webgl-texture-paint/	https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-texture-paint/src/index.ts
https://demo.thi.ng/umbrella/webgl-multipass/	https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-multipass/src/index.ts
https://demo.thi.ng/umbrella/webgl-float-fbo/	https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-float-fbo/src/index.ts

For an example of how to read back a texture, have a look here:

https://github.com/thi-ng/umbrella/blob/86e43da88d87fa291f32ce4d3ee2ca06ae0723a4/examples/webgl-texture-paint/src/index.ts#L160-L189
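
For readers who cannot follow the link, the general readback technique can be sketched with plain WebGL2 calls. This is a minimal sketch, not the code from the linked example and not the thi.ng/webgl API: the function name `readTextureRGBA8` is hypothetical, and it assumes an RGBA8 texture (reading float textures additionally requires a renderable float format, e.g. via `EXT_color_buffer_float`).

```typescript
/**
 * Hypothetical helper: copies the contents of an RGBA8 texture to the CPU
 * by attaching it to a temporary framebuffer and calling readPixels().
 */
function readTextureRGBA8(
  gl: WebGL2RenderingContext,
  tex: WebGLTexture,
  width: number,
  height: number
): Uint8Array {
  const fbo = gl.createFramebuffer();
  gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
  // Attach mip level 0 of the texture as the color target of the FBO
  gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, tex, 0);
  if (gl.checkFramebufferStatus(gl.FRAMEBUFFER) !== gl.FRAMEBUFFER_COMPLETE) {
    gl.bindFramebuffer(gl.FRAMEBUFFER, null);
    gl.deleteFramebuffer(fbo);
    throw new Error("framebuffer incomplete, texture format may not be readable");
  }
  // 4 bytes (RGBA) per texel
  const pixels = new Uint8Array(width * height * 4);
  // readPixels() reads from the currently bound framebuffer and blocks
  // until the GPU has finished writing to it
  gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, pixels);
  gl.bindFramebuffer(gl.FRAMEBUFFER, null);
  gl.deleteFramebuffer(fbo);
  return pixels;
}
```

In a real pipeline the FBO would usually be created once and reused rather than rebuilt for every read.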

Hth! :)

dleeftink commented 2 weeks ago

Thank you for pointing me to these, and the webgl layer in general!

With regard to IO via readPixels, I wonder whether using DataView + transform feedback would prove more performant for data IO? The use case I am envisioning is parallel reduction on the GPU, and out of the many WebGL packages I've tested, few (if any) provide parallel reduction out of the box. Would be cool to try and implement this using the @thi.ng/webgl package.

For instance, here's an older example using regl:

https://github.com/regl-project/regl/blob/gh-pages/example/reduction.js
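
To make the idea of a parallel reduction concrete, here is a CPU reference of the usual multi-pass scheme, written as a sketch under assumptions: a square power-of-two "texture" stored as a flat `Float32Array`, and a 2x2 block combined per output texel in each pass. On the GPU each pass would be one fragment shader invocation per output texel; the names `reducePass` and `reduceAll` are hypothetical and not taken from regl or thi.ng/webgl.

```typescript
type ReduceOp = (a: number, b: number) => number;

/** One pass: every output texel combines a 2x2 block of the input. */
function reducePass(src: Float32Array, size: number, op: ReduceOp): Float32Array {
  const half = size >> 1;
  const dst = new Float32Array(half * half);
  for (let y = 0; y < half; y++) {
    for (let x = 0; x < half; x++) {
      // index of the top-left texel of the 2x2 source block
      const i = 2 * y * size + 2 * x;
      dst[y * half + x] = op(
        op(src[i], src[i + 1]),
        op(src[i + size], src[i + size + 1])
      );
    }
  }
  return dst;
}

/** Repeats the pass until a single value (a 1x1 "texture") remains. */
function reduceAll(data: Float32Array, size: number, op: ReduceOp): number {
  if (size < 1 || (size & (size - 1)) !== 0 || data.length !== size * size) {
    throw new Error("expected a square, power-of-two sized input");
  }
  let cur = data;
  for (let s = size; s > 1; s >>= 1) {
    cur = reducePass(cur, s, op);
  }
  return cur[0];
}

// Usage: a 4x4 input holding the values 0..15
const input = Float32Array.from({ length: 16 }, (_, i) => i);
const total = reduceAll(input, 4, (a, b) => a + b); // 120
const maximum = reduceAll(input, 4, Math.max); // 15
```

The operator has to be associative and commutative (sum, min, max) for the block-wise order of evaluation not to matter.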

postspectacular commented 2 weeks ago

hi @dleeftink - this has tickled my interest and I've just uploaded a new example showing a version of this kind of reduction (using thi.ng/webgl & thi.ng/shader-ast):

Demo: https://demo.thi.ng/umbrella/gpgpu-reduce/

Source code: https://github.com/thi-ng/umbrella/blob/develop/examples/gpgpu-reduce/src/index.ts

Readme w/ benchmark results: https://github.com/thi-ng/umbrella/tree/develop/examples/gpgpu-reduce

Ps. Can you please explain your "DataView w/ transform feedback" comment/approach? Not sure how this fits into this picture here... 😉

dleeftink commented 2 weeks ago

Thank you for taking the time, will have a bit of a play around to see how regl/@thi.ng-webgl compare in terms of setting up a parallel reduction pipeline. The performance profile of the example you provided seems promising enough!

Re: the DataView/transform feedback approach: see this cell on Observable which is part of a notebook where I compare various GPGPU libraries. Beware that you may have to hit the 'profile' button after all libraries have run once to get a proper comparison (I will have to clean up the examples some more, isolate the gl contexts and add proper cooldowns and disposal between runs).

In any case, the approach in the linked cell uses the forked WebGP library to write a sizeable array to an array of N by N textures, which are processed using a transform feedback mechanism. Instead of readPixels, you can construct a TypedArray directly from the result buffer, after which the max is found on the CPU and pushed to an output array. Here, reduction is thus applied in CPU- rather than GPU land, but afaik the transform feedback mechanism allows you to run multiple passes on the bound buffer before passing the data back to the CPU to reduce to a final result.

Both the 'subgpu' and 'supgpu' cells show that quite a fast turnaround can be achieved from the CPU > GPU > CPU using this approach. My thinking is to do as much work on the GPU (e.g. map/reduce) before handing over a relatively small array to the CPU to apply final processing (e.g., deriving an array of centroids from a large N x N matrix).

The relevant lines in the WebGP source:

https://github.com/glennirwin/webgp/blob/d6139188401fdce7379d83c705b77a105b0dfbe8/src/webgp.js#L393-L407

postspectacular commented 2 weeks ago

Thanks for this, I only just now realized that you were talking about the built-in WebGL2 transform feedback feature, whereas I previously assumed you meant some generic/custom mechanism 🤦 Alas, I have not yet had a need to use this feature and so also don't have any direct experience with it...

So I'm still trying to wrap my head around how this would work here & I know I'll have to do more reading about this (and read some of the code you linked to)... From the little understanding I have of createTransformFeedback(), my hunch is that you're proposing to perform all GPU processing in the vertex shader stage, rather than in the fragment shader as I've been doing so far? And then you'd use getBufferSubData() on a still bound vertex buffer instead of readPixels() on an FBO? Hmmm, if that is what you're talking about, then the approach would involve a major restructuring (or really a full rewrite) ... 🤔
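
For reference, the vertex-stage approach described in this comment can be sketched with plain WebGL2 calls. This is only an illustration under assumptions, not code from thi.ng/webgl or WebGP: the function name `squareOnGPU`, the shader sources and the per-element operation (squaring) are all made up for the example, and resource cleanup is omitted.

```typescript
// Vertex shader doing the actual computation, one invocation per element
const TF_VS = `#version 300 es
in float a_value;
out float v_result;
void main() {
  v_result = a_value * a_value;
}`;
// WebGL2 still requires a fragment shader to link, even if it never runs
const TF_FS = `#version 300 es
precision highp float;
void main() {}`;

function compileShader(gl: WebGL2RenderingContext, type: number, src: string): WebGLShader {
  const shader = gl.createShader(type)!;
  gl.shaderSource(shader, src);
  gl.compileShader(shader);
  if (!gl.getShaderParameter(shader, gl.COMPILE_STATUS)) {
    throw new Error(gl.getShaderInfoLog(shader) ?? "shader compilation failed");
  }
  return shader;
}

function squareOnGPU(gl: WebGL2RenderingContext, input: Float32Array): Float32Array {
  const program = gl.createProgram()!;
  gl.attachShader(program, compileShader(gl, gl.VERTEX_SHADER, TF_VS));
  gl.attachShader(program, compileShader(gl, gl.FRAGMENT_SHADER, TF_FS));
  // Captured outputs must be declared before linking
  gl.transformFeedbackVaryings(program, ["v_result"], gl.SEPARATE_ATTRIBS);
  gl.linkProgram(program);
  if (!gl.getProgramParameter(program, gl.LINK_STATUS)) {
    throw new Error(gl.getProgramInfoLog(program) ?? "program linking failed");
  }

  // Input buffer, exposed to the vertex shader as attribute `a_value`
  const vao = gl.createVertexArray();
  gl.bindVertexArray(vao);
  const inBuf = gl.createBuffer();
  gl.bindBuffer(gl.ARRAY_BUFFER, inBuf);
  gl.bufferData(gl.ARRAY_BUFFER, input, gl.STATIC_DRAW);
  const loc = gl.getAttribLocation(program, "a_value");
  gl.enableVertexAttribArray(loc);
  gl.vertexAttribPointer(loc, 1, gl.FLOAT, false, 0, 0);

  // Output buffer of the same byte size, filled by transform feedback
  const outBuf = gl.createBuffer();
  gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
  gl.bufferData(gl.ARRAY_BUFFER, input.byteLength, gl.DYNAMIC_READ);
  // A buffer being written by transform feedback must not be bound elsewhere
  gl.bindBuffer(gl.ARRAY_BUFFER, null);

  const tf = gl.createTransformFeedback();
  gl.bindTransformFeedback(gl.TRANSFORM_FEEDBACK, tf);
  gl.bindBufferBase(gl.TRANSFORM_FEEDBACK_BUFFER, 0, outBuf);

  gl.useProgram(program);
  gl.enable(gl.RASTERIZER_DISCARD); // skip rasterization, no fragments needed
  gl.beginTransformFeedback(gl.POINTS);
  gl.drawArrays(gl.POINTS, 0, input.length);
  gl.endTransformFeedback();
  gl.disable(gl.RASTERIZER_DISCARD);
  gl.bindTransformFeedback(gl.TRANSFORM_FEEDBACK, null);

  // Copy the captured values into a typed array (blocks until the GPU is done)
  const result = new Float32Array(input.length);
  gl.bindBuffer(gl.ARRAY_BUFFER, outBuf);
  gl.getBufferSubData(gl.ARRAY_BUFFER, 0, result);
  gl.bindBuffer(gl.ARRAY_BUFFER, null);
  return result;
}
```

In a browser this would be called with a context obtained from `canvas.getContext("webgl2")`. Note that the result lands in a plain buffer, so no FBO or texture is involved at any point.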

FWIW for texture sizes up to 64x64 (aka up to 4096 result values) the process of binding and reading an FBO takes 0.04-0.09 ms on my M1 (avg of 1000 iterations)... I'd say that's absolutely acceptable in relation to the main computation time...
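
A timing like the one quoted above can be taken with a small averaging helper. The sketch below is an assumption about how such a measurement might be set up, not the code used for the quoted numbers; `averageMs` is a hypothetical name. It only gives meaningful results for GPU work if the measured function ends with a blocking call such as `readPixels()`, since most other WebGL commands return before the GPU has finished.

```typescript
/**
 * Runs `fn` repeatedly and returns the mean wall-clock time per call in
 * milliseconds. `performance.now()` is available in browsers and in Node.
 */
function averageMs(fn: () => void, iterations = 1000): number {
  if (iterations < 1) throw new Error("iterations must be >= 1");
  fn(); // warm-up run, excluded from the measurement
  const t0 = performance.now();
  for (let i = 0; i < iterations; i++) {
    fn();
  }
  return (performance.now() - t0) / iterations;
}

// Usage with a stand-in CPU workload (replace with bind + readPixels in a browser)
const scratch = new Float32Array(64 * 64);
const avg = averageMs(() => scratch.fill(Math.random()), 1000);
console.log(`avg per iteration: ${avg.toFixed(4)} ms`);
```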

dleeftink commented 2 weeks ago

Yes, you summarised it better than I did. If major restructuring is required, then please ignore! I think the example is more than serviceable to demonstrate an important GPGPU use case.

Re: transform feedback, the following gist provides a concise example: https://gist.github.com/CodyJasonBennett/34c36b91719171c45ec50e850dc38a34

Although I haven't been able to get instancing to work fully yet for the above gist, this would theoretically allow you to achieve even greater speed-ups as described here: https://webgl2fundamentals.org/webgl/lessons/webgl-instanced-drawing.html

Beyond that, I do see value in providing a DataView onto a vertex buffer, as there is less copying involved. For 4096 values the difference might be negligible compared to readPixels(), but I am investigating how to quickly process 2**24/31 array elements on a first pass with an optional reduction on a second pass (TextEncoder() on CPU -> BytePair encoding on GPU -> BytePair frequency tables on GPU -> Sorting frequency tables on CPU).

Shader-ast seems of great help to implement this functionality.
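
As a point of comparison for the pipeline outlined above, the frequency-table stage can be written as a CPU reference in a few lines. This is a sketch under assumptions: it counts overlapping pairs of adjacent UTF-8 bytes into a 256x256 table and does not perform any BPE merge steps; the names `bytePairFrequencies` and `topPairs` are hypothetical. A GPU version would have to produce the same table to be considered correct.

```typescript
/** Counts every pair of adjacent UTF-8 bytes; table index is (first << 8) | second. */
function bytePairFrequencies(text: string): Uint32Array {
  const bytes = new TextEncoder().encode(text);
  const freq = new Uint32Array(256 * 256);
  for (let i = 0; i + 1 < bytes.length; i++) {
    freq[(bytes[i] << 8) | bytes[i + 1]]++;
  }
  return freq;
}

/** Returns the `n` most frequent pairs as [firstByte, secondByte, count]. */
function topPairs(freq: Uint32Array, n: number): Array<[number, number, number]> {
  const entries: Array<[number, number, number]> = [];
  freq.forEach((count, key) => {
    if (count > 0) entries.push([key >> 8, key & 0xff, count]);
  });
  // Final sort happens on the CPU, as in the pipeline described above
  return entries.sort((a, b) => b[2] - a[2]).slice(0, n);
}

// Usage: "abab" contains the pair (a, b) twice and (b, a) once
const table = bytePairFrequencies("abab");
const best = topPairs(table, 1)[0]; // [97, 98, 2]
```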

postspectacular commented 2 weeks ago

Again, thank you! I will try to take a look at these links over the weekend. Just some side notes here:

Instancing is fully supported by thi.ng/webgl, also (individually configurable) for passes in a multi-pass pipeline. You can find some instancing examples here (still have to extract some more interesting ones from other projects):
- https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-grid/
- https://github.com/thi-ng/umbrella/blob/develop/examples/webgl-ssao/

Re: 4096 values: by that I mean that if it's a proper (full) reduction, then you'd only ever have to read a single pixel (vec4); all other data would stay on the GPU and wouldn't have to be read back. The 4096 comes from the example I built earlier today, where I'm also reading out all the intermediate textures. But I'll try to do some experiments with that other approach too... 👍
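
The difference in readback size can be illustrated with a little arithmetic. The sketch below assumes a reduction that halves the texture side length in every pass (an assumption for illustration, the actual example may use a different factor); `passSizes` is a hypothetical helper.

```typescript
/** Side lengths of the textures from the first pass down to 1x1. */
function passSizes(size: number): number[] {
  if (!Number.isInteger(size) || size < 1 || (size & (size - 1)) !== 0) {
    throw new Error("size must be a power of two");
  }
  const sizes: number[] = [];
  for (let s = size; s >= 1; s >>= 1) {
    sizes.push(s);
  }
  return sizes;
}

const sides = passSizes(64); // [64, 32, 16, 8, 4, 2, 1]
const texelsInFirst = sides[0] * sides[0]; // 4096 texels in a 64x64 texture
const texelsInLast = sides[sides.length - 1] ** 2; // 1 texel, i.e. a single vec4
console.log(`${sides.length - 1} halving steps, ${texelsInFirst} vs ${texelsInLast} texel(s) to read`);
```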

thi-ng / umbrella

[webgl] Removal of GPGPU example #478