taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

[WebGL] Add OpenGL Fragment Shader Backend #1421

Closed archibate closed 3 years ago

archibate commented 4 years ago

Concisely describe the proposed feature: I would like to add an OpenGL fragment shader backend so that my poor laptop's NVIDIA card could get utilized:

root@archlinux ~/taichi (git)-[cc3] # lspci | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GT216M [GeForce GT 330M] (rev a2)
root@archlinux ~/taichi (git)-[cc3] # glewinfo | grep OpenGL
OpenGL version 3.3 (Compatibility Profile) Mesa 20.1.2 is supported

Currently we only support the OpenGL compute shader backend, which requires OpenGL 4.3+. But it would also be useful to add a fragment shader backend, which would help:

  1. Mac users who found the Metal backend extremely slow;
  2. Users with a very old card that doesn't support OpenGL 4.3+;
  3. Make it possible to run Taichi on WebGL.

Describe the solution you'd like (if any): Cons: in fact, fragment shaders don't even support atomic operations... not sure whether it's still possible to use them in a GPGPU way. Also, there's already an OpenGL compute shader backend, so I'm not sure whether it's worthwhile to add a fragment shader backend with only limited capabilities.

Additional comments: @yuanming-hu @k-ye Do you think this is worthwhile? If so, what's the priority? If not, feel free to close this without giving a reason.

k-ye commented 4 years ago

IIRC, we had a discussion at the very beginning about the OpenGL backend. The conclusion was that the limits of the fragment shader, some of which you've pointed out, probably make it impossible to implement Taichi on pre-4.3 OpenGL. (I do see that Halide supports OpenGL without compute shaders, but maybe its functional computation pattern doesn't rely so much on atomic ops or strong memory ordering. OTOH, Taichi is designed to handle mega-kernels that have much richer semantics.)

Mac users who found the Metal backend extremely slow;

This is indeed a problem. Fortunately, I think we've identified a poor usage of Metal's memory model, which is going to be fixed in #1415. With the managed memory storage mode plus fewer global float atomics, I hope Metal will get a sizable performance boost. E.g. the Zhihu example for calculating PI now runs in ~0.01s (excluding the first run, because that one does the JIT compilation).
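
For context, here is a minimal, self-contained sketch of such a PI kernel; this is my own reconstruction (using the current ti.field API), not necessarily the exact Zhihu example, and it just shows the single global float atomic add that this kind of kernel leans on:

import taichi as ti

# Sketch of a Monte Carlo PI estimate; N and the field name `hits` are
# made up for illustration. The point is the global float atomic add on
# hits[None], which is the kind of global atomic that #1415 aims to reduce.
ti.init(arch=ti.gpu)

N = 1000000
hits = ti.field(dtype=ti.f32, shape=())

@ti.kernel
def sample():
    for i in range(N):
        px = ti.random()
        py = ti.random()
        if px * px + py * py < 1.0:
            hits[None] += 1.0  # global float atomic add

sample()
print('pi ~=', 4.0 * hits[None] / N)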

Make it possible to run Taichi on WebGL.

IMO this is definitely an exciting path... I found that WebGL 2.0 claims to support compute shaders (e.g. https://github.com/9ballsyndrome/WebGL_Compute_shader), but I guess it's still at a very early stage and there could be lots of pitfalls were we to try this path now. Maybe the wise thing is to wait for it to become more mature...

what's the priority?

I think many people have already mentioned that running Taichi in a browser would be very awesome.. Just one random idea: C/LLVM -> WASM?

yuanming-hu commented 4 years ago

Thanks for proposing this. I don't think it will be easy, as @k-ye mentioned: fragment shaders have very limited computational capability.

WebGL 2.0 is still premature, but I think LLVM -> WASM/JS sounds like a reasonable solution for running Taichi in browsers.

archibate commented 4 years ago

Btw, what do we mean by running Taichi in browsers? Does it mean we run compiled JavaScript in the browser, or the Taichi Python frontend in the browser?

yuanming-hu commented 4 years ago

Just run the compiled JavaScript. Basically it's a "player" for pre-compiled Taichi kernels (in JS/WASM).

archibate commented 4 years ago

I think it's still good and possible to have a FS backend even if atomics are not supported; we can have ti.extensions.atomic for that case. Some tweaks could be applied to make mpm88 functional on non-atomic backends like the OpenGL FS.

archibate commented 4 years ago

Case 1:

for i in x:
  x[i] += v[i] * dt

Since all atomic destinations are independent (each iteration writes only its own x[i], with no reads and no overlap in i), we can actually demote this atomic operation in the Taichi middle-end.
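
As a sketch of what the demoted form amounts to (the field shapes and dt value here are made up), the += becomes an ordinary read-modify-write per element:

import taichi as ti

# Sketch of Case 1 after atomic demotion; shapes and dt are for illustration.
ti.init(arch=ti.cpu)

N = 8
dt = 0.1
x = ti.field(ti.f32, shape=N)
v = ti.field(ti.f32, shape=N)

@ti.kernel
def advance():
    for i in x:
        # each parallel iteration touches only its own x[i], so the += needs
        # no atomicity and can be lowered to a plain load + add + store
        x[i] = x[i] + v[i] * dt

advance()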


Case 2:

for i in x:
  p = int(x[i] * inv_dx)
  grid_v[p] += v[i]

Here the destination location is non-trivial and may be overlapped n times, but there are no reads of grid_v during this offload. So instead of accumulating from x into grid_v (scatter), we collect into grid_v from x (gather):

for p in ti.grouped(grid_v):
  r = 0.0
  for i in range(N):
    if int(x[i] * inv_dx) == p:  # not sure if there's a better way to do this spatial lookup quickly
      r += v[i]
  grid_v[p] = r

Not sure whether it's possible to have the middle-end do this transform automatically, though; note that the gather version scans all N particles for every grid cell, so it trades atomics for extra work unless some spatial acceleration structure is added.
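
To make the idea concrete, here is a self-contained sketch contrasting the scatter and gather forms. The sizes and the 1-D layout are made up for illustration (the real mpm88 grid is 2-D and would use ti.grouped as above); this is not the actual middle-end transform, just the idea.

import taichi as ti

ti.init(arch=ti.cpu)

N = 64          # number of particles (illustrative)
n_grid = 16     # grid resolution (illustrative)
inv_dx = float(n_grid)

x = ti.field(ti.f32, shape=N)            # particle positions in [0, 1)
v = ti.field(ti.f32, shape=N)            # particle velocities
grid_v = ti.field(ti.f32, shape=n_grid)  # grid velocities

@ti.kernel
def init():
    for i in x:
        x[i] = ti.random()
        v[i] = ti.random()

@ti.kernel
def p2g_scatter():
    # original form: needs an atomic += because many i may hit the same cell p
    for i in x:
        p = int(x[i] * inv_dx)
        grid_v[p] += v[i]

@ti.kernel
def p2g_gather():
    # transformed form: one thread per cell, a plain write, no atomics,
    # at the cost of every cell scanning all N particles
    for p in grid_v:
        r = 0.0
        for i in range(N):
            if int(x[i] * inv_dx) == p:
                r += v[i]
        grid_v[p] = r

init()
p2g_gather()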