gpuweb / gpuweb

Where the GPU for the Web work happens!
http://webgpu.io

WASM multi-threading ability #476

Open kvark opened 4 years ago

kvark commented 4 years ago

This issue is closely related to #354 but approached from the WASM angle. Edit: this is an investigation by someone who doesn't have a lot of JS/WASM experience, so please take it with a grain of salt and provide your corrections.

Problem statement

In JS, in order to pass a serializable object from one worker to another, one would call postMessage() in the sender and then handle the resulting message event (onmessage) in the receiver. The message is added to the queue of the receiving worker and processed only after the receiver finishes its current task (the current call stack) as well as all previously queued messages.

This model may be sufficient in (and is natural to) JS applications. In programs compiled for the Web via WASM, however, once there is a value on one thread, any other thread can access it. This implicit sharing is supported by Vulkan, D3D12, and Metal, and is generally what our future users coming from native development would expect.

The problem is that, in this case, there is no place or hook where the message-passing JS glue could be inserted.

Use cases

One of the use cases would be having a "streaming" thread that loads in some level resources, creates WebGPU objects from them (buffers, textures, individual mipmap levels, etc), which are then used by the rendering thread as soon as they come.

Another, more general example is having multiple threads process some sort of render graph and build different display lists: one for shadow rendering, one for the main screen, and so on. There is room to construct render bundles for any of them.

Solution proposals

Asynchronous API

One option is to force users to be aware of the JS workers' event loops and do all the asynchronous message passing the same way it's done in JS. This is the least convenient option and may require an architectural redesign of the client software.

Synchronous receive

If there was a way to receive a message synchronously (without waiting for the end of the stack frame), we could have some sort of a synchronous native API that handles the transition, e.g.

// on the producer thread:
auto buffer = device.createBuffer(...);
auto sharedBuffer = wgpuShare(buffer, SOME_THREAD_ID); // the JS glue would `postMessage` here
// on another thread identified by SOME_THREAD_ID:
auto buffer = wgpuAccess(sharedBuffer); // the JS glue would use some way of synchronous message receiving
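To make the intended semantics concrete, here is a minimal native sketch of the blocking hand-off. It assumes a hypothetical `HandoffMailbox` type standing in for the JS glue; `BufferHandle`, `share`, `access`, and `pending` are illustrative names and not part of any WebGPU or Emscripten API. The blocking `access` call is the piece that would need a synchronous receive on the Web.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a WebGPU object handle; not a real API type.
struct BufferHandle {
    std::uint32_t id;
};

// Hypothetical mailbox owned by the receiving thread.
class HandoffMailbox {
public:
    // Producer side: the point where the glue would postMessage.
    void share(BufferHandle handle) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(handle);
        }
        ready_.notify_one();
    }

    // Consumer side: blocks until a handle has been shared, then returns it.
    BufferHandle access() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        assert(!queue_.empty());
        BufferHandle handle = queue_.front();
        queue_.pop_front();
        return handle;
    }

    // Number of handles shared but not yet received.
    std::size_t pending() {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<BufferHandle> queue_;
};
```

Handles come out in the order they were shared, and the receiver does not have to return to its event loop to get them.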

Shared identifier tables

The idea is to essentially represent WebGPU objects as "IDs" that are just numbers and therefore can be copied around and/or used on different threads. In order to actually access an object, the glue code would then have to access some sort of a shared (between threads/workers) table, using that index.

One approach to implement this table would be using SharedArrayBuffer, since this object is already sharable in JS. The glue code would then:

From the user perspective, all the objects become instantly available on other threads. This comes at the cost of resolving the index on accessing each and every object from WASM.
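As a rough illustration of the ID-based scheme, the sketch below models the shared table natively with a mutex-protected vector. `SharedIdTable`, `ObjectRecord`, `insert`, and `resolve` are hypothetical names. A real SharedArrayBuffer only holds raw bytes, so actual glue code would still need some way to turn a record into a worker-local object; that step is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of a GPU object, as the glue might store it.
struct ObjectRecord {
    std::string label;
    std::uint64_t size_bytes;
};

using ObjectId = std::uint32_t;

// Hypothetical table shared between all threads.
class SharedIdTable {
public:
    // Registers an object and returns a plain number that any thread can copy.
    ObjectId insert(ObjectRecord record) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(records_.size() < UINT32_MAX);  // IDs must fit in ObjectId
        records_.push_back(std::move(record));
        return static_cast<ObjectId>(records_.size() - 1);
    }

    // Resolves an ID on whichever thread asks. This lookup is the per-access
    // cost mentioned above. Returns false for an unknown ID.
    bool resolve(ObjectId id, ObjectRecord& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        if (id >= records_.size()) {
            return false;
        }
        out = records_[id];
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::vector<ObjectRecord> records_;
};
```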

WASM Tables

There is an emerging WASM construct, Table, that can make the shared-table approach more efficient: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/WebAssembly/Table

It's currently limited to references only, but the WASM WG would be open to expanding it to generic objects. If we decide to go down this path, I'm told that the WASM WG can prioritize the development of Table to suit our needs better.

The benefit of using Table versus SharedArrayBuffer is having fewer round-trips to JS land, since Table is going to be natively supported by WASM.

kdashg commented 4 years ago

bz gave me some indication that a synchronous recvMessage could be palatable, which might solve most of our problems.


For porting of existing content, we need a solution for a pull-only system, where the destination worker/thread doesn't need to be known ahead of time. For sharable objects, the porting layer could intercept object creation calls and preemptively publish a shareable object to all registered destination workers. For any transferable objects (do we have any anymore?), I don't think we have a good story to offer.

I don't think we need our API to switch to using int handles universally. Porting layers (like emscripten's OpenGL shim) can generally take care of mapping int handles into opaque objects with relative efficiency. This would also be the place where a porting layer would acquire a worker-local reference to a shared resource if it hasn't been created yet. It seems possible to me, so long as we have some synchronous method for "pulling"/opening a shared resource reference on a worker.

One thing we need to be careful about with an OpenResourceByHandle API is containment failure from other components or workers guessing handle values. However, if we make the handles (shareable-)device-specific, I think we're fine.
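A minimal sketch of what device-specific handles could look like, under the assumption that each shareable device keeps its own handle namespace. `ShareableDevice`, `publish`, and `open_by_handle` are hypothetical names used only for illustration; nothing here is a proposed API.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

using ResourceHandle = std::uint32_t;

// Hypothetical shareable device that owns its own handle namespace.
class ShareableDevice {
public:
    // Publishes a resource and returns a handle valid only on this device.
    ResourceHandle publish(const std::string& label) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(next_handle_ != 0);  // 0 is reserved; also catches wrap-around
        ResourceHandle handle = next_handle_++;
        resources_[handle] = label;
        return handle;
    }

    // Succeeds only if this particular device published `handle`.
    bool open_by_handle(ResourceHandle handle, std::string& label_out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = resources_.find(handle);
        if (it == resources_.end()) {
            return false;
        }
        label_out = it->second;
        return true;
    }

private:
    mutable std::mutex mutex_;
    ResourceHandle next_handle_ = 1;
    std::map<ResourceHandle, std::string> resources_;
};
```

With this shape, guessing a handle value only helps a caller that already holds a reference to the device in question, so a guess cannot reach resources of a device the caller was never given.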


In talking with some JS people, "can we share JS objects" was terrifying to them, but if we phrase things in terms of well-defined structured cloning, I think we should be on more stable ground:

SharedCloneDict

A shared-like-SharedArrayBuffer dictionary where get/set for a given key would deserialize/serialize an object as a value. If the value object were a "Cloneable reference" (shareable) type, this would skip the postMessage dance.

One advantage of this is that there's some belief that this is more generally useful, and not something specific or only needed by our API. (so there may be some interest in pushing this forward from additional people outside this group)

This could be used as a sharing mechanism, and when WASM gets Tables, loading from the SharedCloneDict into a worker-local Table could preserve the perf optimizations of Table APIs.
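The get/set semantics can be sketched natively as follows. This is only a model of the idea, assuming a plain byte copy in place of structured cloning; `SharedCloneDictSketch` and `TextureInfo` are hypothetical names and the value type is restricted to one trivially copyable struct to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <mutex>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical value type; stands in for any cloneable object.
struct TextureInfo {
    std::uint32_t width;
    std::uint32_t height;
    std::uint32_t mip_levels;
};
static_assert(std::is_trivially_copyable<TextureInfo>::value,
              "the byte-copy 'serialization' below requires this");

// Hypothetical dictionary shared between threads. Values are stored as
// bytes, so every get() hands back an independent copy.
class SharedCloneDictSketch {
public:
    // "Serializes" the value into the shared store.
    void set(const std::string& key, const TextureInfo& value) {
        std::vector<unsigned char> bytes(sizeof(TextureInfo));
        std::memcpy(bytes.data(), &value, sizeof(TextureInfo));
        std::lock_guard<std::mutex> lock(mutex_);
        entries_[key] = std::move(bytes);
    }

    // "Deserializes" a fresh copy; returns false if the key is absent.
    bool get(const std::string& key, TextureInfo& out) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            return false;
        }
        assert(it->second.size() == sizeof(TextureInfo));
        std::memcpy(&out, it->second.data(), sizeof(TextureInfo));
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, std::vector<unsigned char>> entries_;
};
```

Because the reader always gets a copy, later changes to the writer's object are not visible through the dictionary, which is the property that keeps this closer to structured cloning than to shared JS objects.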

magcius commented 4 years ago

bz gave me some indication that a synchronous recvMessage could be palatable, which might solve most of our problems.

Is there any discussion about this you can point me to? It would be a huge help for anyone wanting to use Web Workers as a porting target, not just for graphics.

juj commented 4 years ago

bz gave me some indication that a synchronous recvMessage could be palatable, which might solve most of our problems.

This would be interesting if this is the case! Alon and I have been asking for a sync recvMessage in Emscripten since circa 2014-2015. There are plenty of extra use cases we'd have for it, if that ever becomes a thing!

I don't think we need our API to switch to using int handles universally. Porting layers (like emscripten's OpenGL shim) can generally take care of mapping int handles into opaque objects with relative efficiency.

I was looking at WebGPU Emscripten code today, and my feeling about these mapping layers we write is somewhat the opposite: we can handle mappings with relative _in_efficiency. Currently marshalling the WebGPU descriptor tables is much more awkward than it is with WebGL, as it requires nested JS object tree creation, refcounting, some string comparisons, and it will also mean that the descriptors cannot be shared across Wasm pthreads. WebGL is much better in the marshalling aspect, since it is low level C-like, whereas WebGPU is a higher level API. (also WebGL operation can be managed quite well garbage-free, whereas the Wasm<->JS mapping causes tons of per-frame garbage in hot WebGPU paths)