w3c / machine-learning-workshop

Site of W3C Workshop on Web & Machine Learning
https://www.w3.org/2020/06/machine-learning-workshop/

WebGPU fitness for ML frameworks #66

Open dontcallmedom opened 3 years ago

dontcallmedom commented 3 years ago

@jasonmayes raises the question of whether WebGPU exposes the right API surface to support ML frameworks' interactions with GPUs.

@jasonmayes, do you have a list of specific asks from the TFJS experience?

@grorg @tidoust any insights on this?

tidoust commented 3 years ago

Cc @Kangz

I'm afraid I don't have any insights on this for now.

Kangz commented 3 years ago

WebGPU provides compute shaders; by themselves they allow using "shared workgroup memory", which is nice but not the best you can do in native GPU ML today. Next are subgroup operations, which could become a WebGPU extension and which some people are already looking at. And finally there's cooperative matrix multiply (marketed as "tensor cores" by Nvidia): it might become a WebGPU extension if it gains support from more than one HW vendor.
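
For readers unfamiliar with the first of these, here is a minimal sketch of the shared-workgroup-memory pattern: a tree reduction that sums 256 floats per workgroup in fast shared memory. This is illustrative, not from the thread; buffer names and sizes are made up, `navigator.gpu` assumes a WebGPU-enabled browser, and the WGSL syntax follows the current spec rather than the 2020-era drafts.

```typescript
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({
  code: /* wgsl */ `
    @group(0) @binding(0) var<storage, read> src : array<f32>;
    @group(0) @binding(1) var<storage, read_write> partialSums : array<f32>;

    // Fast memory shared by the 256 invocations of one workgroup.
    var<workgroup> scratch : array<f32, 256>;

    @compute @workgroup_size(256)
    fn main(@builtin(local_invocation_id) lid : vec3<u32>,
            @builtin(global_invocation_id) gid : vec3<u32>) {
      scratch[lid.x] = src[gid.x];
      workgroupBarrier();
      // Halve the active invocations each step; all traffic stays in
      // workgroup-shared memory rather than global memory.
      for (var stride = 128u; stride > 0u; stride = stride / 2u) {
        if (lid.x < stride) {
          scratch[lid.x] += scratch[lid.x + stride];
        }
        workgroupBarrier();
      }
      if (lid.x == 0u) {
        partialSums[gid.x / 256u] = scratch[0];
      }
    }
  `,
});

const pipeline = device.createComputePipeline({
  layout: 'auto',
  compute: { module, entryPoint: 'main' },
});
```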

dontcallmedom commented 3 years ago

Thanks @Kangz, very useful! Here is the link to the current discussion on subgroup operations, for others' benefit.

jasonmayes commented 3 years ago

One thing we wanted to find out is whether there is a way to have garbage collection, much like JS currently has, for GPU-related activities too. Right now we have TF.tidy() to somewhat deal with releasing memory when finished, but newer users take time to realise it exists. It would be better if this were consistent with how JS generally functions: most JS devs do not even think about memory management, as they are used to the JS garbage collector doing its thing.
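
For context, here is a short sketch of the tidy() pattern being described, using the standard TF.js API (the tensor shapes are made up):

```typescript
import * as tf from '@tensorflow/tfjs';

// Without tidy(), every intermediate tensor below would pin GPU memory
// until dispose() is called explicitly; the JS garbage collector never
// sees that memory.
const result = tf.tidy(() => {
  const a = tf.randomNormal([1024, 1024]);
  const b = tf.randomNormal([1024, 1024]);
  const product = a.matMul(b); // intermediate: freed when tidy() returns
  return product.relu();       // the returned tensor survives the scope
});

result.print();
result.dispose(); // the caller still releases the result manually
```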

jasonmayes commented 3 years ago

Adding @pyu10055 @dsmilkov @nsthorat @annxingyuan @tafsiri @lina128 for any wish list items related to this topic too (TF.js team)

Kangz commented 3 years ago

> One thing we wanted to find out is whether there is a way to have garbage collection, much like JS currently has, for GPU-related activities too. Right now we have TF.tidy() to somewhat deal with releasing memory when finished, but newer users take time to realise it exists. It would be better if this were consistent with how JS generally functions: most JS devs do not even think about memory management, as they are used to the JS garbage collector doing its thing.

There isn't really a way to do automatic GC of GPU resources, and this can be seen in WebGL's gl.deleteX, WebGPU's resource.destroy(), or even ImageBitmap.close(). That's because a very small number of JS objects can hold on to large amounts of GPU memory. Either the GC knows about it and runs often to try to reclaim memory (bad for realtime applications and overall perf), or it only sees the small JS objects and lets the GPU objects leak. It's also not possible to trigger the JavaScript GC when the GPU runs out of memory, for many reasons, including that GPU objects live in a different process and can't round-trip to JS to ask for the GC to run.
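
To illustrate, a hedged sketch of the explicit-release pattern (the 256 MiB size is made up; `device` is a `GPUDevice` obtained as usual):

```typescript
// A few bytes of JS wrapper object can pin a very large GPU allocation,
// which is why WebGPU mirrors gl.deleteX / ImageBitmap.close() with an
// explicit destroy() rather than relying on garbage collection.
const bigBuffer: GPUBuffer = device.createBuffer({
  size: 256 * 1024 * 1024, // 256 MiB of GPU memory behind a tiny JS object
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});

// ... record and submit passes that use bigBuffer ...

// Freed deterministically; a GC would have no idea this object was big,
// and the GPU process cannot ask the JS heap to collect.
bigBuffer.destroy();
```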

anssiko commented 3 years ago

Newly added is a "SIMD operations in WebGPU for ML" talk by @mehmetoguzderin, discussing the proposed subgroup operations @Kangz mentioned.

@mehmetoguzderin feel free to provide your further perspectives in this issue for the workshop discussions. Also, please review the other workshop talks relevant to this issue, as well as the WebNN API spec and its open issues. In particular, see the WebNN API issue https://github.com/webmachinelearning/webnn/issues/6, which discusses custom ops using lower-level APIs such as WebGPU.

mehmetoguzderin commented 3 years ago

@anssiko WebNN is very interesting; I will have a look at it. And I will provide input in this repository for anything workshop-related. Thanks for the mention.

mehmetoguzderin commented 3 years ago

Sample code that uses SIMD operations is now available in the repository for my talk. For the speed benchmark chart that compares SIMD to alternative methods, please check the main README.md; for the code itself, please check the samples folder. (The code is written in Vulkan and GLSL, but is structured clearly enough to give a general idea): https://github.com/mehmetoguzderin/webgpu-20200828-simdgroup

wchao1115 commented 3 years ago

I'd like to offer a different take in response to @jasonmayes' question in his talk:

> What lower level support do we need for efficient ML when using the graphics card?

As we know, most meaningful ML acceleration is rooted in the underlying hardware, and the work to surface those capabilities has been concentrated in the OS layer, where the platform technology and the hardware drivers actually meet. This is true for Windows, Linux, Android, and macOS. It is done this way because hardware in the ecosystem is diverse, and hardware abstraction is a problem the OS is very good at solving.

WebNN is designed to give the web platform an ML-specific path to OS-native ML functionality that makes use of this hardware acceleration in a more consistent and manageable way. So instead of relying on low-level, general-purpose compute constructs such as WebGL or WebGPU shaders, an ML framework could leverage native ML constructs more directly through an ML-specific web API like WebNN, letting it carry out platform-specific acceleration in the OS layer under the hood.
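
As a hedged sketch of what that could look like from a framework's perspective (the surface below follows the evolving WebNN draft rather than the 2020-era proposal, so treat `MLGraphBuilder` and friends as illustrative rather than final):

```typescript
// Ops are declared at the ML level; the browser maps them to an OS API
// (e.g. DirectML on Windows), which picks the hardware-specific path.
const context = await navigator.ml.createContext({ deviceType: 'gpu' });
const builder = new MLGraphBuilder(context);

const x = builder.input('x', { dataType: 'float32', dimensions: [1, 1024] });
const w = builder.constant(
  { dataType: 'float32', dimensions: [1024, 1024] },
  new Float32Array(1024 * 1024), // weights would come from the model
);
const y = builder.relu(builder.matmul(x, w));

// The OS backend is free to fuse matmul+relu or use tensor cores here;
// the framework never writes a shader.
const graph = await builder.build({ y });
```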

In the case of DirectML, in addition to providing a highly optimized version of the compute-based ML implementation, it is an OS component and therefore has fine-grained access to the underlying compute drivers in the OS stack, which it uses to maximize runtime performance and reduce latency; when appropriate, it takes shortcuts to operation-specific capabilities based on what the hardware offers. As discussed in my talk, we've so far been reasonably successful with the integration of DirectML into both ONNX and TensorFlow. DirectML functionality can be mapped through WebNN.

jeffhammond commented 3 years ago

During the Zoom call, I asked whether subgroups were the right level at which to seek portability, and whether it might be better to target a DSL like Halide or MLIR as the portable abstraction layer.

The challenges of making anything at the level of OpenCL subgroups portable are:

  1. the long-standing differences in how SIMD and SIMT are implemented in CPU and GPU hardware, and the lack of consistency in e.g. shuffle instructions.
  2. the introduction of multidimensional SIMD instructions, e.g. NVIDIA Tensor Cores, Intel AMX and Apple AMX.

At least for some ML workloads, the second category is more useful, and a better target than vector operations.


mehmetoguzderin commented 3 years ago

Thanks a lot for the feedback, @jeffhammond. An essential aspect of the SIMD proposal for WebGPU is the restricted set of operations it exposes. For example, shuffle operations and indexed accesses don't exist at all; this stems from the concerns they raise, and from the fact that not all target native APIs have those operations.

As demonstrated in the sample I provided for this workshop, even with a safer subset that requires uniform control flow, the performance gain can approach 10x. As people said on the call, they want GPU execution time to be as short as possible, especially in embedded or mobile settings. SIMD operations enable that for very realistic use cases such as exploratory data analysis. And the rough edges of these operations are not that extreme (some driver bugs exist), given that atomics and writeable buffers are already available in WebGPU. I believe that if they are available in the MVP, people working on fantastic higher-level abstractions similar to Halide will squeeze out the benefit of SIMD operations and pass it on to users who can't invest the time to write SIMD reductions themselves. But even for those users, SIMD operations bring a benefit: when it comes to reduction, atomic operations only work on integers, whereas SIMD operations give access to more types and outperform atomics even on integers.
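
To make the reduction comparison concrete, here is a hedged WGSL sketch (held in a TypeScript string) based on the subgroup proposal; `subgroupAdd` and the subgroup built-ins are taken from that proposal and may differ from what ultimately ships:

```typescript
// Subgroup reduction works directly on f32; an atomic-based reduction
// would be restricted to integer accumulators (atomic<u32>/atomic<i32>).
const subgroupReduce = /* wgsl */ `
  enable subgroups;

  @group(0) @binding(0) var<storage, read> src : array<f32>;
  @group(0) @binding(1) var<storage, read_write> partials : array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>,
          @builtin(subgroup_invocation_id) lane : u32) {
    // One hardware-level reduction replaces a log2(N) shared-memory loop.
    let partial = subgroupAdd(src[gid.x]);
    if (lane == 0u) {
      // Each subgroup's first lane writes its partial; the zero-filled
      // slots of the other lanes are summed in a follow-up pass.
      partials[gid.x] = partial;
    }
  }
`;
```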

I think exposing tensor-core-like functionality is independent of the SIMD operations discussion, because those units are much more recent and their API surface is somewhat different.

kvark commented 3 years ago

For a structured capture of the WebGPU debate on subgroups, one can also have a look at argdown-plain and argdown-component views.