dontcallmedom opened 4 years ago
The idea is that both WebNN and Model Loader APIs can be detected by:
if ("ml" in navigator) { ... }
Both APIs should be polyfillable with JavaScript or WASM implementations. Since the main goal of the ML APIs is to accelerate model execution, it's expected that polyfills will often be too slow to be practical. In that case, it will be up to developers to ensure graceful degradation.
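For illustration, a minimal degradation sketch on top of that check (createWebNNBackend() and createWasmBackend() are hypothetical placeholders, not part of either API):

```js
// Minimal sketch (not from the spec): prefer the native API when present, otherwise
// try a JS/Wasm polyfill, and finally degrade by disabling the ML-powered feature.
async function pickBackend() {
  if ('ml' in navigator) {
    return createWebNNBackend();        // hardware-accelerated path
  }
  try {
    return await createWasmBackend();   // slower polyfill path
  } catch {
    return null;                        // graceful degradation: feature disabled
  }
}
```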
Some strategies:
Thanks! I understand that `"ml" in navigator` can be used as a first layer of feature detection, but I was wondering about a lower-level type of feature detection - e.g. would there be a way to determine ahead of time how well or poorly a given model would operate, e.g. given the available hardware, or given how operations are hardware-accelerated, etc.?
Ah, got it. That's likely to depend on the specific ML model. It's possible to write some general-purpose JavaScript to parse the model and inspect the operations in the graph to determine whether any of them would be especially slow, like convolution. It would involve some heuristics.
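For example, something along these lines could work against a TensorFlow.js-style model.json (the modelTopology.node[].op layout and the list of "heavy" ops are assumptions for illustration):

```js
// Sketch only: heuristically flag potentially slow operations by inspecting a
// graph model's manifest before deciding which model variant to load.
const HEAVY_OPS = new Set(['Conv2D', 'DepthwiseConv2dNative', 'MatMul']);

async function estimateModelCost(modelJsonUrl) {
  const manifest = await (await fetch(modelJsonUrl)).json();
  const nodes = manifest.modelTopology?.node ?? [];
  const heavy = nodes.filter(n => HEAVY_OPS.has(n.op));
  return { totalOps: nodes.length, heavyOps: heavy.length };
}

// Usage idea: pick a lighter model if too many compute-intensive ops are present.
// estimateModelCost('/models/pose/model.json')
//   .then(({ heavyOps }) => heavyOps > 10 ? loadLiteModel() : loadFullModel());
```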
At least to start, there will be some diversity in hardware and browser support, and that will complicate things. Eventually, hopefully, it will be just 2 or 3 options: latest browser on latest hardware, latest browser on older hardware, older browser.
With regard to your 3-classes strategy (latest/latest, latest/older, older/older) - I wonder if the current pace of hardware evolution may make this too simple for the foreseeable future.
E.g. within a given "latest" generation of mobile devices, high-end devices might have a fancy ML core with lots of hardware-accelerated operations, whereas lower-end devices would not; as a developer, you would likely want to load a different model (a slow operation in the first place) depending on how many of the operations are available on a given device.
Let me describe other strategies that have been used elsewhere in the platform to see if they suggest analogies in the ML model space:
There may be other patterns worth comparing with; and many of these patterns raise tricky privacy & fingerprinting questions.
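For context, here are a few feature-detection patterns the platform already exposes, purely as illustrative analogies (not necessarily the ones linked above):

```js
// Media: ask whether a codec/container is likely to play before fetching it.
const canWebm = document.createElement('video').canPlayType('video/webm; codecs="vp9"');

// WebGL: enumerate optional extensions the GPU/driver actually supports.
const gl = document.createElement('canvas').getContext('webgl');
const extensions = gl ? gl.getSupportedExtensions() : [];

// Coarse hardware hints, useful for picking a heavier or lighter code path.
const cores = navigator.hardwareConcurrency || 1;
```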
Great, thanks for the links! Agree it will be some time (several years?) before only 3 classes will be enough.
This is a hard topic in general, but particularly hard for ML because of the nature of its diversity: not only at the semantic level, with different operations in the network, but also, as you pointed out, at the implementation level, where hardware support almost always determines the bottom-line performance of a given model.
At the semantic level, ONNX's versioning strategy has been to evolve the individual operations independently but version them together collectively, the so-called opset model. This mechanism allows the runtime to match the opset a model was created for against the opsets the runtime actually supports. And since opsets only evolve forward, a runtime can maintain full backward compatibility with models targeting older opsets with reasonable clarity. This is not a perfect system, but at least it answers the question "Can this model be supported?" very efficiently, at least at the semantic level.
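A toy illustration of that opset check (this is not ONNX Runtime's actual API; the capability table and input shape are assumptions):

```js
// A runtime that supports opset N for a domain can execute any model declaring
// opset <= N for that domain, because opsets only evolve forward.
const RUNTIME_SUPPORTED_OPSETS = { 'ai.onnx': 17 };  // assumed capability table

function canRunModel(opsetImports /* e.g. [{ domain: 'ai.onnx', version: 13 }] */) {
  return opsetImports.every(({ domain, version }) =>
    (RUNTIME_SUPPORTED_OPSETS[domain ?? 'ai.onnx'] ?? -1) >= version);
}
```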
The second level of feature detection is data type support. This often requires hardware detection, as in the case of GPUs. A model of a certain opset may be semantically supported by a runtime, but some of the operations in the network require a specialized data type, e.g. FP16, int8, or bfloat16. This would almost certainly require querying the hardware for its native support. Depending on the hardware architecture, silently falling back to a data type other than the requested one can hurt not only the model's execution performance but also its prediction accuracy, due to the unexpected change in computational precision. Models with specialized data types have become more prevalent these days, for reasons ranging from ease of deployment to memory requirements and execution efficiency. This level of detection is often best carried out by the application framework or the browser through system calls.
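As a rough sketch of such a query from the web platform side, WebGPU can at least report native FP16 shader support; whether that is a good proxy for the accelerator's data-type support for a given model is an assumption:

```js
// Detect FP16 support via WebGPU before choosing an FP16-quantized model.
async function supportsFp16() {
  if (!('gpu' in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return !!adapter && adapter.features.has('shader-f16');
}
```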
And perhaps the third level of detection is for performance requirements. A model designed to run close to the perceptual frame rate, e.g. for computer vision, may become completely unusable in the eyes of the users if it's unable to attain that level of throughput. Conversely, a model running in a low-bandwidth background environment may even prefer a longer execution time if that means requiring less computing power or consuming less memory; examples include background tasks that predict user behavior in order to optimize the device's power consumption. One strategy often used for this kind of situation is usage profiles or execution policies; trying to perform runtime analysis with throttling can become very non-deterministic and self-defeating.
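As a sketch of the execution-policy idea, WebNN's draft context options expose a power preference hint (exact option names and availability may differ across spec versions):

```js
// A background, battery-friendly task hints "low-power"; a frame-rate-bound task
// hints "high-performance". The implementation maps the hint to hardware scheduling.
async function createContexts() {
  const background = await navigator.ml.createContext({ powerPreference: 'low-power' });
  const realtime = await navigator.ml.createContext({ powerPreference: 'high-performance' });
  return { background, realtime };
}
```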
I should have also mentioned @jasonmayes' discussion of in-session progressive enhancement where an app would start with a fast-to-load model before replacing it with a heavier but more accurate one.
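In code, that pattern could look roughly like this (loadModel() and the model URLs are placeholders, not a real API):

```js
// In-session progressive enhancement: start with a small, fast-to-load model, then
// swap in a heavier, more accurate one once it has finished loading.
let activeModel = null;

async function initModels() {
  activeModel = await loadModel('/models/detector-lite.bin');   // fast to load
  loadModel('/models/detector-full.bin')                        // heavier, more accurate
    .then(full => { activeModel = full; })                      // upgrade mid-session
    .catch(() => { /* keep the lite model if the upgrade fails */ });
}
```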
According to the framework use cases of WebNN, a JavaScript ML framework should be able to detect the operations supported by WebNN and adapt accordingly.
In practice, based on the detected operations, a framework could partition the model, create sub-graphs that contain the supported operations, and delegate them to WebNN for hardware acceleration. As mentioned in slide 9 of @jbingham's talk, we found that even running one or two compute-intensive operations of a model through WebNN can lead to much faster performance. Performance could improve progressively as WebNN supports more operations.
For unsupported operations, the framework can fall back to kernels written in WebAssembly or WebGL/WebGPU. Graceful degradation would require an efficient tensor data exchange mechanism between WebNN graph execution and WebAssembly or WebGL/WebGPU kernel execution. Coordination with the WebAssembly and WebGPU groups would be needed.
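A very simplified sketch of such partitioning (webnnSupports() is a placeholder for whatever op-support detection a framework implements):

```js
// Split a linear chain of ops into segments the WebNN backend claims to support and
// segments that must fall back to Wasm/WebGPU kernels.
function partition(ops, webnnSupports) {
  const segments = [];
  for (const op of ops) {
    const backend = webnnSupports(op.type) ? 'webnn' : 'fallback';
    const last = segments[segments.length - 1];
    if (last && last.backend === backend) last.ops.push(op);
    else segments.push({ backend, ops: [op] });
  }
  return segments;  // execute each segment on its backend, exchanging tensors at boundaries
}
```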
There are more details in the custom operations discussion in the WebNN GitHub repo.
Adding @wchao1115 @gramalingam @RafaelCintron @pyu10055 @dsmilkov @nsthorat and @anssiko for thoughts and visibility.
Re fallback mechanisms if a specific op is not supported natively: we ran an experiment a while back that partitions a simple neural network into sub-graphs for execution across WebNN API and WebGPU API.
The results seem to suggest (@huningxin to correct me) there's a good speedup even if WebNN outputs had to be uploaded to GPUBuffer in user code to fuse these two APIs together. Further performance improvement was observed if WebNN API could write to a GPUBuffer directly.
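For reference, the "upload in user code" step is roughly the following, using standard WebGPU calls (the WebNN side, a Float32Array of results, is assumed):

```js
// Copy WebNN graph outputs into a GPUBuffer so a WebGPU compute shader can consume
// them. The round trip through the CPU is the overhead that direct GPUBuffer output
// from WebNN would avoid.
function uploadToGpu(device, results /* Float32Array from WebNN graph output */) {
  const buffer = device.createBuffer({
    size: results.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buffer, 0, results);
  return buffer;
}
```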
The actionable things from this experiment for consideration re Web API surface and coordination at that time were:
- ability to create the WebNN context from a WebGPU `GPUDevice`
- ability to use `GPUBuffer` for graph inputs and outputs

To make this feasible the data exchange between the WebNN API built-in ops and custom ops (in WebGPU compute shader, Wasm) must be performant.
WebNN-WebGPU interop would be a partial solution to the question of progressive enhancement / graceful degradation. It would address the case of a GPU-powered device whose browser has a partial implementation of the WebNN API (which is expected to evolve over time) alongside a WebGPU API implementation (which is expected to remain more stable once it reaches feature completeness).
Our experience building DirectML and the frameworks that use it has been that falling back through deep interop or custom operators is rarely the path ML developers want to take. Both ONNX Runtime and WinML have defined mechanisms in the frameworks to accept custom operators written by developers for when the standard operators are insufficient for their use cases. But it was never used, or at least there is no known use case of it that we know of today. So I have a healthy amount of skepticism around WebNN/WebGPU interop as a practical tool for ML scenarios. The complexity involved in creating this extensibility point in the API and the maintenance cost of this contract over time are very significant, with virtually no proven use case in reality.
A much more promising path, and one that is used much more frequently, is operator composition, i.e. breaking down a bigger operator into graphs of smaller ones when the big one isn't sufficiently supported by the framework. Performance-wise this is also an approach with much less friction and stutter, since the currency that flows through the graph is still the same currency, managed in a uniform way throughout, as opposed to interop readbacks between buffers, which can still be very expensive even between two different GPU resources due to synchronization and scheduling needs.
We heavily leverage operator composition in our work to layer TensorFlow on DirectML. With literally over a thousand kernels implemented in TensorFlow, it is virtually impossible to rewrite everything onto a new backend. Take RNN for example: there are so many ways to slightly extend it, but no one would want to write the entire implementation of RNN from scratch every time a small extension is needed. With composition it's relatively easy to extend what already exists without the significant cost of testing an entirely new custom kernel.
WebNN is designed with operator composition in mind from the beginning because it needs to address the issue of API longevity. Its low-level operators, e.g. matmul, are meant to serve as building blocks of neural network graphs, present and future, while the high-level operators are defined with performance and ease of use in mind.
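To illustrate the idea (a hedged sketch; exact builder method names and signatures depend on the spec version), a high-level op like softmax could be composed from lower-level WebNN building blocks along these lines:

```js
// softmax(x) = exp(x - max(x)) / sum(exp(x - max(x))), reduced over the last axis.
// If a high-level op isn't available, express it with lower-level primitives.
function softmaxByComposition(builder, x /* a 2-D MLOperand */) {
  const max = builder.reduceMax(x, { axes: [1], keepDimensions: true });
  const exp = builder.exp(builder.sub(x, max));
  const sum = builder.reduceSum(exp, { axes: [1], keepDimensions: true });
  return builder.div(exp, sum);
}
```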
In my talk I allude to a long-standing approach in Web technologies of giving developers tools that ideally allow for progressive enhancement (the ability to bring more features as optional improvements on more powerful devices and browsers), or fall back to graceful degradation (providing a fallback when more advanced or powerful features aren't available).
Many talks highlight the fact that ML is moving fast, with a set of core primitives rapidly evolving.
In his talk, @miaowang14 highlights the strategies the Android NN API has taken toward backwards compatibility and the growth in the operators provided by the API.
How much discussion has there been on feature detection in the context of the WebNN API and the Model Loader API, and how confident are we that it can be used for progressive enhancement / graceful degradation? @huningxin @jbingham